1.
PLoS Comput Biol ; 19(7): e1011272, 2023 07.
Article in English | MEDLINE | ID: mdl-37471333

ABSTRACT

Some scientific studies involve huge amounts of bioinformatics data that cannot be analyzed on personal computers usually employed by researchers for day-to-day activities but rather necessitate effective computational infrastructures that can work in a distributed way. For this purpose, distributed computing systems have become useful tools to analyze large amounts of bioinformatics data and to generate relevant results on virtual environments, where software can be executed for hours or even days without affecting the personal computer or laptop of a researcher. Even if distributed computing resources have become pivotal in multiple bioinformatics laboratories, often researchers and students use them in the wrong ways, making mistakes that can cause the distributed computers to underperform or that can even generate wrong outcomes. In this context, we present here ten quick tips for the usage of Apache Spark distributed computing systems for bioinformatics analyses: ten simple guidelines that, if taken into account, can help users avoid common mistakes and can help them run their bioinformatics analyses smoothly. Even if we designed our recommendations for beginners and students, they should be followed by experts too. We think our quick tips can help anyone make use of Apache Spark distributed computing systems more efficiently and ultimately help generate better, more reliable scientific results.


Subject(s)
Computational Biology , Software , Humans , Computational Biology/methods , Computers , Computer Communication Networks
2.
BMC Bioinformatics ; 23(1): 474, 2022 Nov 11.
Article in English | MEDLINE | ID: mdl-36368948

ABSTRACT

BACKGROUND: Huge amounts of molecular interaction data are continuously produced and stored in public databases. Although many bioinformatics tools have been proposed in the literature for their analysis, based on modeling them as different types of biological networks, several problems remain unsolved when the analysis is performed at large scale. RESULTS: We propose DIAMIN, a high-level software library that facilitates the development of applications for the efficient analysis of large-scale molecular interaction networks. DIAMIN relies on distributed computing and is implemented in Java on top of the Apache Spark framework. It delivers a set of functionalities implementing different tasks on an abstract representation of very large graphs, providing built-in support for methods and algorithms commonly used to analyze these networks. DIAMIN has been tested on data retrieved from two of the most widely used molecular interaction databases and proved to be highly efficient and scalable. As shown by the examples provided, DIAMIN can be used without any distributed programming experience to perform various types of data analysis and to implement new algorithms based on its primitives. CONCLUSIONS: DIAMIN has proved successful in allowing users to solve specific biological problems that can be modeled as biological networks. The software is freely available, which will hopefully allow its rapid diffusion through the scientific community, to solve both specific data analysis and more complex tasks.


Subject(s)
Computational Biology , Software , Computational Biology/methods , Algorithms , Databases, Factual , Gene Library
4.
Bioinformatics ; 38(4): 925-932, 2022 01 27.
Article in English | MEDLINE | ID: mdl-34718420

ABSTRACT

MOTIVATION: Alignment-free (AF) distance/similarity functions are a key tool for sequence analysis. Experimental studies on real datasets abound and, to some extent, there are also studies regarding their control of false positive rate (Type I error). However, assessment of their power, i.e. their ability to identify true similarity, has been limited to some members of the D2 family. The corresponding experimental studies have concentrated on short sequences, a scenario no longer adequate for current applications, where sequence lengths may vary considerably. Such a State of the Art is methodologically problematic, since information regarding a key feature such as power is either missing or limited. RESULTS: By concentrating on a representative set of word-frequency-based AF functions, we perform the first coherent and uniform evaluation of the power, involving also Type I error for completeness. Two alternative models of important genomic features (CIS Regulatory Modules and Horizontal Gene Transfer), a wide range of sequence lengths from a few thousand to millions, and different values of k have been used. As a result, we provide a characterization of those AF functions that is novel and informative. Indeed, we identify weak and strong points of each function considered, which may be used as a guide to choose one for analysis tasks. Remarkably, of the 15 functions that we have considered, only four stand out, with small differences between small and short sequence length scenarios. Finally, to encourage the use of our methodology for validation of future AF functions, the Big Data platform supporting it is public. AVAILABILITY AND IMPLEMENTATION: The software is available at: https://github.com/pipp8/power_statistics. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
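
For readers unfamiliar with the D2 family mentioned in the abstract above, the following is a minimal Python sketch of the basic D2 statistic. It is an illustration only, not the code evaluated in the study: D2 is taken here as the inner product of the k-mer count vectors of two sequences, and the function names (`kmer_counts`, `d2`) are assumptions made for this sketch.

```python
from collections import Counter


def kmer_counts(seq: str, k: int) -> Counter:
    """Count every overlapping k-mer (word of length k) in seq."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def d2(x: str, y: str, k: int) -> int:
    """Basic D2 statistic: sum over shared k-mers of count_x * count_y.

    Illustrative helper (hypothetical name); larger values mean more
    shared word content between the two sequences.
    """
    cx, cy = kmer_counts(x, k), kmer_counts(y, k)
    return sum(n * cy[w] for w, n in cx.items() if w in cy)
```

Variants of D2 normalize or center the counts before taking the product; this sketch shows only the unnormalized form.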


Subject(s)
Algorithms , Software , Sequence Analysis , Genomics
5.
Thyroid ; 31(12): 1814-1821, 2021 12.
Article in English | MEDLINE | ID: mdl-34541894

ABSTRACT

Background: The role of minimal extrathyroidal extension (mETE) as a risk factor for persistent papillary thyroid carcinoma (PTC) is still debated. The aims of this study were to assess the clinical impact of mETE as a predictor of worse initial treatment response in PTC patients and to verify the impact of radioiodine therapy after surgery in patients with mETE. Methods: We reviewed all records in the Italian Thyroid Cancer Observatory database and selected 2237 consecutive patients with PTC who satisfied the inclusion criteria (PTC with no lymph node metastases and at least 1 year of follow-up). For each case, we considered initial surgery, histological variant of PTC, tumor diameter, recurrence risk class according to the American Thyroid Association (ATA) risk stratification system, use of radioiodine therapy, and initial therapy response, as suggested by ATA guidelines. Results: At 1-year follow-up, 1831 patients (81.8%) had an excellent response, 296 (13.2%) had an indeterminate response, 55 (2.5%) had a biochemical incomplete response, and 55 (2.5%) had a structural incomplete response. Statistical analysis suggested that mETE (odds ratio [OR] 1.16, p = 0.65), tumor size >2 cm (OR 1.45, p = 0.34), aggressive PTC histology (OR 0.55, p = 0.15), and age at diagnosis (OR 0.90, p = 0.32) were not significant risk factors for a worse initial therapy response. When evaluating the combination of mETE, tumor size, and aggressive PTC histology, the presence of mETE with a >2 cm tumor was significantly associated with a worse outcome (OR 5.27 [95% confidence interval], p = 0.014). The role of radioiodine ablation in patients with mETE was also evaluated. When considering radioiodine treatment, propensity score-based matching was performed, and no significant differences were found between treated and nontreated patients (p = 0.24). 
Conclusions: This study failed to show the prognostic value of mETE in predicting initial therapy response in a large cohort of PTC patients without lymph node metastases. The study suggests that the combination of tumor diameter and mETE can be used as a reliable prognostic factor for persistence and could be easily applied in clinical practice to manage PTC patients with low-to-intermediate risk of recurrent/persistent disease.


Subject(s)
Thyroid Cancer, Papillary/pathology , Thyroid Neoplasms/pathology , Adult , Female , Humans , Iodine Radioisotopes , Longitudinal Studies , Male , Middle Aged , Prospective Studies , Thyroid Cancer, Papillary/therapy , Thyroid Neoplasms/therapy , Thyroidectomy
6.
BMC Bioinformatics ; 22(1): 144, 2021 Mar 22.
Article in English | MEDLINE | ID: mdl-33752596

ABSTRACT

BACKGROUND: Storage of genomic data is a major cost for the Life Sciences, effectively addressed via specialized data compression methods. For the same reasons of abundance in data production, the use of Big Data technologies is seen as the future for genomic data storage and processing, with MapReduce-Hadoop as leaders. Somewhat surprisingly, none of the specialized FASTA/Q compressors is available within Hadoop. Indeed, their deployment there is not exactly immediate. Such a State of the Art is problematic. RESULTS: We provide major advances in two different directions. Methodologically, we propose two general methods, with the corresponding software, that make it very easy to deploy a specialized FASTA/Q compressor within MapReduce-Hadoop for processing files stored on the distributed Hadoop File System, with very little knowledge of Hadoop. Practically, we provide evidence that the deployment of those specialized compressors within Hadoop, not available so far, results in better space savings, and even in better execution times over compressed data, with respect to the use of generic compressors available in Hadoop, in particular for FASTQ files. Finally, we observe that these results also hold for the Apache Spark framework, when used to process FASTA/Q files stored on the Hadoop File System. CONCLUSIONS: Our methods and the corresponding software contribute substantially to achieving space and time savings for the storage and processing of FASTA/Q files in Hadoop and Spark. Because our approach is general, it is very likely that it can also be applied to FASTA/Q compression methods that will appear in the future. AVAILABILITY: The software and the datasets are available at https://github.com/fpalini/fastdoopc.


Subject(s)
Data Compression , Genomics , Software , Algorithms , Big Data
7.
Bioinformatics ; 37(12): 1658-1665, 2021 Jul 19.
Article in English | MEDLINE | ID: mdl-33471066

ABSTRACT

MOTIVATION: Alignment-free distance and similarity functions (AF functions, for short) are a well-established alternative to pairwise and multiple sequence alignments for many genomic, metagenomic and epigenomic tasks. Due to data-intensive applications, the computation of AF functions is a Big Data problem, with the recent literature indicating that the development of fast and scalable algorithms computing AF functions is a high-priority task. Somewhat surprisingly, despite the increasing popularity of Big Data technologies in computational biology, the development of a Big Data platform for those tasks has not been pursued, possibly due to its complexity. RESULTS: We fill this important gap by introducing FADE, the first extensible, efficient and scalable Spark platform for alignment-free genomic analysis. It natively supports eighteen of the best performing AF functions coming out of a recent hallmark benchmarking study. FADE development and potential impact comprise novel aspects of interest. Namely, (i) a considerable effort of distributed algorithms, the most tangible result being a much faster execution time of reference methods like MASH and FSWM; (ii) a software design that makes FADE user-friendly and easily extendable by Spark non-specialists; (iii) its ability to support data- and compute-intensive tasks. In this regard, we provide a novel and much needed analysis of how informative and robust AF functions are, in terms of the statistical significance of their output. Our findings naturally extend the ones of the highly regarded benchmarking study, since the functions that can really be used are reduced to a handful of the eighteen included in FADE. AVAILABILITY AND IMPLEMENTATION: The software and the datasets are available at https://github.com/fpalini/fade. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
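
As a small illustration of an alignment-free similarity function of the kind discussed above, the Python sketch below computes the Jaccard index between the k-mer sets of two sequences. MASH, one of the reference methods named in the abstract, estimates this quantity from MinHash sketches; the code here computes it exactly, which is only practical for small inputs, and it is not taken from FADE. The function names are assumptions made for this example.

```python
def kmer_set(seq: str, k: int) -> set:
    """Set of distinct overlapping k-mers in seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def jaccard(x: str, y: str, k: int) -> float:
    """Exact Jaccard index of the two k-mer sets (hypothetical helper).

    Returns 0.0 when neither sequence contains a k-mer of length k.
    """
    a, b = kmer_set(x, k), kmer_set(y, k)
    union = a | b
    return len(a & b) / len(union) if union else 0.0
```

A value of 1.0 means the two sequences contain exactly the same set of k-mers; 0.0 means they share none.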

8.
Thyroid ; 31(2): 264-271, 2021 02.
Article in English | MEDLINE | ID: mdl-32475305

ABSTRACT

Background: One of the most widely used risk stratification systems for estimating individual patients' risk of persistent or recurrent differentiated thyroid cancer (DTC) is the American Thyroid Association (ATA) guidelines. The 2015 ATA version, which has increased the number of patients considered at low or intermediate risk, has been validated in several retrospective, single-center studies. The aims of this study were to evaluate the real-world performance of the 2015 ATA risk stratification system in predicting the response to treatment 12 months after the initial treatment and to determine the extent to which this performance is affected by the treatment center in which it is used. Methods: A prospective cohort of DTC patients collected by the Italian Thyroid Cancer Observatory web-based database was analyzed. We reviewed all records present in the database and selected consecutive cases that satisfied inclusion criteria: (i) histological diagnosis of DTC, with the exclusion of noninvasive follicular thyroid neoplasm with papillary-like nuclear features; (ii) complete data of the initial treatment and pathological features; and (iii) results of 1-year follow-up visit (6-18 months after the initial treatment), including all data needed to classify the estimated response to treatment. Results: The final cohort was composed of 2071 patients from 40 centers. The ATA risk of persistent/recurrent disease was classified as low in 1109 patients (53.6%), intermediate in 796 (38.4%), and high in 166 (8.0%). Structural incomplete responses were documented in only 86 (4.2%) patients: 1.5% in the low-risk, 5.7% in the intermediate-risk, and 14.5% in the high-risk group. The baseline ATA risk class proved to be a significant predictor of structural persistent disease, both for intermediate-risk (odds ratio [OR] 4.67; 95% confidence interval [CI] 2.59-8.43) and high-risk groups (OR 16.48; CI 7.87-34.5). 
Individual center did not significantly influence the prediction of the 1-year disease status. Conclusions: The ATA risk stratification system is a reliable predictor of short-term outcomes in patients with DTC in real-world clinical settings characterized by center heterogeneity in terms of size, location, level of care, local management strategies, and resource availability.


Subject(s)
Cell Differentiation , Decision Support Techniques , Iodine Radioisotopes/therapeutic use , Lymph Node Excision , Radiopharmaceuticals/therapeutic use , Thyroid Neoplasms/therapy , Thyroidectomy , Adult , Databases, Factual , Female , Humans , Iodine Radioisotopes/adverse effects , Italy , Lymph Node Excision/adverse effects , Male , Middle Aged , Neoplasm Recurrence, Local , Predictive Value of Tests , Prospective Studies , Radiopharmaceuticals/adverse effects , Risk Assessment , Risk Factors , Thyroid Neoplasms/diagnostic imaging , Thyroid Neoplasms/pathology , Thyroidectomy/adverse effects , Time Factors , Treatment Outcome
9.
J Comput Biol ; 28(3): 283-295, 2021 03.
Article in English | MEDLINE | ID: mdl-33103913

ABSTRACT

We discuss the challenge of comparing three gene prioritization methods: network propagation, integer linear programming rank aggregation (RA), and statistical RA. These methods are based on different biological categories and estimate disease-gene association. Previously proposed comparison schemes are based on three measures of performance: receiver operating curve, area under the curve, and median rank ratio. Although they may capture important aspects of gene prioritization performance, they may fail to capture important differences in the rankings of individual genes. We suggest that comparison schemes could be improved by also considering recently proposed measures of similarity between gene rankings. We tested this suggestion on comparison schemes for prioritizations of genes associated with autism that were obtained using brain- and tissue-specific data. Our results show the effectiveness of our measures of similarity in clustering brain regions based on their relevance to autism.
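
The abstract above does not name the similarity measures it uses, so the Python sketch below substitutes a standard one, the Kendall tau distance, purely to show what comparing two gene rankings item by item can look like. The function name and the gene labels in the usage note are hypothetical.

```python
from itertools import combinations
from typing import Sequence


def kendall_tau_distance(rank_a: Sequence[str], rank_b: Sequence[str]) -> int:
    """Number of gene pairs that the two rankings order differently.

    Both rankings must contain the same genes, each exactly once.
    """
    if len(set(rank_a)) != len(rank_a) or set(rank_a) != set(rank_b) \
            or len(rank_a) != len(rank_b):
        raise ValueError("rankings must be permutations of the same genes")
    pos_b = {gene: i for i, gene in enumerate(rank_b)}
    # combinations() yields (g, h) with g ranked before h in rank_a;
    # the pair is discordant when rank_b places h before g.
    return sum(1 for g, h in combinations(rank_a, 2) if pos_b[g] > pos_b[h])
```

For example, `kendall_tau_distance(["g1", "g2", "g3"], ["g2", "g1", "g3"])` counts one discordant pair. Unlike an area-under-the-curve summary, this kind of measure changes whenever any pair of individual genes swaps order.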


Subject(s)
Autistic Disorder/genetics , Algorithms , Brain/pathology , Cluster Analysis , Gene Regulatory Networks/genetics , Genetic Predisposition to Disease/genetics , Humans
10.
BMC Bioinformatics ; 20(Suppl 4): 138, 2019 Apr 18.
Article in English | MEDLINE | ID: mdl-30999863

ABSTRACT

BACKGROUND: Distributed approaches based on the MapReduce programming paradigm have started to be proposed in the Bioinformatics domain, due to the large amount of data produced by the next-generation sequencing techniques. However, the use of MapReduce and related Big Data technologies and frameworks (e.g., Apache Hadoop and Spark) does not necessarily produce satisfactory results, in terms of both efficiency and effectiveness. We discuss how the development of distributed and Big Data management technologies has affected the analysis of large datasets of biological sequences. Moreover, we show how the choice of different parameter configurations and the careful engineering of the software with respect to the specific framework under consideration may be crucial in order to achieve good performance, especially on very large amounts of data. We choose k-mer counting as a case study for our analysis, and Spark as the framework to implement FastKmer, a novel approach for the extraction of k-mer statistics from large collections of biological sequences, with arbitrary values of k. RESULTS: One of the most relevant contributions of FastKmer is the introduction of a module for balancing the statistics aggregation workload over the nodes of a computing cluster, in order to overcome data skew while allowing for a full exploitation of the underlying distributed architecture. We also present the results of a comparative experimental analysis showing that our approach is currently the fastest among the ones based on Big Data technologies, while exhibiting very good scalability. CONCLUSIONS: We provide evidence that the usage of technologies such as Hadoop or Spark for the analysis of big datasets of biological sequences is productive only if the architectural details and the peculiar aspects of the considered framework are carefully taken into account for the algorithm design and implementation.
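
To make the data-skew issue mentioned above concrete, here is a single-process Python sketch of k-mer counting in a map/partition/reduce style. It is not FastKmer and does not reproduce its balancing module: it uses plain hash partitioning (via `zlib.crc32`, chosen here only because it is deterministic across runs) so that the per-partition load, which is what a balancing step would try to even out, can be inspected. All function names are assumptions for this example.

```python
import zlib
from collections import Counter
from typing import Iterable, Iterator, List, Tuple


def map_kmers(seq: str, k: int) -> Iterator[Tuple[str, int]]:
    """Map step: emit (k-mer, 1) for every overlapping k-mer."""
    for i in range(len(seq) - k + 1):
        yield seq[i:i + k], 1


def partition_of(kmer: str, n_partitions: int) -> int:
    """Assign a k-mer to a partition with a deterministic hash."""
    return zlib.crc32(kmer.encode("ascii")) % n_partitions


def count_kmers_partitioned(seqs: Iterable[str], k: int,
                            n_partitions: int) -> List[Counter]:
    """Reduce step: each partition aggregates only its own k-mers."""
    partitions = [Counter() for _ in range(n_partitions)]
    for seq in seqs:
        for kmer, one in map_kmers(seq, k):
            partitions[partition_of(kmer, n_partitions)][kmer] += one
    return partitions


def partition_loads(partitions: List[Counter]) -> List[int]:
    """Number of k-mer occurrences handled by each partition."""
    return [sum(p.values()) for p in partitions]
```

Calling `partition_loads(count_kmers_partitioned(["ACGTACGT", "AAAAAA"], 3, 4))` shows how repetitive input concentrates work: every occurrence of `AAA` lands in the same partition regardless of how many partitions exist.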


Subject(s)
Data Analysis , Databases, Nucleic Acid , Genome , Statistics as Topic , Algorithms , Base Sequence , Software , Time Factors
11.
Bioinformatics ; 34(11): 1826-1833, 2018 06 01.
Article in English | MEDLINE | ID: mdl-29342232

ABSTRACT

Motivation: Information-theoretic and compositional/linguistic analyses of genomes have a central role in bioinformatics, even more so since the associated methodologies are also becoming very valuable for epigenomic and metagenomic studies. The kernel of those methods is the collection of k-mer statistics, i.e. how many times each k-mer in {A,C,G,T}^k occurs in a DNA sequence. Although this problem is computationally very simple and efficiently solvable on a conventional computer, the sheer amount of data now available in applications demands resorting to parallel and distributed computing. Indeed, algorithms of that type have been developed to collect k-mer statistics in the realm of genome assembly. However, they are so specialized to this domain that they do not extend easily to the computation of informational and linguistic indices concurrently on sets of genomes. Results: Following the approach, well established in many disciplines and increasingly successful in bioinformatics, of resorting to MapReduce and Hadoop to deal with 'Big Data' problems, we present KCH, the first set of MapReduce algorithms able to perform concurrently informational and linguistic analysis of large collections of genomic sequences on a Hadoop cluster. The benchmarking of KCH that we provide indicates that it is quite effective and versatile. It is also competitive with respect to the parallel and distributed algorithms highly specialized to k-mer statistics collection for genome assembly problems. In conclusion, KCH is a much needed addition to the growing number of algorithms and tools that use MapReduce for bioinformatics core applications. Availability and implementation: The software, including instructions for running it over Amazon AWS, as well as the datasets are available at http://www.di-srv.unisa.it/KCH. Contact: umberto.ferraro@uniroma1.it. Supplementary information: Supplementary data are available at Bioinformatics online.
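
The definition of k-mer statistics given above can be written down directly. The Python sketch below is a sequential illustration, not KCH: it reports a count for every word in {A,C,G,T}^k, including words that never occur, and the function name `kmer_statistics` is an assumption for this example.

```python
from collections import Counter
from itertools import product
from typing import Dict


def kmer_statistics(seq: str, k: int) -> Dict[str, int]:
    """Occurrences of each of the 4**k possible DNA k-mers in seq.

    Windows containing characters outside A, C, G, T (for example N)
    are not reported, because only words over {A,C,G,T} are listed.
    """
    observed = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    return {"".join(w): observed["".join(w)]
            for w in product("ACGT", repeat=k)}
```

The result has 4^k entries, which is why the table itself becomes large as k grows even before the input size is considered.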


Subject(s)
Genomics/methods , Linguistics , Sequence Analysis, DNA/methods , Software , Algorithms , Animals , Bacteria/genetics , Cluster Analysis , Epigenomics/methods , Eukaryota/genetics , Humans , Metagenome
12.
Bioinformatics ; 33(10): 1575-1577, 2017 May 15.
Article in English | MEDLINE | ID: mdl-28093410

ABSTRACT

SUMMARY: MapReduce Hadoop bioinformatics applications require the availability of special-purpose routines to manage the input of sequence files. Unfortunately, the Hadoop framework does not provide any built-in support for the most popular sequence file formats like FASTA or BAM. Moreover, the development of these routines is not easy, both because of the diversity of these formats and the need for efficiently managing sequence datasets that may count up to billions of characters. We present FASTdoop, a generic Hadoop library for the management of FASTA and FASTQ files. We show that, with respect to analogous input management routines that have appeared in the literature, it offers versatility and efficiency. That is, it can handle collections of reads, with or without quality scores, as well as long genomic sequences, while the existing routines concentrate mainly on NGS sequence data. Moreover, in the domain where a comparison is possible, the routines proposed here are faster than the available ones. In conclusion, FASTdoop is a much needed addition to Hadoop-BAM. AVAILABILITY AND IMPLEMENTATION: The software and the datasets are available at http://www.di.unisa.it/FASTdoop/ . CONTACT: umberto.ferraro@uniroma1.it. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
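
For context on what a FASTA input routine has to do, the following Python sketch reads FASTA records from a single text stream. It is not FASTdoop's API and it sidesteps the hard part a Hadoop library must handle, namely records that straddle the boundary between two input splits. The function name `read_fasta` and the sample records are assumptions for this example.

```python
import io
from typing import Iterator, TextIO, Tuple


def read_fasta(handle: TextIO) -> Iterator[Tuple[str, str]]:
    """Yield (header, sequence) pairs from a FASTA text stream.

    Sequence lines belonging to one record are concatenated; blank
    lines and any text before the first '>' header are skipped.
    """
    header, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


# Usage with an in-memory stream (hypothetical records).
sample = io.StringIO(">s1 desc\nACGT\nAC\n\n>s2\nGG\n")
records = list(read_fasta(sample))
```

Because the reader is a generator, a long genomic sequence is only held in memory one record at a time.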


Subject(s)
Database Management Systems , Genomics/methods , Information Storage and Retrieval , Sequence Analysis, DNA/methods , Gene Library , Genome, Human , Humans