1.
PLoS Comput Biol ; 19(7): e1011272, 2023 07.
Article in English | MEDLINE | ID: mdl-37471333

ABSTRACT

Some scientific studies involve huge amounts of bioinformatics data that cannot be analyzed on personal computers usually employed by researchers for day-to-day activities but rather necessitate effective computational infrastructures that can work in a distributed way. For this purpose, distributed computing systems have become useful tools to analyze large amounts of bioinformatics data and to generate relevant results on virtual environments, where software can be executed for hours or even days without affecting the personal computer or laptop of a researcher. Even if distributed computing resources have become pivotal in multiple bioinformatics laboratories, often researchers and students use them in the wrong ways, making mistakes that can cause the distributed computers to underperform or that can even generate wrong outcomes. In this context, we present here ten quick tips for the usage of Apache Spark distributed computing systems for bioinformatics analyses: ten simple guidelines that, if taken into account, can help users avoid common mistakes and can help them run their bioinformatics analyses smoothly. Even if we designed our recommendations for beginners and students, they should be followed by experts too. We think our quick tips can help anyone make use of Apache Spark distributed computing systems more efficiently and ultimately help generate better, more reliable scientific results.


Subjects
Computational Biology, Software, Humans, Computational Biology/methods, Computers, Computer Communication Networks
2.
Bioinformatics ; 38(4): 925-932, 2022 01 27.
Article in English | MEDLINE | ID: mdl-34718420

ABSTRACT

MOTIVATION: Alignment-free (AF) distance/similarity functions are a key tool for sequence analysis. Experimental studies on real datasets abound and, to some extent, there are also studies regarding their control of the false positive rate (Type I error). However, assessment of their power, i.e. their ability to identify true similarity, has been limited to some members of the D2 family. The corresponding experimental studies have concentrated on short sequences, a scenario no longer adequate for current applications, where sequence lengths may vary considerably. Such a state of the art is methodologically problematic, since information regarding a key feature such as power is either missing or limited. RESULTS: By concentrating on a representative set of word-frequency-based AF functions, we perform the first coherent and uniform evaluation of their power, also including Type I error for completeness. Two alternative models of important genomic features (CIS Regulatory Modules and Horizontal Gene Transfer), a wide range of sequence lengths from a few thousand to millions, and different values of k have been used. As a result, we provide a characterization of those AF functions that is novel and informative. Indeed, we identify weak and strong points of each function considered, which may be used as a guide to choose one for analysis tasks. Remarkably, of the 15 functions that we have considered, only four stand out, with small differences between short and long sequence length scenarios. Finally, to encourage the use of our methodology for validation of future AF functions, the Big Data platform supporting it is public. AVAILABILITY AND IMPLEMENTATION: The software is available at: https://github.com/pipp8/power_statistics. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.


Subjects
Algorithms, Software, Sequence Analysis, Genomics
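
The abstract above refers to word-frequency-based AF functions and to the D2 family without defining them. As a rough illustration only, the sketch below computes the textbook D2 statistic (the sum, over all words of length k, of the product of their counts in the two sequences). The function names `kmer_counts` and `d2` are hypothetical, and this is not the code from the linked repository.

```python
from collections import Counter


def kmer_counts(seq: str, k: int) -> Counter:
    """Count every overlapping word of length k (k-mer) in seq."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def d2(seq_a: str, seq_b: str, k: int) -> int:
    """Textbook D2 statistic: sum over words w of count_a(w) * count_b(w).

    Assumption: plain, unnormalized D2; variants in the D2 family
    rescale these counts in different ways.
    """
    counts_a = kmer_counts(seq_a, k)
    counts_b = kmer_counts(seq_b, k)
    # Counter returns 0 for words that are absent from seq_b.
    return sum(n * counts_b[word] for word, n in counts_a.items())


if __name__ == "__main__":
    # Two toy sequences; a larger value means more shared words.
    print(d2("ACGTACGT", "ACGTTTTT", k=3))
```

A higher D2 value indicates more shared word content, so it acts as a similarity rather than a distance; the choice of k trades sensitivity against specificity.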
3.
Bioinformatics ; 37(12): 1658-1665, 2021 Jul 19.
Article in English | MEDLINE | ID: mdl-33471066

ABSTRACT

MOTIVATION: Alignment-free distance and similarity functions (AF functions, for short) are a well-established alternative to pairwise and multiple sequence alignments for many genomic, metagenomic and epigenomic tasks. Due to data-intensive applications, the computation of AF functions is a Big Data problem, with the recent literature indicating that the development of fast and scalable algorithms computing AF functions is a high-priority task. Somewhat surprisingly, despite the increasing popularity of Big Data technologies in computational biology, the development of a Big Data platform for those tasks has not been pursued, possibly due to its complexity. RESULTS: We fill this important gap by introducing FADE, the first extensible, efficient and scalable Spark platform for alignment-free genomic analysis. It natively supports eighteen of the best performing AF functions coming out of a recent hallmark benchmarking study. FADE development and potential impact comprise novel aspects of interest. Namely, (i) a considerable effort in designing distributed algorithms, the most tangible result being a much faster execution time of reference methods like MASH and FSWM; (ii) a software design that makes FADE user-friendly and easily extendable by Spark non-specialists; (iii) its ability to support data- and compute-intensive tasks. In this regard, we provide a novel and much needed analysis of how informative and robust AF functions are, in terms of the statistical significance of their output. Our findings naturally extend the ones of the highly regarded benchmarking study, since the functions that can really be used are reduced to a handful of the eighteen included in FADE. AVAILABILITY AND IMPLEMENTATION: The software and the datasets are available at https://github.com/fpalini/fade. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.

4.
Sensors (Basel) ; 22(16)2022 Aug 14.
Article in English | MEDLINE | ID: mdl-36015837

ABSTRACT

Face recognition is an important application of pattern recognition and image analysis in biometric security systems. The COVID-19 outbreak has introduced several issues that can negatively affect the reliability of the facial recognition systems currently available: on the one hand, wearing a face mask/covering has led to growth in failure cases, while on the other, the restrictions on direct contact between people can prevent any biometric data being acquired in controlled environments. To effectively address these issues, we designed a hybrid methodology that improves the reliability of facial recognition systems. A well-known Source Camera Identification (SCI) technique, based on Pixel Non-Uniformity (PNU), was applied to analyze the integrity of the input video stream as well as to detect any tampered/fake frames. To examine the behavior of this methodology in real-life use cases, we implemented a prototype that showed two novel properties compared to the current state of the art of biometric systems: (a) high accuracy even when subjects are wearing a face mask; (b) whenever the input video is produced by deepfake techniques (replacing the face of the main subject), the system can recognize that it has been altered, providing more than one alert message. This methodology proved to be both more robust to mask-induced occlusions and more reliable in preventing forgery attacks on the input video stream.


Subjects
Biometric Identification, COVID-19, Facial Recognition, Algorithms, Biometric Identification/methods, Biometry/methods, COVID-19/prevention & control, Humans, Image Processing, Computer-Assisted/methods, Reproducibility of Results
5.
BMC Bioinformatics ; 22(1): 144, 2021 Mar 22.
Article in English | MEDLINE | ID: mdl-33752596

ABSTRACT

BACKGROUND: Storage of genomic data is a major cost for the Life Sciences, effectively addressed via specialized data compression methods. For the same reasons of abundance in data production, the use of Big Data technologies is seen as the future for genomic data storage and processing, with MapReduce-Hadoop as leaders. Somewhat surprisingly, none of the specialized FASTA/Q compressors is available within Hadoop. Indeed, their deployment there is not exactly immediate. Such a state of the art is problematic. RESULTS: We provide major advances in two different directions. Methodologically, we propose two general methods, with the corresponding software, that make it very easy to deploy a specialized FASTA/Q compressor within MapReduce-Hadoop for processing files stored on the distributed Hadoop File System, with very little knowledge of Hadoop. Practically, we provide evidence that the deployment of those specialized compressors within Hadoop, not available so far, results in better space savings, and even in better execution times over compressed data, with respect to the use of generic compressors available in Hadoop, in particular for FASTQ files. Finally, we observe that these results also hold for the Apache Spark framework, when used to process FASTA/Q files stored on the Hadoop File System. CONCLUSIONS: Our methods and the corresponding software substantially contribute to achieving space and time savings for the storage and processing of FASTA/Q files in Hadoop and Spark. Since our approach is general, it is very likely that it can also be applied to FASTA/Q compression methods that will appear in the future. AVAILABILITY: The software and the datasets are available at https://github.com/fpalini/fastdoopc.


Subjects
Data Compression, Genomics, Software, Algorithms, Big Data
6.
BMC Bioinformatics ; 20(Suppl 4): 138, 2019 Apr 18.
Article in English | MEDLINE | ID: mdl-30999863

ABSTRACT

BACKGROUND: Distributed approaches based on the MapReduce programming paradigm have started to be proposed in the Bioinformatics domain, due to the large amount of data produced by next-generation sequencing techniques. However, the use of MapReduce and related Big Data technologies and frameworks (e.g., Apache Hadoop and Spark) does not necessarily produce satisfactory results, in terms of both efficiency and effectiveness. We discuss how the development of distributed and Big Data management technologies has affected the analysis of large datasets of biological sequences. Moreover, we show how the choice of different parameter configurations and the careful engineering of the software with respect to the specific framework under consideration may be crucial in order to achieve good performance, especially on very large amounts of data. We choose k-mer counting as a case study for our analysis, and Spark as the framework to implement FastKmer, a novel approach for the extraction of k-mer statistics from large collections of biological sequences, with arbitrary values of k. RESULTS: One of the most relevant contributions of FastKmer is the introduction of a module for balancing the statistics aggregation workload over the nodes of a computing cluster, in order to overcome data skew while allowing for a full exploitation of the underlying distributed architecture. We also present the results of a comparative experimental analysis showing that our approach is currently the fastest among the ones based on Big Data technologies, while exhibiting very good scalability. CONCLUSIONS: We provide evidence that the usage of technologies such as Hadoop or Spark for the analysis of big datasets of biological sequences is productive only if the architectural details and the peculiar aspects of the considered framework are carefully taken into account for the algorithm design and implementation.


Subjects
Data Analysis, Databases, Nucleic Acid, Genome, Statistics as Topic, Algorithms, Base Sequence, Software, Time Factors
7.
Bioinformatics ; 34(11): 1826-1833, 2018 06 01.
Article in English | MEDLINE | ID: mdl-29342232

ABSTRACT

Motivation: Information theoretic and compositional/linguistic analyses of genomes have a central role in bioinformatics, even more so since the associated methodologies are also becoming very valuable for epigenomic and meta-genomic studies. The kernel of those methods is based on the collection of k-mer statistics, i.e. how many times each k-mer in {A,C,G,T}^k occurs in a DNA sequence. Although this problem is computationally very simple and efficiently solvable on a conventional computer, the sheer amount of data now available in applications demands resorting to parallel and distributed computing. Indeed, those types of algorithms have been developed to collect k-mer statistics in the realm of genome assembly. However, they are so specialized to this domain that they do not extend easily to the computation of informational and linguistic indices, concurrently on sets of genomes. Results: Following the well-established approach in many disciplines, and with a growing success also in bioinformatics, to resort to MapReduce and Hadoop to deal with 'Big Data' problems, we present KCH, the first set of MapReduce algorithms able to concurrently perform informational and linguistic analysis of large collections of genomic sequences on a Hadoop cluster. The benchmarking of KCH that we provide indicates that it is quite effective and versatile. It is also competitive with respect to the parallel and distributed algorithms highly specialized to k-mer statistics collection for genome assembly problems. In conclusion, KCH is a much needed addition to the growing number of algorithms and tools that use MapReduce for bioinformatics core applications. Availability and implementation: The software, including instructions for running it over Amazon AWS, as well as the datasets are available at http://www.di-srv.unisa.it/KCH. Contact: umberto.ferraro@uniroma1.it. Supplementary information: Supplementary data are available at Bioinformatics online.


Subjects
Genomics/methods, Linguistics, Sequence Analysis, DNA/methods, Software, Algorithms, Animals, Bacteria/genetics, Cluster Analysis, Epigenomics/methods, Eukaryota/genetics, Humans, Metagenome
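
The k-mer statistics described in the abstract above (how many times each k-mer over {A,C,G,T} occurs) can be sketched in a map/reduce style on a single machine. This is only a minimal illustration of the counting pattern under stated assumptions: the names `map_kmers`, `reduce_counts` and `kmer_statistics` are hypothetical, words containing symbols outside A, C, G, T are skipped by choice, and none of this reproduces KCH or its Hadoop implementation.

```python
from collections import Counter
from functools import reduce
from typing import Iterable

ALPHABET = set("ACGT")


def map_kmers(seq: str, k: int) -> Counter:
    """Map step: count the k-mers of one sequence.

    Assumption: words containing symbols outside {A,C,G,T}
    (for example N) are ignored.
    """
    seq = seq.upper()
    words = (seq[i:i + k] for i in range(len(seq) - k + 1))
    return Counter(w for w in words if set(w) <= ALPHABET)


def reduce_counts(left: Counter, right: Counter) -> Counter:
    """Reduce step: merge two partial count tables."""
    return left + right


def kmer_statistics(sequences: Iterable[str], k: int) -> Counter:
    """Aggregate k-mer counts over a collection of sequences."""
    partial = (map_kmers(s, k) for s in sequences)
    return reduce(reduce_counts, partial, Counter())


if __name__ == "__main__":
    stats = kmer_statistics(["ACGT", "CGTA"], k=2)
    for word, count in sorted(stats.items()):
        print(word, count)
```

In a distributed setting the map step would run independently on each block of input and the reduce step would merge partial tables across nodes; the single-process version only shows why the problem decomposes that way.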
8.
Bioinformatics ; 33(10): 1575-1577, 2017 May 15.
Article in English | MEDLINE | ID: mdl-28093410

ABSTRACT

SUMMARY: MapReduce Hadoop bioinformatics applications require the availability of special-purpose routines to manage the input of sequence files. Unfortunately, the Hadoop framework does not provide any built-in support for the most popular sequence file formats like FASTA or BAM. Moreover, the development of these routines is not easy, both because of the diversity of these formats and the need for efficiently managing sequence datasets that may count up to billions of characters. We present FASTdoop, a generic Hadoop library for the management of FASTA and FASTQ files. We show that, with respect to analogous input management routines that have appeared in the literature, it offers versatility and efficiency. That is, it can handle collections of reads, with or without quality scores, as well as long genomic sequences, while the existing routines concentrate mainly on NGS sequence data. Moreover, in the domain where a comparison is possible, the routines proposed here are faster than the available ones. In conclusion, FASTdoop is a much needed addition to Hadoop-BAM. AVAILABILITY AND IMPLEMENTATION: The software and the datasets are available at http://www.di.unisa.it/FASTdoop/. CONTACT: umberto.ferraro@uniroma1.it. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.


Subjects
Database Management Systems, Genomics/methods, Information Storage and Retrieval, Sequence Analysis, DNA/methods, Gene Library, Genome, Human, Humans
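
To make concrete what "managing the input of sequence files" involves, the sketch below reads FASTA records from a stream of text lines on a single machine. It is a plain-Python illustration with a hypothetical function name, `read_fasta`, and it is not the FASTdoop API; a distributed reader additionally has to cope with records that straddle file-split boundaries, which this sketch does not attempt.

```python
import io
from typing import Iterable, Iterator, Tuple


def read_fasta(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (header, sequence) pairs from FASTA-formatted lines.

    A record starts at a line beginning with '>' and its sequence may
    span several lines. Blank lines are skipped, and any sequence text
    that appears before the first header is ignored.
    """
    header = None
    chunks = []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                # Emit the record that just ended.
                yield header, "".join(chunks)
            header = line[1:]
            chunks = []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        # Emit the final record.
        yield header, "".join(chunks)


if __name__ == "__main__":
    demo = io.StringIO(">seq1 toy record\nACGT\nAC\n\n>seq2\nGG\n")
    for name, sequence in read_fasta(demo):
        print(name, len(sequence))
```

Because the function is a generator over lines, it keeps only one record in memory at a time, which matters when a single sequence may run to billions of characters only in the sense that each record is still materialized whole; chunked handling of very long sequences would need a different design.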
10.
Bioinformation ; 10(1): 43-7, 2014.
Article in English | MEDLINE | ID: mdl-24516326

ABSTRACT

MOTIVATION: Biologists and chemists are facing problems of high computational complexity that require the use of several computers organized in clusters or in specialized grids. Examples of such problems can be found in molecular dynamics (MD), in silico screening, and genome analysis. Grid Computing and Cloud Computing are becoming prevalent mainly because of their competitive performance/cost ratio. Regrettably, the diffusion of Grid Computing is strongly limited by two main factors: it is confined to scientists with a strong Computer Science background, and the analysis of the large amount of data produced can be cumbersome. We have developed a package named GRIMD to provide an easy and flexible implementation of distributed computing for the Bioinformatics community. GRIMD is very easy to install and maintain, and it does not require any specific Computer Science skill. Moreover, it permits preliminary analysis on the distributed machines to reduce the amount of data to transfer. GRIMD is very flexible because it shields the typical computational biologist from the need to write specific code for tasks such as molecular dynamics or docking calculations. Furthermore, it permits an efficient use of GPU cards whenever possible. GRIMD calculations scale almost linearly and therefore permit efficient exploitation of each machine in the network. Here, we provide a few examples of grid computing in computational biology (MD and docking) and bioinformatics (proteome analysis). AVAILABILITY: GRIMD is available for free for noncommercial research at www.yadamp.unisa.it/grimd. SUPPLEMENTARY INFORMATION: www.yadamp.unisa.it/grimd/howto.aspx.
