Results 1 - 18 of 18
1.
PLoS One ; 18(5): e0285471, 2023.
Article in English | MEDLINE | ID: mdl-37200293

ABSTRACT

This methodological article aims to establish a bridge between classification and regression tasks, within a framework shaped by performance evaluation. More specifically, a general procedure for calculating performance measures is proposed, which can be applied to both classification and regression models. To this end, a notable change is made in the policy used to evaluate the confusion matrix, with the goal of reporting regression performance information therein. This policy, called generalized token sharing, makes it possible to a) assess models trained on both classification and regression tasks, b) evaluate the importance of input features, and c) inspect the behavior of multilayer perceptrons by looking at their hidden layers. The occurrence of success and failure patterns at the hidden layers of multilayer perceptrons trained and tested on selected regression problems, together with the effectiveness of layer-wise training, is also discussed.
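The generalized token-sharing policy itself is not described here in enough detail to reproduce. As a purely hypothetical Python sketch of the underlying idea (reporting regression performance inside a confusion-matrix structure), predictions and targets can be discretized into bins; the bin edges and function names below are illustrative assumptions, not the paper's method.

```python
# Hypothetical sketch: populating a confusion-matrix-like structure from
# a regression task by discretizing targets and predictions into bins.
# This is NOT the paper's generalized token-sharing policy, only a toy analogue.

def bin_index(value, edges):
    """Return the index of the bin that contains value."""
    for i, edge in enumerate(edges):
        if value < edge:
            return i
    return len(edges)

def regression_confusion(y_true, y_pred, edges):
    """Build a (len(edges)+1) x (len(edges)+1) confusion matrix."""
    n = len(edges) + 1
    matrix = [[0] * n for _ in range(n)]
    for t, p in zip(y_true, y_pred):
        matrix[bin_index(t, edges)][bin_index(p, edges)] += 1
    return matrix

m = regression_confusion([0.1, 0.4, 0.9], [0.2, 0.8, 0.95], edges=[0.5])
# diagonal counts correspond to predictions falling in the correct bin
```

With this toy discretization, the usual classification measures (accuracy, per-class recall, and so on) can then be read off the matrix, which is the spirit of bridging the two task families.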

2.
Heliyon ; 9(2): e13368, 2023 Feb.
Article in English | MEDLINE | ID: mdl-36852030

ABSTRACT

Advances in high-throughput and digital technologies have required the adoption of big data for handling complex tasks in life sciences. However, the shift to big data has forced researchers to face technical and infrastructural challenges for storing, sharing, and analysing these data. In fact, such tasks require distributed computing systems and algorithms able to ensure efficient processing. Cutting-edge distributed programming frameworks make it possible to implement flexible algorithms that adapt the computation to the data, over on-premise HPC clusters or cloud architectures. In this context, Apache Spark is a very powerful HPC engine for large-scale data processing on clusters. Thanks also to its specialised libraries for working with structured and relational data, it supports machine learning, graph-based computation, and stream processing. This review article is aimed at helping life sciences researchers ascertain the features of Apache Spark and assess whether it can be successfully used in their research activities.

3.
Sci Rep ; 10(1): 21334, 2020 12 07.
Article in English | MEDLINE | ID: mdl-33288773

ABSTRACT

Understanding the inner behaviour of multilayer perceptrons during and after training is a goal of paramount importance for many researchers worldwide. This article experimentally shows that relevant patterns emerge upon training, which are typically related to the underlying problem difficulty. The occurrence of these patterns is highlighted by means of [Formula: see text] diagrams, a 2D graphical tool originally devised to support the work of researchers on classifier performance evaluation and on feature assessment. On the assumption that multilayer perceptrons are powerful engines for feature encoding, hidden layers have been inspected as if they were in fact hosting new input features. Interestingly, some problems appear difficult when dealt with using a single hidden layer, whereas they turn out to be easier upon the addition of further layers. The experimental findings reported in this article lend further support to the view that implementing neural architectures with multiple layers may help to boost their generalisation ability. A generic training strategy inspired by some relevant recommendations of deep learning has also been devised. A basic implementation of this strategy has been used throughout the experiments aimed at identifying relevant patterns inside multilayer perceptrons. Further experiments performed in a comparative setting have shown that it could be adopted as a viable alternative to the classical backpropagation algorithm.

4.
BMC Bioinformatics ; 19(Suppl 10): 352, 2018 Oct 15.
Article in English | MEDLINE | ID: mdl-30367567

ABSTRACT

This preface introduces the content of the BioMed Central journal Supplement related to the 14th annual meeting of the Bioinformatics Italian Society, held in Cagliari, Italy, from the 5th to the 7th of July, 2017.


Subjects
Computational Biology, Congresses as Topic, Humans, Italy, Periodicals as Topic
5.
Sci Rep ; 7(1): 1781, 2017 05 11.
Article in English | MEDLINE | ID: mdl-28496113

ABSTRACT

Innovation is a key ingredient for the evolution of several systems, including social and biological ones. Focused investigations and lateral thinking may lead to innovation, as may serendipity and other random discovery processes. Some individuals are talented at proposing innovations (innovators), while others excel at deeply exploring proposed novelties, gaining further insights into a theory, or developing products, services, and so on (developers). This separation into innovators and developers raises an issue of paramount importance: under which conditions is a system able to maintain innovators? Using a simple model, this work investigates the evolutionary dynamics that characterize the emergence of innovation. In particular, we consider a population of innovators and developers, in which agents form small groups whose composition is crucial for their payoff. The payoff depends on the heterogeneity of the formed groups, on the number of innovators they include, and on an award factor that represents the policy of the system for promoting innovation. Under the hypothesis that a "mobility" effect may support the emergence of innovation, we compare the equilibria reached by our population in different cases. Results confirm the beneficial role of "mobility" and the emergence of further interesting phenomena.


Subjects
Inventions, Theoretical Models, Algorithms, Humans
6.
Bioinformatics ; 32(18): 2872-4, 2016 09 15.
Article in English | MEDLINE | ID: mdl-27256314

ABSTRACT

UNLABELLED: RANKS is a flexible software package that can be easily applied to any bioinformatics task formalizable as ranking of nodes with respect to a property given as a label, such as automated protein function prediction, gene-disease prioritization and drug repositioning. To this end, RANKS provides an efficient and easy-to-use implementation of kernelized score functions, a semi-supervised algorithmic scheme embedding both local and global learning strategies for the analysis of biomolecular networks. To facilitate comparative assessment, baseline network-based methods, e.g. label propagation and random walk algorithms, have also been implemented. AVAILABILITY AND IMPLEMENTATION: The package is available from CRAN: https://cran.r-project.org/. The package is written in R, except for the most computationally intensive functionalities, which are implemented in C. CONTACT: valentini@di.unimi.it SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
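To illustrate one of the baseline network-based methods mentioned above, the following is a toy label-propagation sketch in Python (RANKS itself is an R/C package, and this is not its implementation): nodes iteratively average the scores of their neighbours, with labelled (seed) nodes clamped to their known value; the graph and seed values are invented for illustration.

```python
# Toy label propagation on a small graph, in the spirit of the baseline
# methods cited above (not the RANKS implementation). Unlabelled nodes
# repeatedly take the mean score of their neighbours; seed nodes stay fixed.

def label_propagation(adj, seeds, iterations=50):
    """adj: {node: [neighbours]}, seeds: {node: score in [0, 1]}."""
    scores = {node: seeds.get(node, 0.0) for node in adj}
    for _ in range(iterations):
        new_scores = {}
        for node, neighbours in adj.items():
            if node in seeds:
                new_scores[node] = seeds[node]  # clamp labelled nodes
            elif neighbours:
                new_scores[node] = sum(scores[v] for v in neighbours) / len(neighbours)
            else:
                new_scores[node] = 0.0
        scores = new_scores
    return scores

# chain a - b - c - d, with "a" the only positively labelled node
adj = {"a": ["b"], "b": ["a", "c"], "c": ["b", "d"], "d": ["c"]}
ranking = label_propagation(adj, seeds={"a": 1.0}, iterations=10)
# nodes closer to the seed "a" receive higher scores
```

Ranking candidate genes or proteins by such scores is exactly the "ranking of nodes with respect to a property given as a label" that the abstract describes.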


Subjects
Drug Repositioning, Software, Algorithms, Computational Biology/methods, Factual Databases, Genomics, Humans, Proteins, Systems Biology
7.
BMC Bioinformatics ; 17(Suppl 12): 346, 2016 Nov 08.
Article in English | MEDLINE | ID: mdl-28185553

ABSTRACT

BACKGROUND: During library construction, the polymerase chain reaction is used to enrich the DNA before sequencing. Typically, this process generates duplicate read sequences. Removal of these artifacts is mandatory, as they can affect the correct interpretation of data in several analyses. Ideally, duplicate reads should be characterized by identical nucleotide sequences. However, due to sequencing errors, duplicates may also be nearly identical. Removing nearly-identical duplicates can require a notable computational effort. To deal with this challenge, we recently proposed a GPU method aimed at removing identical and nearly-identical duplicates generated with an Illumina platform. The method implements an approach based on prefix-suffix comparison. Read sequences with an identical prefix are considered potential duplicates. Then, their suffixes are compared to identify and remove those that are actually duplicated. Although the method can be efficiently used to remove duplicates, some limitations need to be overcome. In particular, it cannot detect potential duplicates when prefixes are longer than 27 bases, and it does not provide support for paired-end read libraries. Moreover, large clusters of potential duplicates are split into smaller ones to guarantee a reasonable computing time. This heuristic may affect the accuracy of the analysis. RESULTS: In this work we propose GPU-DupRemoval, a new implementation of our method able to (i) cluster reads without constraints on the maximum length of the prefixes, (ii) support both single- and paired-end read libraries, and (iii) analyze large clusters of potential duplicates. CONCLUSIONS: Due to the massive parallelization obtained by exploiting graphics cards, GPU-DupRemoval removes duplicate reads faster than other cutting-edge solutions, while outperforming most of them in terms of the number of duplicate reads detected.
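The prefix-suffix approach described above can be sketched in plain Python (GPU-DupRemoval itself runs on graphics cards; the reads, prefix length, and mismatch threshold below are illustrative assumptions): reads sharing a prefix are grouped as candidate duplicates, then their suffixes are compared, tolerating a few mismatches to catch nearly-identical duplicates.

```python
# Sketch of prefix-suffix duplicate detection: group reads by a shared
# prefix, then within each group keep only reads whose suffix differs
# from all previously kept reads by more than max_mismatches bases.
from collections import defaultdict

def remove_duplicates(reads, prefix_len=4, max_mismatches=1):
    clusters = defaultdict(list)
    for read in reads:
        clusters[read[:prefix_len]].append(read)  # candidate duplicates

    kept = []
    for candidates in clusters.values():
        unique = []
        for read in candidates:
            is_dup = any(
                len(read) == len(u)
                and sum(a != b for a, b in zip(read[prefix_len:], u[prefix_len:]))
                    <= max_mismatches
                for u in unique
            )
            if not is_dup:
                unique.append(read)
        kept.extend(unique)
    return kept

reads = ["ACGTACGT", "ACGTACGA", "ACGTTTTT", "GGGGCCCC"]
print(remove_duplicates(reads))
```

Here the second read differs from the first by a single base in the suffix, so it is discarded as a near-duplicate; grouping by prefix is what keeps the pairwise suffix comparisons tractable.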


Subjects
Computational Biology/methods, DNA/genetics, DNA Sequence Analysis/methods, Algorithms, Polymerase Chain Reaction
8.
Article in English | MEDLINE | ID: mdl-25806367

ABSTRACT

Copy number variations (CNVs) are the most prevalent type of structural variation (SV) in the human genome and are involved in a wide range of common human diseases. Different computational methods have been devised to detect this type of SV and to study how it is implicated in human diseases. Recently, computational methods based on high-throughput sequencing (HTS) have been increasingly used. The majority of these methods focus on mapping short-read sequences generated from a donor against a reference genome to detect signatures distinctive of CNVs. In particular, read-depth based methods detect CNVs by analyzing genomic regions whose read-depth differs significantly from the others. The analysis pipeline of these methods consists of four main stages: (i) data preparation, (ii) data normalization, (iii) CNV region identification, and (iv) copy number estimation. However, available tools do not support most of the operations required at the first two stages of this pipeline. Typically, they start the analysis by building the read-depth signal from pre-processed alignments, so third-party tools must be used to perform most of the preliminary operations required to build the read-depth signal. These data-intensive operations can be efficiently parallelized on graphics processing units (GPUs). In this article, we present G-CNV, a GPU-based tool devised to perform the common operations required at the first two stages of the analysis pipeline. G-CNV is able to filter low-quality read sequences, mask low-quality nucleotides, remove adapter sequences, remove duplicated read sequences, map the short reads, resolve multiple mapping ambiguities, build the read-depth signal, and normalize it. G-CNV can be efficiently used as a third-party tool to prepare data for the subsequent read-depth signal generation and analysis. Moreover, it can also be integrated into CNV detection tools to generate read-depth signals.

9.
PLoS One ; 9(5): e97277, 2014.
Article in English | MEDLINE | ID: mdl-24842718

ABSTRACT

Cytosine DNA methylation is an epigenetic mark implicated in several biological processes. Bisulfite treatment of DNA is acknowledged as the gold-standard technique to study methylation. This technique introduces changes in the genomic DNA by converting cytosines to uracils, while 5-methylcytosines remain nonreactive. During PCR amplification, 5-methylcytosines are amplified as cytosines, whereas uracils and thymines are amplified as thymines. To detect methylation levels, bisulfite-treated reads must be aligned against a reference genome. Mapping these reads to a reference genome represents a significant computational challenge, mainly due to the increased search space and the loss of information introduced by the treatment. To deal with this challenge we devised GPU-BSM, a tool based on modern Graphics Processing Units. Graphics Processing Units are hardware accelerators that are increasingly being used to accelerate general-purpose scientific applications. GPU-BSM is able to map bisulfite-treated reads from both whole genome bisulfite sequencing and reduced representation bisulfite sequencing, and to estimate methylation levels. Due to the massive parallelization obtained by exploiting graphics cards, GPU-BSM aligns bisulfite-treated reads faster than other cutting-edge solutions, while outperforming most of them in terms of uniquely mapped reads.
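The conversion logic described above can be made concrete with a toy Python sketch (this is not the GPU-BSM implementation, and the single-offset alignment is an illustrative simplification): both the read and the reference are fully C-to-T converted so that bisulfite-induced mismatches disappear, and methylation is then called by comparing the original read against the reference at cytosine sites.

```python
# Toy in-silico bisulfite conversion and methylation calling. In the
# converted space, a bisulfite C->T change no longer counts as a
# mismatch; an unconverted C in the original read marks a 5-methylcytosine.

def c_to_t(seq):
    return seq.replace("C", "T")

def call_methylation(read, reference):
    """Assumes the read aligns to the reference at offset 0 (toy case)."""
    assert c_to_t(read) == c_to_t(reference), "read does not align"
    calls = {}
    for i, ref_base in enumerate(reference):
        if ref_base == "C":
            calls[i] = "methylated" if read[i] == "C" else "unmethylated"
    return calls

ref = "ACGTCCGA"
read = "ATGTCTGA"  # bisulfite-converted read: some Cs became Ts
print(call_methylation(read, ref))
```

This also shows where the "loss of information" mentioned in the abstract comes from: after conversion, three letters effectively carry the information of four, which enlarges the search space for the aligner.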


Subjects
DNA Sequence Analysis/methods, Sulfites/chemistry, Animals, Cytosine, DNA Methylation, Humans
10.
BMC Bioinformatics ; 15 Suppl 1: S10, 2014.
Article in English | MEDLINE | ID: mdl-24564714

ABSTRACT

BACKGROUND: Single Nucleotide Polymorphism (SNP) genotyping analysis is very susceptible to errors in SNP chromosomal positions. SNP mapping data are provided along with SNP arrays, but without the information needed to assess their accuracy in advance. Moreover, these mapping data refer to a given build of a genome and need to be updated when a new build is available. As a consequence, researchers often plan to remap SNPs with the aim of obtaining more up-to-date SNP chromosomal positions. In this work, we present G-SNPM, a GPU (Graphics Processing Unit) based tool to map SNPs on a genome. METHODS: G-SNPM maps a short sequence representative of a SNP against a reference DNA sequence in order to find the physical position of the SNP in that sequence. In G-SNPM, each SNP is mapped on its related chromosome by means of an automatic three-stage pipeline. In the first stage, G-SNPM uses the GPU-based short-read mapping tool SOAP3-dp to align, in parallel, the sequences representative of each SNP on its related reference chromosome. In the second stage, G-SNPM uses another short-read mapping tool to remap the sequences left unaligned or ambiguously aligned by SOAP3-dp (in this stage SHRiMP2 is used, which exploits specialized vector computing hardware to speed up the Smith-Waterman dynamic programming algorithm). In the last stage, G-SNPM analyzes the alignments obtained by SOAP3-dp and SHRiMP2 to identify the absolute position of each SNP. RESULTS AND CONCLUSIONS: To assess G-SNPM, we used it to remap the SNPs of some commercial chips. Experimental results showed that G-SNPM was able to remap almost all SNPs without ambiguity. Based on modern GPUs, G-SNPM provides fast mappings without worsening the accuracy of the results. G-SNPM can be used for specialized Genome Wide Association Studies (GWAS), as well as for annotation tasks that require updating SNP mapping probes.
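The core mapping task, finding the physical position of a SNP from a short sequence representative of it, can be illustrated with a toy exact-search sketch in Python (the actual pipeline uses SOAP3-dp and SHRiMP2 on GPUs; the flank/allele encoding and sequences below are invented for illustration):

```python
# Toy SNP remapping: the SNP is represented by its left and right
# flanking sequences; the SNP base itself may be any allele, so only
# the flanks are matched. A unique hit yields the position; multiple
# hits are reported as ambiguous, mirroring the ambiguity handling
# described in the abstract.

def map_snp(flank_left, flank_right, chromosome):
    """Return the 0-based position of the SNP base, or None if ambiguous/unmapped."""
    positions = []
    pattern_len = len(flank_left) + 1 + len(flank_right)
    for i in range(len(chromosome) - pattern_len + 1):
        window = chromosome[i:i + pattern_len]
        if (window.startswith(flank_left)
                and window.endswith(flank_right)):  # any allele in the middle
            positions.append(i + len(flank_left))
    return positions[0] if len(positions) == 1 else None

chromosome = "TTACGGATCCGGATTA"
print(map_snp("CGGAT", "CGGAT", chromosome))
```

Real chromosomes and probe sets are far too large for this quadratic scan, which is why the pipeline delegates the search to indexed, hardware-accelerated short-read mappers.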


Subjects
Chromosomes, Single Nucleotide Polymorphism, Algorithms, Base Sequence, Chromosome Mapping/methods, Human Genome, Genome-Wide Association Study, Genotype, Humans, Molecular Sequence Data, Sequence Alignment, Software
11.
Adv Bioinformatics ; 2012: 573846, 2012.
Article in English | MEDLINE | ID: mdl-22778730

ABSTRACT

The world has changed widely in terms of communicating, acquiring, and storing information. Hundreds of millions of people are involved in information retrieval tasks on a daily basis, in particular while using a Web search engine or searching their e-mail, making information retrieval the dominant form of information access, overtaking traditional database-style searching. How to handle this huge amount of information has become a challenging issue. In this paper, after recalling the main topics concerning information retrieval, we present a survey of the main works on literature retrieval and mining in bioinformatics. Arguing that information retrieval approaches are useful in bioinformatics tasks, we discuss some challenges aimed at showing the effectiveness of these approaches when applied therein.

12.
Article in English | MEDLINE | ID: mdl-22201070

ABSTRACT

Predicting the secondary structure of proteins is still a typical step in several bioinformatic tasks, in particular for tertiary structure prediction. Notwithstanding the impressive results obtained so far, mostly due to the advent of sequence encoding schemes based on multiple alignment, in our view the problem should be studied from a novel perspective, in which understanding how available information sources are dealt with plays a central role. After revisiting a well-known secondary structure predictor from this perspective (with the goal of identifying which sources of information have been considered and which have not), we propose a generic software architecture designed to account for all relevant information sources. To demonstrate the validity of the approach, a predictor compliant with the proposed generic architecture has been implemented and compared with several state-of-the-art secondary structure predictors. Experiments have been carried out on standard data sets, and the corresponding results confirm the validity of the approach. The predictor is available at http://iasc.diee.unica.it/ssp2/, through the corresponding web application or as a downloadable, stand-alone, unpack-and-run bundle.


Subjects
Algorithms, Protein Secondary Structure, Proteins/chemistry, Protein Databases, Proteins/metabolism, Protein Sequence Analysis
13.
Adv Bioinformatics ; 2011: 457578, 2011.
Article in English | MEDLINE | ID: mdl-21941539

ABSTRACT

Computational design of novel proteins with well-defined functions is an ongoing topic in computational biology. In this work, we generated and optimized a new synthetic fusion protein using an evolutionary approach. The optimization was guided by directed evolution based on hydrophobicity scores, molecular weight, and secondary structure predictions. Several methods were used to refine the models built from the resulting sequences. We successfully combined two unrelated naturally occurring binding sites, the immunoglobulin Fc-binding site of the Z domain and the DNA-binding motif of MyoD bHLH, into a novel stable protein.

14.
BMC Res Notes ; 2: 202, 2009 Oct 02.
Article in English | MEDLINE | ID: mdl-19799773

ABSTRACT

BACKGROUND: The huge difference between the number of known sequences and known tertiary structures has justified the use of automated methods for protein analysis. Although a general methodology to solve these problems has not yet been devised, researchers are engaged in developing more accurate techniques and algorithms, whose training plays a relevant role in determining their performance. From this perspective, particular importance is given to the training data used in experiments, and researchers are often engaged in the generation of specialized datasets that meet their requirements. FINDINGS: To facilitate the task of generating specialized datasets, we devised and implemented ProDaMa, an open source Python library that provides classes for retrieving, organizing, updating, analyzing, and filtering protein data. CONCLUSION: ProDaMa has been used to generate specialized datasets useful for secondary structure prediction and to develop a collaborative web application aimed at generating and sharing protein structure datasets. The library, the related database, and the documentation are freely available at the URL http://iasc.diee.unica.it/prodama.

15.
IEEE Trans Nanobioscience ; 6(2): 104-9, 2007 Jun.
Article in English | MEDLINE | ID: mdl-17695743

ABSTRACT

Due to the enormous amount of information available on the Internet, extracting and classifying it has become one of the most important tasks. This also holds when searching for scientific publications. This paper describes a system able to retrieve scientific publications from the Web through a text categorization process. To this end, a generic multiagent architecture has been customized according to the requirements imposed by the specific task. Experiments have been performed on publications extracted from the BMC Bioinformatics and PubMed digital archives.


Subjects
Artificial Intelligence, Computational Biology/methods, Information Storage and Retrieval/methods, Internet, Natural Language Processing, Periodicals as Topic, PubMed, Abstracting and Indexing/methods, Database Management Systems
16.
Brief Bioinform ; 8(1): 45-59, 2007 Jan.
Article in English | MEDLINE | ID: mdl-16772270

ABSTRACT

The adoption of agent technologies and multi-agent systems constitutes an emerging area in bioinformatics. In this article, we report on the activity of the Working Group on Agents in Bioinformatics (BIOAGENTS), founded during the first AgentLink III Technical Forum meeting on the 2nd of July, 2004, in Rome. The meeting provided an opportunity for seeding collaborations between the agent and bioinformatics communities to develop a different (agent-based) approach to computational frameworks, both for data analysis and management in bioinformatics and for systems modelling and simulation in computational and systems biology. The collaborations gave rise to applications and integrated tools that we summarize and discuss in the context of the state of the art in this area. We investigate future challenges and argue that the field should still be explored from many perspectives, ranging from bio-conceptual languages for agent-based simulation, to the definition of bio-ontology-based declarative languages to be used by information agents, to the adoption of agents for computational grids.


Subjects
Artificial Intelligence, Computational Biology/methods, Software/trends, Systems Biology/trends, Inborn Genetic Diseases/genetics, Humans, Information Management, Biological Models, Protein Secondary Structure, Semantics, Stem Cells/physiology
17.
BMC Bioinformatics ; 6 Suppl 4: S3, 2005 Dec 01.
Article in English | MEDLINE | ID: mdl-16351752

ABSTRACT

BACKGROUND: Due to the strict relation between protein function and structure, the prediction of protein 3D structure has become one of the most important tasks in bioinformatics and proteomics. In fact, notwithstanding the increase of experimental data on protein structures available in public databases, the gap between known sequences and known tertiary structures is constantly increasing. The need for automatic methods has brought the development of several prediction and modelling tools, but a general methodology able to solve the problem has not yet been devised, and most methodologies concentrate on the simplified task of predicting secondary structure. RESULTS: In this paper we concentrate on the problem of predicting secondary structures by adopting a technology based on multiple experts. The system performs an overall processing based on two main steps: first, a "sequence-to-structure" prediction is enforced by resorting to a population of hybrid (genetic-neural) experts, and then a "structure-to-structure" prediction is performed by resorting to an artificial neural network. Experiments, performed on sequences taken from well-known protein databases, achieved an accuracy of about 76%, comparable to that of state-of-the-art predictors. CONCLUSION: The adoption of a hybrid technique, which encompasses genetic and neural technologies, has proven to be a promising approach to protein secondary structure prediction.


Subjects
Computational Biology/methods, Protein Secondary Structure, Proteins/chemistry, Algorithms, Computer Simulation, Protein Databases, Chemical Models, Molecular Models, Molecular Sequence Data, Neural Networks (Computer), Protein Conformation, Protein Folding, Protein Tertiary Structure, Sequence Alignment, Protein Sequence Analysis, Software
18.
IEEE Trans Nanobioscience ; 4(3): 207-11, 2005 Sep.
Article in English | MEDLINE | ID: mdl-16220683

ABSTRACT

It is well known that protein secondary-structure information can help the process of performing multiple alignment, in particular when the amount of similarity among the involved sequences moves toward the "twilight zone" (less than 30% of pairwise similarity). In this paper, a multiple alignment algorithm is presented, explicitly designed for exploiting any available secondary-structure information. A layered architecture with two interacting levels has been defined for dealing with both primary- and secondary-structure information of target sequences. Secondary structure (either available or predicted by resorting to a technique based on multiple experts) is used to calculate an initial alignment at the secondary level, to be arranged by locally scoped operators devised to refine the alignment at the primary level. Aimed at evaluating the impact of secondary information on the quality of alignments, in particular alignments with a low degree of similarity, the technique has been implemented and assessed on relevant test cases.


Subjects
Algorithms, Proteins/analysis, Proteins/chemistry, Protein Sequence Analysis/methods, Amino Acid Sequence, Molecular Sequence Data, Protein Conformation, Protein Secondary Structure, Sequence Alignment/methods, Amino Acid Sequence Homology