Results 1 - 9 of 9
1.
Brief Bioinform ; 25(2)2024 Jan 22.
Article in English | MEDLINE | ID: mdl-38349060

ABSTRACT

The recent development of deep learning methods has undoubtedly led to great improvements in various machine learning tasks, especially in prediction tasks. Methods of this type have also been adapted to answer various problems in bioinformatics, including automatic genome annotation, artificial genome generation and phenotype prediction. In particular, a specific type of deep learning method, the graph neural network (GNN), has repeatedly been reported as a good candidate for predicting phenotypes from gene expression because of its ability to embed information on gene regulation or co-expression through the use of a gene network. However, to date, no complete and reproducible benchmark has been performed to analyze the trade-off between the cost and benefit of this approach compared to more standard (and simpler) machine learning methods. In this article, we provide such a benchmark, based on clear and comparable policies to evaluate the different methods on several datasets. Our conclusion is that GNN rarely provides a real improvement in prediction performance, especially when compared to the computational effort required by the methods. Our findings on a limited but controlled simulated dataset show that this could be explained by the limited quality or predictive power of the input biological gene network itself.


Subjects
Gene Expression Profiling , Transcriptome , Benchmarking , Computational Biology , Neural Networks, Computer
2.
Bioinformatics ; 38(17): 4127-4134, 2022 09 02.
Article in English | MEDLINE | ID: mdl-35792837

ABSTRACT

MOTIVATION: Inferring gene regulatory networks in non-independent, genetically related panels is a methodological challenge. This hampers evolutionary and biological studies using heterozygous individuals, such as in wild sunflower populations or cultivated hybrids. RESULTS: First, we simulated 100 datasets of gene expressions and polymorphisms, displaying the same gene expression distributions, heterozygosities and heritabilities as in our dataset, which includes 173 genes and 353 genotypes measured in sunflower hybrids. Secondly, we performed a meta-analysis based on six inference methods [least absolute shrinkage and selection operator (Lasso), Random Forests, Bayesian Networks, Markov Random Fields, Ordinary Least Squares and fast inference of networks from directed regulation (Findr)] and selected the minimal density networks for better accuracy, with 64 edges connecting 79 genes and an area under the precision-recall curve (AUPR) score of 0.35 on average. We identified that triangles and mutual edges are prone to errors in the inferred networks. Applied to classical datasets without heterozygotes, our strategy produced a 0.65 AUPR score for one dataset of the DREAM5 Systems Genetics Challenge. Finally, we applied our method to an experimental dataset from sunflower hybrids. We successfully inferred a network composed of 105 genes connected by 106 putative regulations with a major connected component. AVAILABILITY AND IMPLEMENTATION: Our inference methodology dedicated to genomic and transcriptomic data is available at https://forgemia.inra.fr/sunrise/inference_methods. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
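
The AUPR scores quoted in this abstract summarize how well a ranked list of candidate edges recovers the true edges. The sketch below is only an illustration and not the authors' evaluation code: it uses average precision, one common estimator of the area under the precision-recall curve, and the function name, edge scores and labels are all hypothetical.

```python
def average_precision(scores, labels):
    """Average precision of a ranking: mean precision at each true edge."""
    # Rank candidate edges from the highest to the lowest confidence score.
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    hits = 0
    total = 0.0
    for rank, i in enumerate(order, start=1):
        if labels[i]:
            hits += 1
            total += hits / rank  # precision at this recall level
    return total / hits if hits else 0.0

# Hypothetical confidence scores for four candidate edges and their true status.
edge_scores = [0.9, 0.8, 0.7, 0.6]
true_edges = [1, 0, 1, 0]

aupr_estimate = average_precision(edge_scores, true_edges)
print(round(aupr_estimate, 3))
```

Other estimators (for example trapezoidal integration of the curve) can give slightly different values, so scores are only comparable when computed the same way.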


Subjects
Gene Regulatory Networks , Transcriptome , Humans , Heterozygote , Bayes Theorem , Genomics , Algorithms
3.
Bioinformatics ; 34(17): i875-i883, 2018 09 01.
Article in English | MEDLINE | ID: mdl-30423079

ABSTRACT

Motivation: Liquid Chromatography (LC) followed by tandem Mass Spectrometry (MS/MS) is one of the predominant methods for metabolite identification. In recent years, machine learning has started to transform the analysis of tandem mass spectra and the identification of small molecules. In contrast, LC data is rarely used to improve metabolite identification, despite numerous published methods for retention time prediction using machine learning. Results: We present a machine learning method for predicting the retention order of molecules; that is, the order in which molecules elute from the LC column. Our method has important advantages over previous approaches: we show that retention order is much better conserved between instruments than retention time. As a result, our method can be trained using retention time measurements from different LC systems and configurations without tedious pre-processing, significantly increasing the amount of available training data. Our experiments demonstrate that retention order prediction is an effective way to learn the retention behaviour of molecules from heterogeneous retention time data. Finally, we demonstrate how retention order prediction and MS/MS-based scores can be combined for more accurate metabolite identifications when analyzing a complete LC-MS/MS run. Availability and implementation: Implementation of the method is available at https://version.aalto.fi/gitlab/bache1/retention_order_prediction.git.
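
To make the claim that retention order is better conserved than retention time concrete, the following sketch counts how many pairs of molecules elute in the same order on two LC systems. It is a minimal illustration under stated assumptions, not the published method: the function name and the retention times are invented for the example.

```python
from itertools import combinations

def pairwise_order_agreement(rt_a, rt_b):
    """Fraction of molecule pairs that elute in the same order on both systems."""
    pairs = list(combinations(range(len(rt_a)), 2))
    same = sum(
        1
        for i, j in pairs
        # The product is positive when both systems order the pair identically.
        if (rt_a[i] - rt_a[j]) * (rt_b[i] - rt_b[j]) > 0
    )
    return same / len(pairs)

# Hypothetical retention times (minutes) for five molecules on two LC systems.
system_a = [1.2, 3.4, 5.0, 7.9, 9.1]
system_b = [0.8, 2.0, 2.9, 6.5, 6.9]  # different time scale, same elution order

print(pairwise_order_agreement(system_a, system_b))
```

In this toy case the absolute times differ on every molecule, yet every pair keeps its order, which is the property that lets heterogeneous measurements be pooled for training.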


Subjects
Chromatography, Liquid/methods , Tandem Mass Spectrometry/methods
4.
Bioinformatics ; 32(12): i28-i36, 2016 06 15.
Article in English | MEDLINE | ID: mdl-27307628

ABSTRACT

MOTIVATION: An important problem in metabolomics is to identify metabolites using tandem mass spectrometry data. Machine learning methods have been proposed recently to solve this problem by predicting molecular fingerprint vectors and matching these fingerprints against existing molecular structure databases. In this work we propose to address the metabolite identification problem using a structured output prediction approach. This type of approach is not limited to a vector output space and can handle a structured output space such as the molecule space. RESULTS: We use the Input Output Kernel Regression method to learn the mapping between tandem mass spectra and molecular structures. The principle of this method is to encode the similarities in the input (spectra) space and the similarities in the output (molecule) space using two kernel functions. This method approximates the spectra-molecule mapping in two phases. The first phase corresponds to a regression problem from the input space to the feature space associated with the output kernel. The second phase is a preimage problem, which consists of mapping the predicted output feature vectors back to the molecule space. We show that our approach achieves state-of-the-art accuracy in metabolite identification. Moreover, our method has the advantage of decreasing the running times for the training step and the test step by several orders of magnitude over the preceding methods. CONTACT: celine.brouard@aalto.fi SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
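
The two phases described in this abstract can be sketched in a few lines. This is a simplified illustration and not the authors' implementation: it assumes explicit output feature vectors (so the output kernel is linear), uses kernel ridge regression for the first phase, and solves the preimage by a nearest-candidate search. All function names, the regularization value `lam` and the toy data are hypothetical.

```python
import numpy as np

def fit_iokr(K_train, Psi_train, lam=1e-3):
    """Phase 1: kernel ridge regression into the output feature space."""
    n = K_train.shape[0]
    # Solve (K + lam * I) A = Psi for the coefficient matrix A (n x d).
    return np.linalg.solve(K_train + lam * np.eye(n), Psi_train)

def predict_output_features(A, k_x):
    """Predicted output feature vector h(x) = A^T k_x for one new input."""
    return A.T @ k_x

def solve_preimage(h, Psi_candidates):
    """Phase 2: index of the candidate whose features are closest to h."""
    distances = np.linalg.norm(Psi_candidates - h, axis=1)
    return int(np.argmin(distances))

# Hypothetical toy data: three training "spectra" with a linear input kernel,
# and 2-dimensional explicit output features for the matching "molecules".
X_train = np.eye(3)
K_train = X_train @ X_train.T
Psi_train = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])

A = fit_iokr(K_train, Psi_train)
k_x = X_train @ X_train[2]           # kernel values between training set and query
h = predict_output_features(A, k_x)
best = solve_preimage(h, Psi_train)  # candidates here are the training molecules
print(best)
```

Fitting reduces to a single linear solve and prediction to a matrix-vector product followed by a search over candidates; with a general output kernel the distance would be evaluated through kernel values rather than explicit features.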


Subjects
Computational Biology/methods , Machine Learning , Metabolomics , Molecular Structure , Tandem Mass Spectrometry , Algorithms , Databases, Chemical
5.
BMC Bioinformatics ; 14: 273, 2013 Sep 12.
Article in English | MEDLINE | ID: mdl-24028533

ABSTRACT

BACKGROUND: Gene regulatory network inference remains a challenging problem in systems biology despite the numerous approaches that have been proposed. When substantial knowledge on a gene regulatory network is already available, supervised network inference is appropriate. Such a method builds a binary classifier able to assign a class (Regulation/No regulation) to an ordered pair of genes. Once learnt, the pairwise classifier can be used to predict new regulations. In this work, we explore the framework of Markov Logic Networks (MLN) that combine features of probabilistic graphical models with the expressivity of first-order logic rules. RESULTS: We propose to learn a Markov Logic Network, i.e. a set of weighted rules that conclude on the predicate "regulates", starting from a known gene regulatory network involved in the switch proliferation/differentiation of keratinocyte cells, a set of experimental transcriptomic data and various descriptions of genes all encoded into first-order logic. As training data are unbalanced, we use asymmetric bagging to learn a set of MLNs. The prediction of a new regulation can then be obtained by averaging predictions of individual MLNs. As a side contribution, we propose three in silico tests to assess the performance of any pairwise classifier in various network inference tasks on real datasets. A first test consists of measuring the average performance on a balanced edge prediction problem; a second one deals with the ability of the classifier, once enhanced by asymmetric bagging, to update a given network. Finally, our main result concerns a third test that measures the ability of the method to predict regulations with a new set of genes. As expected, MLN, when provided with only numerical discretized gene expression data, does not perform as well as a pairwise SVM in terms of AUPR. However, when a more complete description of gene properties is provided by heterogeneous sources, MLN achieves the same performance as a black-box model such as a pairwise SVM while providing relevant insights on the predictions. CONCLUSIONS: The numerical studies show that MLN achieves very good predictive performance while opening the door to some interpretability of the decisions. Besides the ability to suggest new regulations, such an approach allows experimental data to be cross-validated against existing knowledge.
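
The asymmetric bagging step mentioned in this abstract can be illustrated as follows. This is a hedged sketch of one common form of the technique (every bag keeps all minority-class examples and draws an equally sized random subset of the majority class), not the authors' procedure. A nearest-centroid scorer stands in for the Markov Logic Network, and all names and data are hypothetical.

```python
import numpy as np

def fit_centroid_scorer(X, y):
    """Stand-in base learner: scores are positive near the positive centroid."""
    mu_pos = X[y == 1].mean(axis=0)
    mu_neg = X[y == 0].mean(axis=0)

    def score(Z):
        d_neg = np.linalg.norm(Z - mu_neg, axis=1)
        d_pos = np.linalg.norm(Z - mu_pos, axis=1)
        return d_neg - d_pos

    return score

def asymmetric_bagging(X, y, n_bags=10, seed=0):
    """Train one model per balanced bag and average their predictions."""
    rng = np.random.default_rng(seed)
    pos = np.flatnonzero(y == 1)  # minority class, kept whole in every bag
    neg = np.flatnonzero(y == 0)  # majority class, subsampled in every bag
    models = []
    for _ in range(n_bags):
        neg_sample = rng.choice(neg, size=len(pos), replace=False)
        idx = np.concatenate([pos, neg_sample])
        models.append(fit_centroid_scorer(X[idx], y[idx]))
    return lambda Z: np.mean([m(Z) for m in models], axis=0)

# Hypothetical unbalanced data: 3 positive pairs, 12 negative pairs.
X_pos = np.array([[2.0, 2.0], [2.2, 1.8], [1.8, 2.1]])
X_neg = np.array([[0.1 * i, -0.1 * i] for i in range(-6, 6)])
X = np.vstack([X_pos, X_neg])
y = np.array([1] * 3 + [0] * 12)

predict = asymmetric_bagging(X, y)
scores = predict(np.array([[2.0, 2.0], [0.0, 0.0]]))
print(scores)
```

Because each bag is balanced, no single model is dominated by the "No regulation" class, and averaging over bags reduces the variance introduced by subsampling.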


Subjects
Gene Regulatory Networks , Logic , Markov Chains , Systems Biology/methods , Computer Simulation , Databases, Genetic , Humans , Models, Statistical , ROC Curve , Support Vector Machine
6.
NAR Genom Bioinform ; 4(1): lqac014, 2022 Mar.
Article in English | MEDLINE | ID: mdl-35265835

ABSTRACT

The substantial development of high-throughput biotechnologies has rendered large-scale multi-omics datasets increasingly available. New challenges have emerged to process and integrate this large volume of information, often obtained from widely heterogeneous sources. Kernel methods have proven successful in handling the analysis of different types of datasets obtained on the same individuals. However, they usually suffer from a lack of interpretability since the original description of the individuals is lost due to the kernel embedding. We propose novel feature selection methods that are adapted to the kernel framework and go beyond the well-established work in supervised learning by addressing the more difficult tasks of unsupervised learning and kernel output learning. The method is expressed in the form of a non-convex optimization problem with an ℓ1 penalty, which is solved with a proximal gradient descent approach. It is tested on several systems biology datasets and shows good performance in selecting relevant and less redundant features compared to existing alternatives. It also proved relevant for identifying the important governmental measures that best explain the time series of the Covid-19 reproduction number during the first months of 2020. The proposed feature selection method is embedded in the R package mixKernel version 0.8, published on CRAN. Installation instructions are available at http://mixkernel.clementine.wf/.
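
The core mechanism of proximal gradient descent with an ℓ1 penalty is a gradient step followed by soft-thresholding, which sets small coefficients exactly to zero and so performs feature selection. The sketch below is not the mixKernel algorithm: it swaps in the simpler convex lasso objective, 0.5 * ||Xw - y||^2 + lam * ||w||_1, purely to show the update rule, and all names and data are assumptions made for the example.

```python
import numpy as np

def soft_threshold(v, t):
    """Proximal operator of t * ||.||_1: shrink each entry toward zero by t."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def proximal_gradient_lasso(X, y, lam, n_iter=500):
    """Minimize 0.5 * ||X w - y||^2 + lam * ||w||_1 by proximal gradient steps."""
    # Step size 1/L, where L (squared spectral norm of X) bounds the curvature.
    step = 1.0 / np.linalg.norm(X, 2) ** 2
    w = np.zeros(X.shape[1])
    for _ in range(n_iter):
        grad = X.T @ (X @ w - y)  # gradient of the smooth least-squares term
        w = soft_threshold(w - step * grad, step * lam)
    return w

# Hypothetical toy problem with an identity design, where the solution is
# simply the soft-thresholded response.
X = np.eye(3)
y = np.array([3.0, -0.5, 1.0])
w = proximal_gradient_lasso(X, y, lam=1.0)
print(w)
```

Features whose coefficient ends at exactly zero are the ones discarded. In the non-convex setting described in the abstract the same update applies, but it is only guaranteed to reach a stationary point rather than a global minimum.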

7.
Metabolites ; 9(8)2019 Aug 01.
Article in English | MEDLINE | ID: mdl-31374904

ABSTRACT

In small molecule identification from tandem mass (MS/MS) spectra, input-output kernel regression (IOKR) currently provides the state-of-the-art combination of fast training and prediction and high identification rates. The IOKR approach can be simply understood as predicting a fingerprint vector from the MS/MS spectrum of the unknown molecule, and solving a pre-image problem to find the molecule with the most similar fingerprint. In this paper, we bring forward the following improvements to the IOKR framework: firstly, we formulate the IOKRreverse model, which can be understood as mapping molecular structures into the MS/MS feature space and solving a pre-image problem to find the molecule whose predicted spectrum is the closest to the input MS/MS spectrum. Secondly, we introduce an approach to combine several IOKR and IOKRreverse models computed from different input and output kernels, called IOKRfusion. The method is based on minimizing the structured hinge loss of the combined model using mini-batch stochastic subgradient optimization. Our experiments show a consistent improvement of top-k accuracy in both positive and negative ionization mode data.

8.
J Cheminform ; 9(1): 22, 2017 Mar 27.
Article in English | MEDLINE | ID: mdl-29086042

ABSTRACT

BACKGROUND: The fourth round of the Critical Assessment of Small Molecule Identification (CASMI) Contest (www.casmi-contest.org) was held in 2016, with two new categories for automated methods. This article covers the 208 challenges in Categories 2 and 3, without and with metadata, from organization, participation, results and post-contest evaluation of CASMI 2016 through to perspectives for future contests and small molecule annotation/identification. RESULTS: The Input Output Kernel Regression (CSI:IOKR) machine learning approach performed best in "Category 2: Best Automatic Structural Identification-In Silico Fragmentation Only", won by Team Brouard with 41% challenge wins. The winner of "Category 3: Best Automatic Structural Identification-Full Information" was Team Kind (MS-FINDER), with 76% challenge wins. The best methods were able to achieve over 30% Top 1 ranks in Category 2, with all methods ranking the correct candidate in the Top 10 in around 50% of challenges. This success rate rose to 70% Top 1 ranks in Category 3, with candidates in the Top 10 in over 80% of the challenges. The machine learning and chemistry-based approaches are shown to perform in complementary ways. CONCLUSIONS: The improvement in (semi-)automated fragmentation methods for small molecule identification has been substantial. The achieved high rates of correct candidates in the Top 1 and Top 10, despite large candidate numbers, open up great possibilities for high-throughput annotation of untargeted analysis for "known unknowns". As more high-quality training data becomes available, the improvements in machine learning methods will likely continue, but the alternative approaches still provide valuable complementary information. Improved integration of experimental context will also improve identification success further for "real life" annotations. The true "unknown unknowns" remain to be evaluated in future CASMI contests.

9.
PLoS One ; 11(7): e0159302, 2016.
Article in English | MEDLINE | ID: mdl-27441920

ABSTRACT

In this paper we apply machine learning methods for predicting protein interactions in fungal secretion pathways. We assume an inter-species transfer setting, where training data is obtained from a single species and the objective is to predict protein interactions in other, related species. In our methodology, we combine several state-of-the-art machine learning approaches, namely, multiple kernel learning (MKL), pairwise kernels and kernelized structured output prediction in the supervised graph inference framework. For MKL, we apply recently proposed centered kernel alignment and p-norm path following approaches to integrate several feature sets describing the proteins, demonstrating improved performance. For graph inference, we apply input-output kernel regression (IOKR) in supervised and semi-supervised modes as well as output kernel trees (OK3). In our experiments simulating increasing genetic distance, Input-Output Kernel Regression proved to be the most robust prediction approach. We also show that the MKL approaches improve the predictions compared to a uniform combination of the kernels. We evaluate the methods on the task of predicting protein-protein interactions in the secretion pathways of fungi, with S. cerevisiae (baker's yeast) as the source and T. reesei as the target of the inter-species transfer learning. We identify completely novel candidate secretion proteins conserved in filamentous fungi. These proteins could contribute to their unique secretion capabilities.
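
Centered kernel alignment, mentioned above as one of the MKL ingredients, measures how similar two kernel matrices are after centering them in feature space. The sketch below computes only the alignment score, using the standard definition (the Frobenius inner product of the centered kernels divided by the product of their Frobenius norms). How the paper turns such scores into kernel weights is not stated in the abstract and is not reproduced here; function names and data are hypothetical.

```python
import numpy as np

def center_kernel(K):
    """Center a kernel matrix in feature space: Kc = H K H, H = I - (1/n) 11^T."""
    n = K.shape[0]
    H = np.eye(n) - np.ones((n, n)) / n
    return H @ K @ H

def centered_alignment(K1, K2):
    """Cosine similarity between two centered kernel matrices."""
    K1c = center_kernel(K1)
    K2c = center_kernel(K2)
    # np.linalg.norm on a matrix defaults to the Frobenius norm.
    return np.sum(K1c * K2c) / (np.linalg.norm(K1c) * np.linalg.norm(K2c))

# Hypothetical one-dimensional features for three proteins and a linear kernel.
features = np.array([[0.0], [1.0], [3.0]])
K = features @ features.T

# Rescaling a kernel or adding a constant offset leaves the alignment at 1.
print(centered_alignment(K, 2.0 * K + 5.0))
```

A kernel that aligns well with a target kernel built from the labels would typically receive a larger weight in the combination, which is what distinguishes alignment-based MKL from a uniform combination.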


Subjects
Fungal Proteins/metabolism , Machine Learning , Protein Interaction Mapping , Saccharomyces cerevisiae/metabolism , Secretory Pathway , Trichoderma/metabolism , Algorithms , Amino Acid Sequence , Databases, Protein , Evolution, Molecular , Fungal Proteins/chemistry , Genome, Fungal , Protein Interaction Maps , ROC Curve , Saccharomyces cerevisiae/genetics