Results 1 - 10 of 10
1.
Entropy (Basel) ; 26(7), 2024 Jul 11.
Article in English | MEDLINE | ID: mdl-39056955

ABSTRACT

We introduce NodeFlow, a flexible framework for probabilistic regression on tabular data that combines Neural Oblivious Decision Ensembles (NODEs) and Conditional Continuous Normalizing Flows (CNFs). It offers improved modeling capabilities for arbitrary probabilistic distributions, addressing the limitations of traditional parametric approaches. In NodeFlow, the NODE captures complex relationships in tabular data through a tree-like structure, while the conditional CNF utilizes the NODE's output space as a conditioning factor. The training process of NodeFlow employs standard gradient-based learning, facilitating the end-to-end optimization of the NODEs and CNF-based density estimation. This approach ensures outstanding performance, ease of implementation, and scalability, making NodeFlow an appealing choice for practitioners and researchers. Comprehensive assessments on benchmark datasets underscore NodeFlow's efficacy, revealing its achievement of state-of-the-art outcomes in the multivariate probabilistic regression setup and its strong performance in univariate regression tasks. Furthermore, ablation studies are conducted to justify the design choices of NodeFlow. In conclusion, NodeFlow's end-to-end training process and strong performance make it a compelling solution for practitioners and researchers. Additionally, it opens new avenues for research and application in the field of probabilistic regression on tabular data.
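The "tree-like structure" mentioned above can be pictured with a minimal sketch of a single oblivious decision tree, in which every node at a given depth applies the same test. This is a simplification under stated assumptions: it uses hard thresholds, whereas gradient-based training as described in the abstract needs differentiable splits, and the conditional CNF is not shown. The function names, features, thresholds, and leaf values are hypothetical.

```python
from typing import Sequence


def oblivious_leaf_index(
    x: Sequence[float], features: Sequence[int], thresholds: Sequence[float]
) -> int:
    """Return the leaf reached by x in an oblivious tree.

    All nodes at depth d share one (feature, threshold) test, so the leaf
    index is the binary number formed by the comparison outcomes.
    """
    index = 0
    for feature, threshold in zip(features, thresholds):
        bit = 1 if x[feature] > threshold else 0
        index = (index << 1) | bit
    return index


def oblivious_tree_predict(
    x: Sequence[float],
    features: Sequence[int],
    thresholds: Sequence[float],
    leaf_values: Sequence[float],
) -> float:
    """Look up the response stored in the leaf that x falls into."""
    return leaf_values[oblivious_leaf_index(x, features, thresholds)]


# Hypothetical depth-2 tree: test feature 0 against 0.5, then feature 1 against 2.0.
FEATURES = [0, 1]
THRESHOLDS = [0.5, 2.0]
LEAF_VALUES = [10.0, 20.0, 30.0, 40.0]  # one value per leaf (2 ** depth leaves)
```

A depth-d oblivious tree therefore has exactly 2^d leaves and needs only d tests per prediction.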

2.
BMC Bioinformatics ; 21(1): 49, 2020 Feb 07.
Article in English | MEDLINE | ID: mdl-32033537

ABSTRACT

BACKGROUND: Computational prediction of drug-target interactions (DTI) is vital for drug discovery. The experimental identification of interactions between drugs and target proteins is very onerous. Modern technologies have mitigated the problem, leveraging the development of new drugs. However, drug development remains extremely expensive and time consuming. Therefore, in silico DTI predictions based on machine learning can alleviate the burdensome task of drug development. Many machine learning approaches have been proposed over the years for DTI prediction. Nevertheless, prediction accuracy and efficiency are persisting problems that still need to be tackled. Here, we propose a new learning method which addresses DTI prediction as a multi-output prediction task by learning ensembles of multi-output bi-clustering trees (eBICT) on reconstructed networks. In our setting, the nodes of a DTI network (drugs and proteins) are represented by features (background information). The interactions between the nodes of a DTI network are modeled as an interaction matrix and compose the output space in our problem. The proposed approach integrates background information from both drug and target protein spaces into the same global network framework. RESULTS: We performed an empirical evaluation, comparing the proposed approach to state-of-the-art DTI prediction methods, and demonstrated the effectiveness of the proposed approach in different prediction settings. For evaluation purposes, we used several benchmark datasets that represent drug-protein networks. We show that output space reconstruction can boost the predictive performance of tree-ensemble learning methods, yielding more accurate DTI predictions. CONCLUSIONS: We proposed a new DTI prediction method where bi-clustering trees are built on reconstructed networks. Building tree-ensemble learning models with output space reconstruction leads to superior prediction results, while preserving the advantages of tree-ensembles, such as scalability, interpretability and inductive setting.


Subjects
Drug Discovery/methods, Machine Learning, Proteins/drug effects, Cluster Analysis, Computer Simulation, Drug Development
3.
BMC Bioinformatics ; 20(1): 525, 2019 Oct 28.
Article in English | MEDLINE | ID: mdl-31660848

ABSTRACT

BACKGROUND: Network inference is crucial for biomedicine and systems biology. Biological entities and their associations are often modeled as interaction networks. Examples include drug-protein interaction or gene regulatory networks. Studying and elucidating such networks can lead to the comprehension of complex biological processes. However, we usually have only partial knowledge of those networks, and the experimental identification of all the existing associations between biological entities is very time consuming and particularly expensive. Many computational approaches have been proposed over the years for network inference; nonetheless, efficiency and accuracy remain open problems. Here, we propose bi-clustering tree ensembles as a new machine learning method for network inference, extending the traditional tree-ensemble models to the global network setting. The proposed approach addresses the network inference problem as a multi-label classification task. More specifically, the nodes of a network (e.g., drugs or proteins in a drug-protein interaction network) are modelled as samples described by features (e.g., chemical structure similarities or protein sequence similarities). The labels in our setting represent the presence or absence of links connecting the nodes of the interaction network (e.g., drug-protein interactions in a drug-protein interaction network). RESULTS: We extended traditional tree-ensemble methods, such as extremely randomized trees (ERT) and random forests (RF), to ensembles of bi-clustering trees, integrating background information from both node sets of a heterogeneous network into the same learning framework. We performed an empirical evaluation, comparing the proposed approach to currently used tree-ensemble based approaches as well as other approaches from the literature. We demonstrated the effectiveness of our approach in different interaction prediction (network inference) settings. For evaluation purposes, we used several benchmark datasets that represent drug-protein and gene regulatory networks. We also applied our proposed method to two versions of a chemical-protein association network extracted from the STITCH database, demonstrating the potential of our model in predicting non-reported interactions. CONCLUSIONS: Bi-clustering trees outperform existing tree-based strategies as well as machine learning methods based on other algorithms. Since our approach is based on tree-ensembles, it inherits the advantages of tree-ensemble learning, such as handling of missing values, scalability and interpretability.
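One way to picture how a single tree node can use background information from both node sets is to let a split test either a row-node feature or a column-node feature of the interaction matrix. The sketch below is an illustration under that assumption, not the authors' implementation; the function name, the toy matrix, and the feature values are hypothetical.

```python
from typing import List, Tuple

Matrix = List[List[int]]


def split_interaction_matrix(
    interactions: Matrix,
    row_features: List[List[float]],
    col_features: List[List[float]],
    axis: str,
    feature: int,
    threshold: float,
) -> Tuple[Matrix, Matrix]:
    """Partition an interaction matrix by a test on a row or column feature.

    axis="row" sends whole rows (e.g. drugs) left or right;
    axis="col" sends whole columns (e.g. proteins) left or right.
    """
    if axis == "row":
        left = [i for i, f in enumerate(row_features) if f[feature] <= threshold]
        right = [i for i, f in enumerate(row_features) if f[feature] > threshold]
        return [interactions[i] for i in left], [interactions[i] for i in right]
    if axis == "col":
        left = [j for j, f in enumerate(col_features) if f[feature] <= threshold]
        right = [j for j, f in enumerate(col_features) if f[feature] > threshold]
        return (
            [[row[j] for j in left] for row in interactions],
            [[row[j] for j in right] for row in interactions],
        )
    raise ValueError("axis must be 'row' or 'col'")


# Hypothetical toy network: 2 drugs (rows) x 3 proteins (columns).
INTERACTIONS = [[1, 0, 1], [0, 0, 1]]
DRUG_FEATURES = [[0.2], [0.9]]
PROTEIN_FEATURES = [[1.0], [3.0], [5.0]]
```

Applying such splits recursively carves the matrix into rectangular blocks (bi-clusters), each of which can hold its own prediction.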


Subjects
Cluster Analysis, Algorithms, Factual Databases, Gene Regulatory Networks, Machine Learning, Protein Interaction Maps, Proteins/metabolism
4.
J Biomed Inform ; 85: 40-48, 2018 Sep.
Article in English | MEDLINE | ID: mdl-30012356

ABSTRACT

The volume of biomedical data available to the machine learning community grows very rapidly. A natural question is how informative these data really are, or how discriminant the features describing the data instances are. Several biomedical datasets suffer from a lack of variance in the instance representation or, even worse, contain instances with identical features and different class labels. Indisputably, this directly affects the performance of machine learning algorithms, as well as the ability to interpret their results. In this article, we draw attention to the aforementioned problem and propose a target-informed feature induction method based on tree ensemble learning. The method brings more variance into the data representation, thereby potentially increasing the predictive performance of a learner applied to the induced features. The contribution of this article is twofold. Firstly, a problem affecting the quality of biomedical data is highlighted, and secondly, a method to handle that problem is proposed. The efficiency of the presented approach is validated on multi-target prediction tasks. The obtained results indicate that the proposed approach is able to boost the discrimination between the data instances and increase the predictive performance.
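The data-quality problem described above (instances with identical features but different class labels) can be checked with a few lines of code. The sketch below covers only that detection step, not the proposed feature induction method; the function name and toy data are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Hashable, List, Sequence, Set, Tuple


def find_label_conflicts(
    instances: Sequence[Sequence[Hashable]], labels: Sequence[Hashable]
) -> Dict[Tuple[Hashable, ...], Set[Hashable]]:
    """Return feature vectors that occur with more than one class label."""
    labels_by_features: Dict[Tuple[Hashable, ...], Set[Hashable]] = defaultdict(set)
    for features, label in zip(instances, labels):
        labels_by_features[tuple(features)].add(label)
    # Keep only the feature vectors whose label set is ambiguous.
    return {f: seen for f, seen in labels_by_features.items() if len(seen) > 1}


# Hypothetical toy data: the first two instances are indistinguishable.
X = [[1, 0], [1, 0], [0, 1]]
y = ["a", "b", "a"]
conflicts = find_label_conflicts(X, y)
```

No learner working on these features alone can classify both conflicting instances correctly, which is why adding discriminating features can help.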


Subjects
Cluster Analysis, Data Mining/methods, Decision Trees, Machine Learning, Algorithms, Computational Biology, Factual Databases/statistics & numerical data, Escherichia coli/genetics, Escherichia coli/metabolism, Gene Regulatory Networks, Humans, Metabolic Networks and Pathways, Protein Interaction Maps, Saccharomyces cerevisiae/genetics, Saccharomyces cerevisiae/metabolism
5.
Glob Chang Biol ; 23(7): 2720-2742, 2017 Jul.
Article in English | MEDLINE | ID: mdl-27976458

ABSTRACT

Grassland ecosystems play a crucial role in the global carbon cycle and provide vital ecosystem services for many species. However, these low-productivity and water-limited ecosystems are sensitive and vulnerable to climate perturbations and human intervention, the latter of which is often not considered due to a lack of spatial information regarding grassland management. Here, by applying a model tree ensemble (MTE-GRASS) trained on local eddy covariance data and using gridded climate and management intensity fields (grazing and cutting) as predictors, we first provide an estimate of global grassland gross primary production (GPP). GPP from our study compares well (modeling efficiency NSE = 0.85 spatial; NSE between 0.69 and 0.94 interannual) with that from flux measurements. Global grassland GPP was on average 11 ± 0.31 Pg C yr⁻¹ and exhibited a significantly increasing trend at both annual and seasonal scales, with an annual increase of 0.023 Pg C (0.2%) from 1982 to 2011. Meanwhile, we found that at both annual and seasonal scales, the trend (except for northern summer) and interannual variability of the GPP are primarily driven by arid/semiarid ecosystems, the latter of which is due to the larger variation in precipitation. Grasslands in arid/semiarid regions have a stronger (33 g C m⁻² yr⁻¹ per 100 mm) and faster (0- to 1-month time lag) response to precipitation than those in other regions. Although, globally, spatial gradients (71%) and interannual changes (51%) in GPP were mainly driven by precipitation, particularly in regions within the arid/semiarid climate zone, temperature and radiation together accounted for half of the GPP variability, mainly in high-latitude or cold regions. Our findings and the results of other studies suggest the overwhelming importance of arid/semiarid regions as a control on the grassland ecosystem carbon cycle. Similarly, under projected future climate change, grassland ecosystems in these regions will potentially be greatly influenced.
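The modeling efficiency (NSE) quoted above is not defined in the abstract. Assuming it refers to the standard Nash-Sutcliffe efficiency, it compares squared model error against the variance of the observations, as in this minimal sketch. The function name and the sample values are illustrative, not data from the study.

```python
from typing import Sequence


def nash_sutcliffe_efficiency(
    observed: Sequence[float], simulated: Sequence[float]
) -> float:
    """NSE = 1 - sum((obs - sim)^2) / sum((obs - mean(obs))^2).

    1.0 is a perfect match; 0.0 means the model is no better than
    predicting the observed mean everywhere.
    """
    if len(observed) != len(simulated) or not observed:
        raise ValueError("inputs must be non-empty and of equal length")
    mean_obs = sum(observed) / len(observed)
    squared_error = sum((o - s) ** 2 for o, s in zip(observed, simulated))
    spread = sum((o - mean_obs) ** 2 for o in observed)
    if spread == 0:
        raise ValueError("observations have zero variance")
    return 1.0 - squared_error / spread


# Made-up values for illustration only.
example_nse = nash_sutcliffe_efficiency([1.0, 2.0, 3.0], [1.5, 2.0, 2.5])
```

Under this definition, the reported spatial NSE of 0.85 would mean that the squared error is 15% of the variance of the flux-derived GPP.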


Subjects
Carbon Cycle, Carbon Dioxide, Grassland, Trees, Climate, Climate Change, Ecosystem
6.
Comput Biol Med ; 152: 106423, 2023 Jan.
Article in English | MEDLINE | ID: mdl-36529023

ABSTRACT

With the development of new sequencing technologies, availability of genomic data has grown exponentially. Over the past decade, numerous studies have used genomic data to identify associations between genes and biological functions. While these studies have shown success in annotating genes with functions, they often assume that genes are completely annotated and fail to take into account that datasets are sparse and noisy. This work proposes a method to detect missing annotations in the context of hierarchical multi-label classification. More precisely, our method exploits the relations of functions, represented as a hierarchy, by computing probabilities based on the paths of functions in the hierarchy. By performing several experiments on a variety of rice (Oryza sativa Japonica), we showcase that the proposed method accurately detects missing annotations and yields superior results when compared to state-of-the-art methods from the literature.
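The abstract does not say how the path-based probabilities are computed. One common scheme in hierarchical multi-label classification, shown here purely as an assumption and not as the authors' method, multiplies per-node conditional probabilities along the path from the root to a function. The function names, hierarchy, and probability values are hypothetical.

```python
from typing import Dict, Optional


def path_probability(
    node: str,
    parent: Dict[str, Optional[str]],
    conditional: Dict[str, float],
) -> float:
    """Multiply P(node | its parent) along the path from node up to the root."""
    probability = 1.0
    current: Optional[str] = node
    while current is not None:
        probability *= conditional[current]
        current = parent[current]
    return probability


# Hypothetical three-level hierarchy of functions (root has no parent).
PARENT = {"root_fn": None, "child_fn": "root_fn", "leaf_fn": "child_fn"}
# Hypothetical conditional probabilities P(function | parent function).
CONDITIONAL = {"root_fn": 0.9, "child_fn": 0.5, "leaf_fn": 0.4}

leaf_probability = path_probability("leaf_fn", PARENT, CONDITIONAL)
```

With this scheme a function can never be more probable than any of its ancestors, so the scores stay consistent with the hierarchy.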


Subjects
Genomics, Gene Ontology, Molecular Sequence Annotation, Probability
7.
Top (Berl) ; 29(1): 5-33, 2021.
Article in English | MEDLINE | ID: mdl-38624654

ABSTRACT

Classification and regression trees, as well as their variants, are off-the-shelf methods in Machine Learning. In this paper, we review recent contributions within the Continuous Optimization and the Mixed-Integer Linear Optimization paradigms to develop novel formulations in this research area. We compare those in terms of the nature of the decision variables and the constraints required, as well as the optimization algorithms proposed. We illustrate how these powerful formulations enhance the flexibility of tree models, being better suited to incorporate desirable properties such as cost-sensitivity, explainability, and fairness, and to deal with complex data, such as functional data.

8.
Materials (Basel) ; 14(15), 2021 Aug 03.
Article in English | MEDLINE | ID: mdl-34361540

ABSTRACT

This paper gives a comprehensive overview of the state-of-the-art machine learning methods that can be used for estimating self-compacting rubberized concrete (SCRC) compressive strength, including multilayered perceptron artificial neural network (MLP-ANN), ensembles of MLP-ANNs, regression tree ensembles (random forests, boosted and bagged regression trees), support vector regression (SVR) and Gaussian process regression (GPR). As a basis for the development of the forecast model, a database was obtained from an experimental study containing a total of 166 samples of SCRC. Ensembles of MLP-ANNs showed the best performance in forecasting with a mean absolute error (MAE) of 2.81 MPa and Pearson's linear correlation coefficient (R) of 0.96. The significantly simpler GPR model had almost the same accuracy criterion values as the most accurate model; furthermore, feature reduction is easy to combine with GPR using automatic relevance determination (ARD), leading to models with better performance and lower complexity.
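The two accuracy criteria reported above, mean absolute error (MAE) and Pearson's linear correlation coefficient (R), can be computed as in the following sketch. It uses the standard textbook definitions; the function names and the strength values (in MPa) are made up for illustration and are not data from the study.

```python
import math
from typing import Sequence


def mean_absolute_error(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Average absolute difference between measured and predicted values."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def pearson_r(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson's linear correlation coefficient between x and y."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    covariance = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return covariance / (spread_x * spread_y)


# Made-up compressive strengths in MPa.
measured = [30.0, 40.0, 50.0]
predicted = [32.0, 39.0, 53.0]
example_mae = mean_absolute_error(measured, predicted)
example_r = pearson_r(measured, predicted)
```

MAE is expressed in the units of the target (MPa here), while R is unitless and measures only linear association, so the two criteria complement each other.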

9.
Methods Mol Biol ; 1883: 195-215, 2019.
Article in English | MEDLINE | ID: mdl-30547401

ABSTRACT

In this chapter, we introduce the reader to a popular family of machine learning algorithms, called decision trees. We then review several approaches based on decision trees that have been developed for the inference of gene regulatory networks (GRNs). Decision trees indeed have several useful properties that make them well-suited for tackling this problem: they are able to detect multivariate interacting effects between variables, are non-parametric, have good scalability, and have very few parameters. In particular, we describe in detail the GENIE3 algorithm, a state-of-the-art method for GRN inference.
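The abstract does not describe how GENIE3 works. As commonly described, tree-based GRN inference of this kind solves one regression problem per target gene and uses the importance of each candidate regulator as the score of a putative link. The sketch below keeps only that decomposition and replaces the tree ensemble with a single-split stump, so it is a stand-in and not GENIE3 itself; scores are also not normalized across targets. All names and expression values are hypothetical.

```python
from typing import Dict, List, Sequence, Tuple


def variance(values: Sequence[float]) -> float:
    """Population variance of a non-empty sequence."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)


def stump_importance(x: Sequence[float], y: Sequence[float]) -> float:
    """Best variance reduction in y from one threshold split on x."""
    total = variance(y)
    best = 0.0
    for threshold in sorted(set(x))[:-1]:  # skip the max so both sides are non-empty
        left = [yi for xi, yi in zip(x, y) if xi <= threshold]
        right = [yi for xi, yi in zip(x, y) if xi > threshold]
        within = (len(left) * variance(left) + len(right) * variance(right)) / len(y)
        best = max(best, total - within)
    return best


def rank_regulatory_links(
    expression: Dict[str, List[float]]
) -> List[Tuple[float, str, str]]:
    """Score every (regulator, target) pair and sort by decreasing score."""
    links = []
    for target, y in expression.items():  # one regression problem per target gene
        for regulator, x in expression.items():
            if regulator != target:
                links.append((stump_importance(x, y), regulator, target))
    return sorted(links, reverse=True)


# Made-up expression profiles over four samples; g2 follows g1, g3 is noise.
EXPRESSION = {
    "g1": [0.0, 0.0, 1.0, 1.0],
    "g2": [1.0, 1.1, 3.0, 3.1],
    "g3": [0.5, 0.2, 0.4, 0.3],
}
ranked_links = rank_regulatory_links(EXPRESSION)
```

In this toy example the top-ranked link is g1 to g2, because a single split on g1 removes almost all of the variance in g2.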


Subjects
Computational Biology/methods, Gene Regulatory Networks, Genetic Models, Unsupervised Machine Learning, Computational Biology/instrumentation, Decision Trees, Gene Expression Regulation
10.
JMIR Mhealth Uhealth ; 5(8): e112, 2017 Aug 10.
Article in English | MEDLINE | ID: mdl-28798010

ABSTRACT

BACKGROUND: Is someone at home, at their friend's place, at a restaurant, or enjoying the outdoors? Knowing the semantic location of an individual matters for delivering medical interventions, recommendations, and other context-aware services. This knowledge is particularly useful in mental health care for monitoring relevant behavioral indicators to improve treatment delivery. Local search-and-discovery services such as Foursquare can be used to detect semantic locations based on the global positioning system (GPS) coordinates, but GPS alone is often inaccurate. Mobile phones can also sense other signals (such as movement, light, and sound), and the use of these signals promises to lead to a better estimation of an individual's semantic location. OBJECTIVE: We aimed to examine the ability of mobile phone sensors to estimate semantic locations, and to evaluate the relationship between semantic location visit patterns and depression and anxiety. METHODS: A total of 208 participants across the United States were asked to log the type of locations they visited daily, using their mobile phones for a period of 6 weeks, while their phone sensor data was recorded. Using the sensor data and Foursquare queries based on GPS coordinates, we trained models to predict these logged locations, and evaluated their prediction accuracy on participants that the models had not seen during training. We also evaluated the relationship between the amount of time spent in each semantic location and depression and anxiety assessed at baseline, in the middle, and at the end of the study. RESULTS: While Foursquare queries detected true semantic locations with an average area under the curve (AUC) of 0.62, using phone sensor data alone increased the AUC to 0.84. When we used Foursquare and sensor data together, the AUC further increased to 0.88. We found some significant relationships between the time spent in certain locations and depression and anxiety, although these relationships were not consistent. CONCLUSIONS: The accuracy of location services such as Foursquare can significantly benefit from using phone sensor data. However, our results suggest that the nature of the places people visit explains only a small part of the variation in their anxiety and depression symptoms.
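The AUC values reported above can be read as the probability that a model scores a randomly chosen true visit to a location type higher than a randomly chosen non-visit. The sketch below computes AUC for a single binary case using that pairwise definition. How the study averaged AUC across location types is not stated in the abstract, and the function name, labels, and scores here are made up.

```python
from typing import Sequence


def roc_auc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """Area under the ROC curve via pairwise comparison of scores.

    Counts the fraction of (positive, negative) pairs in which the positive
    example receives the higher score; ties count as half.
    """
    positives = [s for label, s in zip(labels, scores) if label == 1]
    negatives = [s for label, s in zip(labels, scores) if label == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Made-up example: 1 = participant was at the location type, 0 = was not.
example_auc = roc_auc([1, 1, 0, 0], [0.9, 0.4, 0.6, 0.1])
```

An AUC of 0.5 corresponds to random ranking, which puts the reported rise from 0.62 to 0.88 into perspective.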
