Results 1 - 14 of 14
1.
bioRxiv ; 2024 May 09.
Article in English | MEDLINE | ID: mdl-38766268

ABSTRACT

Recent advances in cytometry technology have enabled high-throughput data collection with multiple single-cell protein expression measurements. The significant biological and technical variance between samples in cytometry has long posed a formidable challenge during the gating process, especially for the initial gates, which deal with unpredictable events such as debris and technical artifacts. Even with the same experimental machine and protocol, the target population, as well as the cell population that needs to be excluded, may vary across different measurements. To address this challenge and mitigate the labor-intensive manual gating process, we propose a deep learning framework, UNITO, to rigorously identify the hierarchical cytometric subpopulations. The UNITO framework transforms a cell-level classification task into an image-based semantic segmentation problem. For reproducibility purposes, the framework was applied to three independent cohorts and successfully detected initial gates that were required to identify single cellular events as well as subsequent cell gates. We validated the UNITO framework by comparing its results with previous automated methods and the consensus of at least four experienced immunologists. UNITO outperformed existing automated methods and differed from the human consensus by no more than each individual human did. Most critically, the UNITO framework functions as a fully automated pipeline after training and does not require human hints or prior knowledge. Unlike existing multi-channel classification or clustering pipelines, UNITO can reproduce a contour similar to that of manual gating for each intermediate gate, which improves interpretability and allows post hoc visual inspection. Beyond acting as a pioneering framework that uses image segmentation for auto-gating, UNITO gives a fast and interpretable way to assign cell subtype membership, and its speed is not affected by the number of cells in each sample. The pre-gating and gating inference takes approximately 2 minutes for each sample using our pre-defined 9-gate system, and the framework can also adapt to any sequential prediction with different configurations.
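
The reformulation from per-cell classification to image segmentation can be illustrated with a minimal sketch. This is not the UNITO implementation: the function names, the bin count, the example measurements, and the simple count threshold standing in for a trained segmentation network are all assumptions made for illustration.

```python
from typing import List, Sequence, Tuple


def events_to_image(
    xs: Sequence[float], ys: Sequence[float], bins: int, lo: float, hi: float
) -> Tuple[List[List[int]], List[Tuple[int, int]]]:
    """Bin per-cell (x, y) measurements into a bins x bins count image.

    Also returns the (row, col) pixel each cell fell into, so that a
    pixel-level mask can later be mapped back to individual cells.
    """
    image = [[0] * bins for _ in range(bins)]
    pixels = []
    scale = bins / (hi - lo)
    for x, y in zip(xs, ys):
        # Clamp so that out-of-range events land in the border pixels.
        col = min(bins - 1, max(0, int((x - lo) * scale)))
        row = min(bins - 1, max(0, int((y - lo) * scale)))
        image[row][col] += 1
        pixels.append((row, col))
    return image, pixels


def mask_to_cell_labels(mask, pixels) -> List[bool]:
    """A cell is inside the gate if its pixel is set in the mask."""
    return [bool(mask[row][col]) for row, col in pixels]


# Hypothetical two-channel measurements for three cells, scaled to [0, 1).
xs = [0.10, 0.20, 0.90]
ys = [0.10, 0.15, 0.80]
image, pixels = events_to_image(xs, ys, bins=4, lo=0.0, hi=1.0)

# Stand-in for a trained segmentation model: keep pixels with >= 2 cells.
mask = [[count >= 2 for count in row] for row in image]
labels = mask_to_cell_labels(mask, pixels)
```

Because the mask is predicted on a fixed-size image, the cost of the segmentation step in such a design depends on the image resolution rather than on how many cells were binned into it.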

2.
bioRxiv ; 2024 Feb 06.
Article in English | MEDLINE | ID: mdl-38370767

ABSTRACT

Single-cell technologies have emerged as a transformative technology enabling high-dimensional characterization of cell populations at an unprecedented scale. The data's innate complexity and voluminous nature pose significant computational and analytical challenges, especially in comparative studies delineating cellular architectures across various biological conditions (i.e., generation of sample-level distance matrices). Optimal Transport (OT) is a mathematical tool that captures the intrinsic structure of data geometrically and has been applied to many bioinformatics tasks. In this paper, we propose QOT (Quantized Optimal Transport), a new method that enables efficient computation of a sample-level distance matrix from large-scale single-cell omics data through a quantization step. We apply our algorithm to real-world single-cell genomics and pathomics datasets, aiming to extrapolate cell-level insights to inform sample-level categorizations. Our empirical study shows that QOT outperforms OT-based algorithms in terms of accuracy and robustness when obtaining a distance matrix at the sample level from high-throughput single-cell measures. Moreover, the sample-level distance matrix could be used in downstream analysis (i.e., uncovering the trajectory of disease progression), highlighting its usage in biomedical informatics and data science.
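
The idea of quantizing each sample before computing transport distances can be sketched in one dimension. The abstract does not specify QOT's quantizer or OT solver, so the following is only a stand-in: it assumes a single marker per cell, fixed-width bins, and the closed-form 1D Wasserstein-1 distance between histograms on a shared grid. All names and values are hypothetical.

```python
from typing import List, Sequence


def quantize(values: Sequence[float], lo: float, hi: float, n_bins: int) -> List[float]:
    """Summarize one sample's cells as a normalized fixed-width histogram."""
    width = (hi - lo) / n_bins
    counts = [0] * n_bins
    for v in values:
        idx = min(n_bins - 1, max(0, int((v - lo) / width)))
        counts[idx] += 1
    return [c / len(values) for c in counts]


def wasserstein_1d(p: Sequence[float], q: Sequence[float], width: float) -> float:
    """Wasserstein-1 distance between two histograms on the same evenly
    spaced grid: bin width times the summed absolute CDF difference."""
    cdf_gap = 0.0
    total = 0.0
    for a, b in zip(p, q):
        cdf_gap += a - b
        total += abs(cdf_gap)
    return total * width


def sample_distance_matrix(
    samples: Sequence[Sequence[float]], lo: float, hi: float, n_bins: int
) -> List[List[float]]:
    """Pairwise sample-level distances computed on the quantized samples."""
    width = (hi - lo) / n_bins
    hists = [quantize(s, lo, hi, n_bins) for s in samples]
    return [[wasserstein_1d(p, q, width) for q in hists] for p in hists]


# Three hypothetical samples, each a list of per-cell marker values.
samples = [[0.1, 0.1], [0.9, 0.9], [0.1, 0.9]]
distances = sample_distance_matrix(samples, lo=0.0, hi=1.0, n_bins=2)
```

Quantizing first means the transport problem is solved over a fixed number of bins per sample instead of over every cell, which is the kind of saving a quantization step is meant to provide.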

3.
J Hum Hypertens ; 37(10): 898-906, 2023 10.
Article in English | MEDLINE | ID: mdl-36528682

ABSTRACT

The study characterises vascular phenotypes of hypertensive patients utilising machine learning approaches. Newly diagnosed and treatment-naïve primary hypertensive patients without co-morbidities (aged 18-55, n = 73), and matched normotensive controls (n = 79) were recruited (NCT04015635). Blood pressure (BP) and BP variability were determined using 24 h ambulatory monitoring. Vascular phenotyping included SphygmoCor® measurement of pulse wave velocity (PWV), pulse wave analysis-derived augmentation index (PWA-AIx), and central BP; EndoPAT™-2000® provided reactive hyperaemia index (LnRHI) and augmentation index adjusted to a heart rate of 75 bpm (AI@75). Ultrasound was used to analyse flow-mediated dilatation and carotid intima-media thickness (CIMT). In addition to standard statistical methods to compare normotensive and hypertensive groups, machine learning techniques including biclustering explored hypertensive phenotypic subgroups. We report that arterial stiffness (PWV, PWA-AIx, EndoPAT-2000-derived AI@75) and central pressures were greater in incident hypertension than normotension. Endothelial function, percent nocturnal dip, and CIMT did not differ between groups. The vascular phenotype of white-coat hypertension imitated sustained hypertension with elevated arterial stiffness and central pressure, whereas masked hypertension demonstrated values similar to normotension. Machine learning revealed three distinct hypertension clusters, representing 'arterially stiffened', 'vaso-protected', and 'non-dipper' patients. Key clustering features were nocturnal and central BP, percent dipping, and arterial stiffness measures. We conclude that untreated patients with primary hypertension demonstrate early arterial stiffening rather than endothelial dysfunction or CIMT alterations. Phenotypic heterogeneity in nocturnal and central BP, percent dipping, and arterial stiffness observed early in the course of disease may have implications for risk stratification.


Subject(s)
Hypertension , Vascular Stiffness , Humans , Carotid Intima-Media Thickness , Pulse Wave Analysis , Ambulatory Blood Pressure Monitoring , Hypertension/diagnosis , Blood Pressure/physiology , Phenotype
4.
Sci Adv ; 8(47): eabl4747, 2022 11 25.
Article in English | MEDLINE | ID: mdl-36417520

ABSTRACT

Understanding the strengths and weaknesses of machine learning (ML) algorithms is crucial to determine their scope of application. Here, we introduce the Diverse and Generative ML Benchmark (DIGEN), a collection of synthetic datasets for comprehensive, reproducible, and interpretable benchmarking of ML algorithms for classification of binary outcomes. The DIGEN resource consists of 40 mathematical functions that map continuous features to binary targets for creating synthetic datasets. These 40 functions were found using a heuristic algorithm designed to maximize the diversity of performance among multiple popular ML algorithms, thus providing a useful test suite for evaluating and comparing new methods. Access to the generative functions facilitates understanding of why a method performs poorly compared to other algorithms, thus providing ideas for improvement.
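
To make the notion of a generative function concrete, here is a minimal sketch of how a mathematical function mapping continuous features to a binary target can be used to create a reproducible synthetic dataset. The rule below is a hypothetical example and is not one of the 40 DIGEN functions; the feature distribution, function names, and parameters are likewise assumptions.

```python
import random
from typing import Callable, List, Sequence, Tuple


def example_generative_function(x: Sequence[float]) -> int:
    """Hypothetical rule from continuous features to a binary target."""
    return 1 if x[0] * x[1] > x[2] else 0


def make_dataset(
    fn: Callable[[Sequence[float]], int], n_rows: int, n_features: int, seed: int
) -> Tuple[List[List[float]], List[int]]:
    """Sample standard-normal features and label each row with fn.

    A fixed seed makes the dataset reproducible across runs.
    """
    rng = random.Random(seed)
    features = [[rng.gauss(0.0, 1.0) for _ in range(n_features)] for _ in range(n_rows)]
    targets = [fn(row) for row in features]
    return features, targets


features, targets = make_dataset(example_generative_function, n_rows=50, n_features=3, seed=0)
```

Because the labeling rule is known exactly, a poorly performing classifier can be examined against the ground-truth function rather than against an opaque real-world dataset.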

5.
J Clin Med ; 11(7)2022 Apr 06.
Article in English | MEDLINE | ID: mdl-35407664

ABSTRACT

The COVID-19 pandemic has sparked a barrage of primary research and reviews. We investigated the publishing process and the waste of time and resources, and assessed the methodological quality of the reviews on artificial intelligence techniques to diagnose COVID-19 in medical images. We searched nine databases from inception until 1 September 2020. Two independent reviewers performed all steps of identification, extraction, and methodological credibility assessment of records. Out of 725 records, 22 reviews analysing 165 primary studies met the inclusion criteria. This review covers 174,277 participants in total, including 19,170 diagnosed with COVID-19. The methodological credibility of all eligible studies was rated as critically low: 95% of papers had significant flaws in reporting quality. On average, 7.24 (range: 0-45) new papers were included in each subsequent review, and 14% of studies did not take any new paper into consideration. Almost three-quarters of the studies included less than 10% of available studies. More than half of the reviews did not comment on the previously published reviews at all. Much wasted time and resources could be avoided by referring to previous reviews and following methodological guidelines. Such information chaos is alarming. It is high time to draw conclusions from what we experienced and prepare for future pandemics.

6.
Adv Neural Inf Process Syst ; 2021(DB1): 1-16, 2021 Dec.
Article in English | MEDLINE | ID: mdl-38715933

ABSTRACT

Many promising approaches to symbolic regression have been presented in recent years, yet progress in the field continues to suffer from a lack of uniform, robust, and transparent benchmarking standards. We address this shortcoming by introducing an open-source, reproducible benchmarking platform for symbolic regression. We assess 14 symbolic regression methods and 7 machine learning methods on a set of 252 diverse regression problems. Our assessment includes both real-world datasets with no known model form as well as ground-truth benchmark problems. For the real-world datasets, we benchmark the ability of each method to learn models with low error and low complexity relative to state-of-the-art machine learning methods. For the synthetic problems, we assess each method's ability to find exact solutions in the presence of varying levels of noise. Under these controlled experiments, we conclude that the best performing methods for real-world regression combine genetic algorithms with parameter estimation and/or semantic search drivers. When tasked with recovering exact equations in the presence of noise, we find that several approaches perform similarly. We provide a detailed guide to reproducing this experiment and contributing new methods, and encourage other researchers to collaborate with us on a common and living symbolic regression benchmark.

7.
BioData Min ; 12: 14, 2019.
Article in English | MEDLINE | ID: mdl-31320928

ABSTRACT

BACKGROUND: The principal line of investigation in Genome Wide Association Studies (GWAS) is the identification of main effects, that is, individual Single Nucleotide Polymorphisms (SNPs) which are associated with the trait of interest, independent of other factors. A variety of methods have been proposed to this end, mostly statistical in nature and differing in assumptions and type of model employed. Moreover, for a given model, there may be multiple choices for the SNP genotype encoding. As an alternative to statistical methods, machine learning methods are often applicable. Typically, for a given GWAS, a single approach is selected and utilized to identify potential SNPs of interest. Even when multiple GWAS are combined through meta-analyses within a consortium, each GWAS is typically analyzed with a single approach and the resulting summary statistics are then utilized in meta-analyses. RESULTS: In this work we use as case studies a Type 2 Diabetes (T2D) and a breast cancer GWAS to explore a diversity of applicable approaches spanning different methods and encoding choices. We assess similarity of these approaches based on the derived ranked lists of SNPs and, for each GWAS, we identify a subset of representative approaches that we use as an ensemble to derive a union list of top SNPs. Among these are SNPs which are identified by multiple approaches as well as several SNPs identified by only one or a few of the less frequently used approaches. The latter include SNPs from established loci and SNPs which have other supporting lines of evidence in terms of their potential relevance to the traits. CONCLUSIONS: Not every main effect analysis method is suitable for every GWAS, but for each GWAS there are typically multiple applicable methods and encoding options. We suggest a workflow for a single GWAS, extensible to multiple GWAS from consortia, where representative approaches are selected among a pool of suitable options, to yield a more comprehensive set of SNPs, potentially including SNPs that would typically be missed with the most popular analyses, but that could provide additional valuable insights for follow-up.
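
The two operations described above, comparing approaches through their ranked SNP lists and taking a union of top SNPs across an ensemble, can be sketched as follows. The abstract does not name the similarity measure, so the Jaccard overlap of top-k sets is an assumption; the approach names and SNP identifiers are hypothetical placeholders.

```python
from typing import Dict, List, Sequence


def jaccard_top_k(ranked_a: Sequence[str], ranked_b: Sequence[str], k: int) -> float:
    """Overlap of the top-k SNPs of two ranked lists (assumed similarity measure)."""
    top_a, top_b = set(ranked_a[:k]), set(ranked_b[:k])
    union = top_a | top_b
    return len(top_a & top_b) / len(union) if union else 1.0


def union_of_top_snps(ranked_lists: Dict[str, Sequence[str]], k: int) -> Dict[str, List[str]]:
    """Map every SNP found in any approach's top-k to the approaches that found it."""
    support: Dict[str, List[str]] = {}
    for approach, ranked in ranked_lists.items():
        for snp in ranked[:k]:
            support.setdefault(snp, []).append(approach)
    return support


# Hypothetical ranked lists from three analysis approaches.
ranked_lists = {
    "logistic_additive": ["rs1", "rs2", "rs3"],
    "logistic_dominant": ["rs2", "rs1", "rs4"],
    "random_forest": ["rs5", "rs1", "rs2"],
}
support = union_of_top_snps(ranked_lists, k=2)
similarity = jaccard_top_k(
    ranked_lists["logistic_additive"], ranked_lists["random_forest"], k=2
)
```

SNPs whose support list contains a single approach correspond to the candidates that a one-method analysis would miss.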

8.
Gigascience ; 8(7)2019 07 01.
Article in English | MEDLINE | ID: mdl-31251324

ABSTRACT

Biclustering is a technique of discovering local similarities within data. For many years the complexity of the methods and parallelization issues limited its application to big data problems. With the development of novel scalable methods, biclustering has finally started to close this gap. In this paper we discuss the caveats of biclustering and present its current challenges and guidelines for practitioners. We also try to explain why biclustering may soon become one of the standards for big data analytics.


Subject(s)
Big Data , Genomics/methods , DNA Sequence Analysis/methods , Cluster Analysis , Data Mining/methods , Human Genome , Genomics/standards , Humans , Sequence Alignment/methods , Sequence Alignment/standards , DNA Sequence Analysis/standards , Software
9.
Bioinformatics ; 35(17): 3181-3183, 2019 09 01.
Article in English | MEDLINE | ID: mdl-30649199

ABSTRACT

MOTIVATION: In this paper, we present an open source package with the latest release of Evolutionary-based BIClustering (EBIC), a next-generation biclustering algorithm for mining genetic data. The major contribution of this paper is full support for multiple graphics processing units (GPUs), which makes it possible to run large genomic data mining analyses efficiently. Multiple enhancements to the first release of the algorithm include integration with R and Bioconductor, and an option to exclude missing values from the analysis. RESULTS: Evolutionary-based BIClustering was applied to datasets of different sizes, including a large DNA methylation dataset with 436 444 rows. For the largest dataset we observed over 6.6-fold speedup in computation time on a cluster of eight GPUs compared to running the method on a single GPU. This demonstrates the high scalability of the method. AVAILABILITY AND IMPLEMENTATION: The latest version of EBIC can be downloaded from http://github.com/EpistasisLab/ebic. Installation and usage instructions are also available online. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.


Subject(s)
Data Analysis , Software , Algorithms , DNA Methylation , Genomics
10.
Bioinformatics ; 34(24): 4302-4304, 2018 12 15.
Article in English | MEDLINE | ID: mdl-29939213

ABSTRACT

Motivation: Biclustering is an unsupervised technique for simultaneously clustering the rows and columns of an input matrix. With multiple biclustering algorithms proposed, UniBic remains one of the most accurate methods developed so far. Results: In this paper we introduce a Bioconductor package called runibic with a parallel implementation of UniBic. For convenience, the algorithm was reimplemented, parallelized and wrapped within an R package. The package includes: (i) a parallel version that is several times faster than the original sequential algorithm, (ii) much more efficient memory management, (iii) modularity that allows new methods to be built on top of the provided one and (iv) integration with modern Bioconductor packages such as SummarizedExperiment, ExpressionSet and biclust. Availability and implementation: The package is implemented in R and is available from Bioconductor (starting from version 3.6) at the following URL http://bioconductor.org/packages/runibic with installation instructions and tutorial. Supplementary information: Supplementary data are available at Bioinformatics online.


Subject(s)
Algorithms , Gene Expression Profiling/methods , Software , Cluster Analysis , Computational Biology , Gene Expression
11.
Bioinformatics ; 34(21): 3719-3726, 2018 11 01.
Article in English | MEDLINE | ID: mdl-29790909

ABSTRACT

Motivation: Biclustering algorithms are commonly used for gene expression data analysis. However, accurate identification of meaningful structures is very challenging, and state-of-the-art methods are incapable of discovering different patterns of high biological relevance with high accuracy. Results: In this paper, a novel biclustering algorithm based on evolutionary computation, a sub-field of artificial intelligence, is introduced. The method, called EBIC, aims to detect order-preserving patterns in complex data. EBIC is capable of discovering multiple complex patterns with unprecedented accuracy in real gene expression datasets. It is also one of the very few biclustering methods designed for parallel environments with multiple graphics processing units. We demonstrate that EBIC greatly outperforms state-of-the-art biclustering methods, in terms of recovery and relevance, on both synthetic and genetic datasets. EBIC also yields results over 12 times faster than the most accurate reference algorithms. Availability and implementation: EBIC source code is available on GitHub at https://github.com/EpistasisLab/ebic. Supplementary information: Supplementary data are available at Bioinformatics online.
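
As an illustration of what an order-preserving pattern is, the sketch below checks whether a chosen set of rows and columns forms one: every selected row must rank the selected columns in the same order. It only tests the pattern definition and does not reproduce EBIC's evolutionary search; the function names and the example matrix are hypothetical.

```python
from typing import Sequence, Tuple


def column_order(row: Sequence[float], cols: Sequence[int]) -> Tuple[int, ...]:
    """Selected columns sorted by this row's values (ties keep input order)."""
    return tuple(sorted(cols, key=lambda c: row[c]))


def is_order_preserving(
    matrix: Sequence[Sequence[float]], rows: Sequence[int], cols: Sequence[int]
) -> bool:
    """True if all selected rows induce the same ordering of the selected columns."""
    orders = {column_order(matrix[r], cols) for r in rows}
    return len(orders) <= 1


# Hypothetical expression matrix: rows 0 and 1 differ in scale but share an order.
matrix = [
    [1.0, 5.0, 3.0],
    [10.0, 50.0, 30.0],
    [2.0, 1.0, 3.0],
]
shared = is_order_preserving(matrix, rows=[0, 1], cols=[0, 1, 2])
broken = is_order_preserving(matrix, rows=[0, 1, 2], cols=[0, 1, 2])
```

Rows 0 and 1 have very different magnitudes yet pass the check, which is why order-preserving patterns can capture relationships that value-based similarity would miss.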


Subject(s)
Algorithms , Artificial Intelligence , Cluster Analysis , Gene Expression Profiling , Software
12.
Pac Symp Biocomput ; 23: 123-132, 2018.
Article in English | MEDLINE | ID: mdl-29218875

ABSTRACT

Electronic Health Records (EHRs) contain a wealth of patient data useful to biomedical researchers. At present, both the extraction of data and methods for analyses are frequently designed to work with a single snapshot of a patient's record. Health care providers often perform and record actions in small batches over time. By extracting these care events, a sequence can be formed providing a trajectory for a patient's interactions with the health care system. These care events also offer a basic heuristic for the level of attention a patient receives from health care providers. We show that it is possible to learn meaningful embeddings from these care events using two deep learning techniques, unsupervised autoencoders and long short-term memory networks. We compare these methods to traditional machine learning methods which require a point-in-time snapshot to be extracted from an EHR.
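
One simple way to turn recorded actions into a sequence of care events is to split them wherever the gap between consecutive actions exceeds a threshold. The abstract does not state the extraction rule used, so the gap-based grouping, the threshold, and the representation of timestamps as minutes are all assumptions in this sketch.

```python
from typing import List, Sequence


def group_into_care_events(timestamps: Sequence[float], max_gap: float) -> List[List[float]]:
    """Group action timestamps into batches.

    A new batch (care event) starts whenever the time since the previous
    action is larger than max_gap. Input order does not matter.
    """
    events: List[List[float]] = []
    for t in sorted(timestamps):
        if events and t - events[-1][-1] <= max_gap:
            events[-1].append(t)
        else:
            events.append([t])
    return events


# Hypothetical action times for one patient, in minutes since admission.
actions = [0, 5, 7, 120, 125, 600]
care_events = group_into_care_events(actions, max_gap=30)
```

The resulting list of batches is an ordered sequence per patient, which is the shape of input that sequence models such as long short-term memory networks expect, and the number of batches gives a rough count of provider interactions.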


Subject(s)
Critical Care/statistics & numerical data , Machine Learning/statistics & numerical data , Computational Biology/methods , Factual Databases/statistics & numerical data , Electronic Health Records/statistics & numerical data , Female , Humans , Male , Supervised Machine Learning/statistics & numerical data , Unsupervised Machine Learning/statistics & numerical data
13.
Pac Symp Biocomput ; 23: 460-471, 2018.
Article in English | MEDLINE | ID: mdl-29218905

ABSTRACT

With the maturation of metabolomics science and proliferation of biobanks, clinical metabolic profiling is an increasingly opportunistic frontier for advancing translational clinical research. Automated Machine Learning (AutoML) approaches provide an exciting opportunity to guide feature selection in agnostic metabolic profiling endeavors, where potentially thousands of independent data points must be evaluated. In previous research, AutoML using high-dimensional data of varying types has been demonstrably robust, outperforming traditional approaches. However, considerations for application in clinical metabolic profiling remain to be evaluated, particularly regarding the robustness of AutoML in identifying and adjusting for common clinical confounders. In this study, we present a focused case study regarding AutoML considerations for using the Tree-based Pipeline Optimization Tool (TPOT) in metabolic profiling of exposure to metformin in a biobank cohort. First, we propose a tandem rank-accuracy measure to guide agnostic feature selection and corresponding threshold determination in clinical metabolic profiling endeavors. Second, while AutoML, using default parameters, demonstrated potential to lack sensitivity to low-effect confounding clinical covariates, we demonstrated residual training and adjustment of metabolite features as an easily applicable approach to ensure AutoML adjustment for potential confounding characteristics. Finally, we present increased homocysteine with long-term exposure to metformin as a potentially novel, non-replicated metabolite association suggested by TPOT; an association not identified in parallel clinical metabolic profiling endeavors. While warranting independent replication, our tandem rank-accuracy measure suggests homocysteine to be the metabolite feature with the largest effect, and corresponding priority for further translational clinical research. Residual training and adjustment for a potential confounding effect by BMI only slightly modified the suggested association. Increased homocysteine is thought to be associated with vitamin B12 deficiency; evaluation for potential clinical relevance is suggested. While considerations for clinical metabolic profiling are recommended, including adjustment approaches for clinical confounders, AutoML presents an exciting tool to enhance clinical metabolic profiling and advance translational research endeavors.
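
Residual adjustment of a metabolite feature for a clinical covariate can be sketched with ordinary least squares on a single covariate: the feature is regressed on the covariate and the residuals are passed on in its place. The abstract does not describe the exact procedure used, so the single-covariate linear model, the function name, and the example values are assumptions.

```python
from typing import List, Sequence


def residualize(feature: Sequence[float], covariate: Sequence[float]) -> List[float]:
    """Remove the linear effect of one covariate from a feature.

    Fits feature = intercept + slope * covariate by least squares and
    returns the residuals, which are uncorrelated with the covariate.
    """
    n = len(feature)
    mean_x = sum(covariate) / n
    mean_y = sum(feature) / n
    sxx = sum((x - mean_x) ** 2 for x in covariate)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(covariate, feature))
    slope = sxy / sxx if sxx else 0.0
    intercept = mean_y - slope * mean_x
    return [y - (intercept + slope * x) for x, y in zip(covariate, feature)]


# Hypothetical values: a metabolite level that depends linearly on BMI.
bmi = [20.0, 25.0, 30.0, 35.0]
metabolite = [1.0, 1.5, 2.0, 2.5]
adjusted = residualize(metabolite, bmi)
```

In this constructed example the metabolite is fully explained by BMI, so the adjusted feature carries no remaining signal; a feature whose association survives residualization is less likely to be a proxy for the covariate.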


Subject(s)
Homocysteine/blood , Hypoglycemic Agents/adverse effects , Metabolome , Metformin/adverse effects , Supervised Machine Learning/statistics & numerical data , Bias , Body Mass Index , Case-Control Studies , Computational Biology/methods , Type 2 Diabetes Mellitus/blood , Type 2 Diabetes Mellitus/drug therapy , Humans , Metabolomics/statistics & numerical data , Risk Factors , Translational Biomedical Research
14.
BioData Min ; 10: 36, 2017.
Article in English | MEDLINE | ID: mdl-29238404

ABSTRACT

BACKGROUND: The selection, development, or comparison of machine learning methods in data mining can be a difficult task based on the target problem and goals of a particular study. Numerous publicly available real-world and simulated benchmark datasets have emerged from different sources, but their organization and adoption as standards have been inconsistent. As such, selecting and curating specific benchmarks remains an unnecessary burden on machine learning practitioners and data scientists. RESULTS: The present study introduces an accessible, curated, and developing public benchmark resource to facilitate identification of the strengths and weaknesses of different machine learning methodologies. We compare meta-features among the current set of benchmark datasets in this resource to characterize the diversity of available data. Finally, we apply a number of established machine learning methods to the entire benchmark suite and analyze how datasets and algorithms cluster in terms of performance. From this study, we find that existing benchmarks lack the diversity to properly benchmark machine learning algorithms, and there are several gaps in benchmarking problems that still need to be considered. CONCLUSIONS: This work represents another important step towards understanding the limitations of popular benchmarking suites and developing a resource that connects existing benchmarking standards to more diverse and efficient standards in the future.
