Results 1 - 14 of 14
1.
PLoS One ; 16(11): e0260315, 2021.
Article in English | MEDLINE | ID: mdl-34797894

ABSTRACT

Overdose prescription errors sometimes cause serious life-threatening adverse drug events, while underdose errors lead to diminished therapeutic effects. Therefore, it is important to detect and prevent these errors. In the present study, we used the one-class support vector machine (OCSVM), one of the most common unsupervised machine learning algorithms for anomaly detection, to identify overdose and underdose prescriptions. We extracted prescription data from electronic health records in Kyushu University Hospital between January 1, 2014 and December 31, 2019. We constructed an OCSVM model for each of the 21 candidate drugs using three features: age, weight, and dose. Clinical overdose and underdose prescriptions, which were identified and rectified by pharmacists before administration, were collected. Synthetic overdose and underdose prescriptions were created using the maximum and minimum doses, defined by drug labels or the UpToDate database. We applied these prescription data to the OCSVM model and evaluated its detection performance. We also performed comparative analysis with other unsupervised outlier detection algorithms (local outlier factor, isolation forest, and robust covariance). Twenty-seven out of 31 clinical overdose and underdose prescriptions (87.1%) were detected as abnormal by the model. The constructed OCSVM models showed high performance for detecting synthetic overdose prescriptions (precision 0.986, recall 0.964, and F-measure 0.973) and synthetic underdose prescriptions (precision 0.980, recall 0.794, and F-measure 0.839). In comparative analysis, OCSVM showed the best performance. Our models detected the majority of clinical overdose and underdose prescriptions and demonstrated high performance in synthetic data analysis. OCSVM models, constructed using features such as age, weight, and dose, are useful for detecting overdose and underdose prescriptions.
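The detection metrics quoted in this abstract (precision, recall, F-measure) can be illustrated with a minimal Python sketch. The function name `detection_metrics` and the counts passed to it are hypothetical and are not taken from the study; only the 27-of-31 clinical detection figure comes from the abstract.

```python
def detection_metrics(true_positive, false_positive, false_negative):
    """Return (precision, recall, F-measure) for a binary anomaly detector."""
    flagged = true_positive + false_positive   # everything the model flagged
    actual = true_positive + false_negative    # everything that was truly abnormal
    precision = true_positive / flagged if flagged else 0.0
    recall = true_positive / actual if actual else 0.0
    denom = precision + recall
    # F-measure is the harmonic mean of precision and recall.
    f_measure = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f_measure


# Hypothetical counts, chosen only to exercise the function.
precision, recall, f_measure = detection_metrics(90, 10, 30)

# Detection rate reported in the abstract: 27 of 31 clinical prescriptions.
clinical_detection_rate = round(27 / 31 * 100, 1)
```

A single F-measure computed from pooled precision and recall need not equal a value averaged over per-drug models, so this sketch is not expected to reproduce the reported F-measures exactly.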


Subject(s)
Drug Overdose/diagnosis , Prescription Drugs/adverse effects , Prescriptions/statistics & numerical data , Adolescent , Adult , Aged , Aged, 80 and over , Algorithms , Child, Preschool , Data Analysis , Data Collection/statistics & numerical data , Data Management/statistics & numerical data , Databases, Factual/statistics & numerical data , Electronic Health Records/statistics & numerical data , Humans , Infant , Mental Recall , Middle Aged , Support Vector Machine/statistics & numerical data , Unsupervised Machine Learning/statistics & numerical data , Young Adult
2.
PLoS Comput Biol ; 17(9): e1009439, 2021 09.
Article in English | MEDLINE | ID: mdl-34550974

ABSTRACT

Recent neuroscience studies demonstrate that a deeper understanding of brain function requires a deeper understanding of behavior. Detailed behavioral measurements are now often collected using video cameras, resulting in an increased need for computer vision algorithms that extract useful information from video data. Here we introduce a new video analysis tool that combines the output of supervised pose estimation algorithms (e.g. DeepLabCut) with unsupervised dimensionality reduction methods to produce interpretable, low-dimensional representations of behavioral videos that extract more information than pose estimates alone. We demonstrate this tool by extracting interpretable behavioral features from videos of three different head-fixed mouse preparations, as well as a freely moving mouse in an open field arena, and show how these interpretable features can facilitate downstream behavioral and neural analyses. We also show how the behavioral features produced by our model improve the precision and interpretation of these downstream analyses compared to using the outputs of either fully supervised or fully unsupervised methods alone.


Subject(s)
Algorithms , Artificial Intelligence/statistics & numerical data , Behavior, Animal , Video Recording , Animals , Computational Biology , Computer Simulation , Markov Chains , Mice , Models, Statistical , Neural Networks, Computer , Supervised Machine Learning/statistics & numerical data , Unsupervised Machine Learning/statistics & numerical data , Video Recording/statistics & numerical data
3.
Nat Commun ; 12(1): 1029, 2021 02 15.
Article in English | MEDLINE | ID: mdl-33589635

ABSTRACT

A primary challenge in single-cell RNA sequencing (scRNA-seq) studies comes from the massive amount of data and the excess noise level. To address this challenge, we introduce an analysis framework, named single-cell Decomposition using Hierarchical Autoencoder (scDHA), that reliably extracts representative information of each cell. The scDHA pipeline consists of two core modules. The first module is a non-negative kernel autoencoder able to remove genes or components that have insignificant contributions to the part-based representation of the data. The second module is a stacked Bayesian autoencoder that projects the data onto a low-dimensional space (compressed). To diminish the tendency to overfit of neural networks, we repeatedly perturb the compressed space to learn a more generalized representation of the data. In an extensive analysis, we demonstrate that scDHA outperforms state-of-the-art techniques in many research sub-fields of scRNA-seq analysis, including cell segregation through unsupervised learning, visualization of transcriptome landscape, cell classification, and pseudo-time inference.


Subject(s)
Neural Networks, Computer , Sequence Analysis, RNA/statistics & numerical data , Single-Cell Analysis/statistics & numerical data , Unsupervised Machine Learning/statistics & numerical data , Animals , Bayes Theorem , Benchmarking , Cell Separation/methods , Cerebellum/chemistry , Cerebellum/cytology , Embryo, Mammalian , Humans , Liver/chemistry , Liver/cytology , Lung/chemistry , Lung/cytology , Mice , Mouse Embryonic Stem Cells/chemistry , Mouse Embryonic Stem Cells/cytology , Pancreas/chemistry , Pancreas/cytology , Retina/chemistry , Retina/cytology , Single-Cell Analysis/methods , Visual Cortex/chemistry , Visual Cortex/cytology , Zygote/chemistry , Zygote/cytology
4.
Nat Protoc ; 16(2): 754-774, 2021 02.
Article in English | MEDLINE | ID: mdl-33424024

ABSTRACT

Cell morphology encodes essential information on many underlying biological processes. It is commonly used by clinicians and researchers in the study, diagnosis, prognosis, and treatment of human diseases. Quantification of cell morphology has seen tremendous advances in recent years. However, effectively defining morphological shapes and evaluating the extent of morphological heterogeneity within cell populations remain challenging. Here we present a protocol and software for the analysis of cell and nuclear morphology from fluorescence or bright-field images using the VAMPIRE algorithm ( https://github.com/kukionfr/VAMPIRE_open ). This algorithm enables the profiling and classification of cells into shape modes based on equidistant points along cell and nuclear contours. Examining the distributions of cell morphologies across automatically identified shape modes provides an effective visualization scheme that relates cell shapes to cellular subtypes based on endogenous and exogenous cellular conditions. In addition, these shape mode distributions offer a direct and quantitative way to measure the extent of morphological heterogeneity within cell populations. This protocol is highly automated and fast, with the ability to quantify the morphologies from 2D projections of cells either seeded on 2D substrates or embedded within 3D microenvironments, such as hydrogels and tissues. The complete analysis pipeline can be completed within 60 minutes for a dataset of ~20,000 cells/2,400 images.
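The idea of describing a shape by equidistant points along its contour can be sketched as follows. This is an illustrative Python example under stated assumptions, not the VAMPIRE implementation: the function name `resample_closed_contour`, the polygon input format, and the linear interpolation between vertices are all assumptions made for this sketch.

```python
import math


def resample_closed_contour(points, n_samples):
    """Return n_samples points spaced equally by arc length along a closed polygon."""
    if len(points) < 2 or n_samples < 1:
        raise ValueError("need at least two vertices and one sample")
    # Pair each vertex with the next one, wrapping around to close the contour.
    segments = list(zip(points, points[1:] + points[:1]))
    lengths = [math.dist(a, b) for a, b in segments]
    step = sum(lengths) / n_samples  # arc length between consecutive samples

    resampled = []
    seg_index = 0
    seg_start = 0.0  # arc length accumulated before the current segment
    for k in range(n_samples):
        target = k * step
        # Advance to the segment that contains the target arc length.
        while (seg_index < len(segments) - 1
               and seg_start + lengths[seg_index] < target):
            seg_start += lengths[seg_index]
            seg_index += 1
        (x0, y0), (x1, y1) = segments[seg_index]
        seg_len = lengths[seg_index]
        t = (target - seg_start) / seg_len if seg_len else 0.0
        # Linear interpolation along the current segment.
        resampled.append((x0 + t * (x1 - x0), y0 + t * (y1 - y0)))
    return resampled


# Hypothetical contour: a unit square resampled to eight points.
square = [(0.0, 0.0), (1.0, 0.0), (1.0, 1.0), (0.0, 1.0)]
square_points = resample_closed_contour(square, 8)
```

Resampling every contour to the same number of points gives each cell a fixed-length coordinate vector, which is the kind of representation that can then be grouped into shape modes.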


Subject(s)
Cell Shape/physiology , Imaging, Three-Dimensional/methods , Microscopy, Confocal/methods , Algorithms , Cell Nucleus/physiology , Humans , Software , Unsupervised Machine Learning/statistics & numerical data
5.
Genes (Basel) ; 11(7), 2020 07 14.
Article in English | MEDLINE | ID: mdl-32674393

ABSTRACT

As single-cell RNA sequencing technologies mature, massive gene expression profiles can be obtained. Consequently, cell clustering and annotation become two crucial and fundamental procedures affecting other specific downstream analyses. Most existing single-cell RNA-seq (scRNA-seq) data clustering algorithms do not take into account the available cell annotation results on the same tissues or organisms from other laboratories. Nonetheless, such data could assist and guide the clustering process on the target dataset. Identifying marker genes through differential expression analysis to manually annotate large amounts of cells also costs labor and resources. Therefore, in this paper, we propose a novel end-to-end cell supervised clustering and annotation framework called scAnCluster, which fully utilizes the cell type labels available from reference data to facilitate the cell clustering and annotation on the unlabeled target data. Our algorithm integrates deep supervised learning, self-supervised learning and unsupervised learning techniques together, and it outperforms other customized scRNA-seq supervised clustering methods in both simulation and real data. It is particularly worth noting that our method performs well on the challenging task of discovering novel cell types that are absent in the reference data.


Subject(s)
Molecular Sequence Annotation , RNA-Seq/methods , Single-Cell Analysis/methods , Transcriptome/genetics , Cluster Analysis , Computer Simulation , Gene Expression Profiling , Genetic Markers/genetics , RNA-Seq/statistics & numerical data , Sequence Analysis, RNA/methods , Sequence Analysis, RNA/statistics & numerical data , Single-Cell Analysis/statistics & numerical data , Unsupervised Machine Learning/statistics & numerical data , Exome Sequencing/methods , Exome Sequencing/statistics & numerical data
6.
J Comput Biol ; 27(9): 1337-1340, 2020 09.
Article in English | MEDLINE | ID: mdl-31905016

ABSTRACT

The increasing availability of complex data in biology and medicine has promoted the use of machine learning in classification tasks to address important problems in translational and fundamental science. Two important obstacles, however, may limit the unraveling of the full potential of machine learning in these fields: the lack of generalization of the resulting models and the limited number of labeled data sets in some applications. To address these important problems, we developed an unsupervised ensemble algorithm called strategy for unsupervised multiple method aggregation (SUMMA). By virtue of being an ensemble method, SUMMA is more robust to generalization than the predictions it combines. By virtue of being unsupervised, SUMMA does not require labeled data. SUMMA receives as input predictions from a diversity of models and estimates their classification performance even when labeled data are unavailable. It then uses these performance estimates to combine these different predictions into an ensemble model. SUMMA can be applied to a variety of binary classification problems in bioinformatics including but not limited to gene network inference, cancer diagnostics, drug response prediction, somatic mutation, and differential expression calling. In this application note, we introduce the R/PY-SUMMA packages, available in R or Python, that implement the SUMMA algorithm.


Subject(s)
Computational Biology/statistics & numerical data , Gene Regulatory Networks/genetics , Unsupervised Machine Learning/statistics & numerical data , Algorithms , Models, Statistical
7.
PLoS Comput Biol ; 15(4): e1006937, 2019 04.
Article in English | MEDLINE | ID: mdl-30973878

ABSTRACT

Gestational alcohol exposure causes fetal alcohol spectrum disorder (FASD) and is a prominent cause of neurodevelopmental disability. Whole transcriptome sequencing (RNA-Seq) offers insights into mechanisms underlying FASD, but gene-level analysis provides limited information regarding complex transcriptional processes such as alternative splicing and non-coding RNAs. Moreover, traditional analytical approaches that use multiple hypothesis testing with a false discovery rate adjustment prioritize genes based on an adjusted p-value, which is not always biologically relevant. We addressed these limitations with a novel approach: we implemented an unsupervised machine learning model and applied it to an exon-level analysis to reduce data complexity to the most likely functionally relevant exons, without loss of novel information. This was performed on an RNA-Seq paired-end dataset derived from alcohol-exposed neural fold-stage chick crania, wherein alcohol causes facial deficits recapitulating those of FASD. A principal component analysis along with k-means clustering was utilized to extract exons that deviated from baseline expression. This identified 6857 differentially expressed exons representing 1251 geneIDs; 391 of these genes were identified in a prior gene-level analysis of this dataset. It also identified exons encoding 23 microRNAs (miRNAs) having significantly differential expression profiles in response to alcohol. We developed an RDAVID pipeline to identify KEGG pathways represented by these exons, and separately identified predicted KEGG pathways targeted by these miRNAs. Several of these (ribosome biogenesis, oxidative phosphorylation) were identified in our prior gene-level analysis. Other pathways are crucial to facial morphogenesis and represent both novel (focal adhesion, FoxO signaling, insulin signaling) and known (Wnt signaling) alcohol targets. Importantly, there was substantial overlap between the exomes themselves and the predicted miRNA targets, suggesting these miRNAs contribute to the gene-level expression changes. Our novel application of unsupervised machine learning in conjunction with statistical analyses facilitated the discovery of signaling pathways and miRNAs that inform mechanisms underlying FASD.
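The k-means step mentioned in this abstract can be illustrated with a small Python sketch. It assumes the points have already been projected onto a few principal components (the PCA step is omitted), and the function name `k_means`, the caller-supplied starting centroids, and the toy coordinates are hypothetical and not taken from the study.

```python
import math


def k_means(points, centroids, max_iter=100):
    """Plain Lloyd-style k-means; returns (assignments, centroids)."""
    centroids = [tuple(c) for c in centroids]
    assignments = []
    for _ in range(max_iter):
        # Assignment step: each point joins its nearest centroid.
        new_assignments = [
            min(range(len(centroids)), key=lambda j: math.dist(p, centroids[j]))
            for p in points
        ]
        if new_assignments == assignments:
            break  # converged: no point changed cluster
        assignments = new_assignments
        # Update step: move each centroid to the mean of its members.
        for j in range(len(centroids)):
            members = [p for p, a in zip(points, assignments) if a == j]
            if members:
                centroids[j] = tuple(sum(c) / len(members) for c in zip(*members))
    return assignments, centroids


# Hypothetical 2-D scores: three points near a baseline and two that deviate.
scores = [(0.0, 0.1), (0.2, 0.0), (0.1, -0.1), (5.0, 5.1), (5.2, 4.9)]
labels, centers = k_means(scores, [(0.0, 0.0), (1.0, 1.0)])
```

In this toy setting the cluster whose centroid lies far from the origin plays the role of the items that deviate from baseline; how the study itself chose and interpreted clusters is not described in the abstract.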


Subject(s)
Exons/genetics , Fetal Alcohol Spectrum Disorders/genetics , MicroRNAs/genetics , Unsupervised Machine Learning , Animals , Big Data , Chick Embryo , Cluster Analysis , Computational Biology , Databases, Nucleic Acid/statistics & numerical data , Disease Models, Animal , Ethanol/toxicity , Female , Gene Expression Profiling/statistics & numerical data , Humans , Pregnancy , Principal Component Analysis , Unsupervised Machine Learning/statistics & numerical data
8.
Behav Sci Law ; 37(3): 214-222, 2019 May.
Article in English | MEDLINE | ID: mdl-30609102

ABSTRACT

For decades, our ability to predict suicide has remained at near-chance levels. Machine learning has recently emerged as a promising tool for advancing suicide science, particularly in the domain of suicide prediction. The present review provides an introduction to machine learning and its potential application to open questions in suicide research. Although only a few studies have implemented machine learning for suicide prediction, results to date indicate considerable improvement in accuracy and positive predictive value. Potential barriers to algorithm integration into clinical practice are discussed, as well as attendant ethical issues. Overall, machine learning approaches hold promise for accurate, scalable, and effective suicide risk detection; however, many critical questions and issues remain unexplored.


Subject(s)
Ethics, Medical , Machine Learning/legislation & jurisprudence , Suicide/ethics , Suicide/legislation & jurisprudence , Algorithms , Cluster Analysis , Decision Support Techniques , Humans , Longitudinal Studies , Machine Learning/ethics , Probability , Research , Risk Assessment/legislation & jurisprudence , Unsupervised Machine Learning/ethics , Unsupervised Machine Learning/legislation & jurisprudence , Unsupervised Machine Learning/statistics & numerical data , Suicide Prevention
9.
Brief Bioinform ; 20(4): 1269-1279, 2019 07 19.
Article in English | MEDLINE | ID: mdl-29272335

ABSTRACT

With the recent developments in the field of multi-omics integration, the interest in factors such as data preprocessing, choice of the integration method and the number of different omics considered has increased. In this work, the impact of these factors is explored when solving the problem of sample classification, by comparing the performances of five unsupervised algorithms: Multiple Canonical Correlation Analysis, Multiple Co-Inertia Analysis, Multiple Factor Analysis, Joint and Individual Variation Explained and Similarity Network Fusion. These methods were applied to three real data sets taken from literature and several ad hoc simulated scenarios to discuss classification performance in different conditions of noise and signal strength across the data types. The impact of experimental design, feature selection and parameter training has also been evaluated to unravel important conditions that can affect the accuracy of the result.


Subject(s)
Computational Biology/methods , Systems Integration , Unsupervised Machine Learning , Algorithms , Animals , Cluster Analysis , Computer Simulation , Databases, Factual , Factor Analysis, Statistical , Genomics/statistics & numerical data , Humans , Metabolomics/statistics & numerical data , Mice , Models, Biological , Multivariate Analysis , Proteomics/statistics & numerical data , Systems Biology , Unsupervised Machine Learning/statistics & numerical data
10.
Comput Inform Nurs ; 36(5): 242-248, 2018 May.
Article in English | MEDLINE | ID: mdl-29494361

ABSTRACT

This study explored the use of unsupervised machine learning to identify subgroups of patients with heart failure who used telehealth services in the home health setting, and examined intercluster differences for patient characteristics related to medical history, symptoms, medications, psychosocial assessments, and healthcare utilization. Using a feature selection algorithm, we selected seven variables from 557 patients for clustering. We tested three clustering techniques: hierarchical, k-means, and partitioning around medoids. Hierarchical clustering was identified as the best technique using internal validation methods. Intercluster differences among patient characteristics and outcomes were assessed with either χ² test or one-way analysis of variance. Ranging in size from 153 to 233 patients, three clusters displayed patterns that differed significantly (P < .05) in patient characteristics of age, sex, medical history of comorbid conditions, use of beta blockers, and quality of life assessment. Significant (P < .001) intercluster differences in number of medications, comorbidities, and healthcare utilization were also revealed. The study identified patterns of association between (1) mental health status, pulmonary disorders, and obesity, and (2) healthcare utilization for patients with heart failure who used telehealth in the home health setting. Study results also revealed a lack of prescription guideline-recommended heart failure medications for the subgroup with the highest proportion of older female adults.
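The chi-square comparison of a categorical characteristic across clusters can be sketched in Python as follows. The function name `chi_square_statistic` and the example table are hypothetical; the sketch computes only Pearson's statistic and the degrees of freedom, assumes every row and column total is nonzero, and leaves the p-value to a statistics package.

```python
def chi_square_statistic(table):
    """Pearson chi-square statistic and degrees of freedom for a contingency table.

    `table` is a list of rows (e.g. one row per cluster), each a list of counts
    (e.g. one count per category). All row and column totals must be nonzero.
    """
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    statistic = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count under independence of row and column variables.
            expected = row_totals[i] * col_totals[j] / total
            statistic += (observed - expected) ** 2 / expected
    dof = (len(table) - 1) * (len(table[0]) - 1)
    return statistic, dof


# Hypothetical 2 x 2 table of counts, not data from the study.
statistic, dof = chi_square_statistic([[10, 20], [20, 10]])
```

For a continuous characteristic, the abstract's one-way analysis of variance would take the place of this test.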


Subject(s)
Heart Failure/classification , Home Care Services/statistics & numerical data , Patient Acceptance of Health Care , Telemedicine , Unsupervised Machine Learning/statistics & numerical data , Aged , Aged, 80 and over , Comorbidity , Female , Humans , Male , Models, Statistical , Retrospective Studies
11.
Pac Symp Biocomput ; 23: 123-132, 2018.
Article in English | MEDLINE | ID: mdl-29218875

ABSTRACT

Electronic Health Records (EHRs) contain a wealth of patient data useful to biomedical researchers. At present, both the extraction of data and methods for analyses are frequently designed to work with a single snapshot of a patient's record. Health care providers often perform and record actions in small batches over time. By extracting these care events, a sequence can be formed providing a trajectory for a patient's interactions with the health care system. These care events also offer a basic heuristic for the level of attention a patient receives from health care providers. We show that it is possible to learn meaningful embeddings from these care events using two deep learning techniques, unsupervised autoencoders and long short-term memory networks. We compare these methods to traditional machine learning methods which require a point-in-time snapshot to be extracted from an EHR.


Subject(s)
Critical Care/statistics & numerical data , Machine Learning/statistics & numerical data , Computational Biology/methods , Databases, Factual/statistics & numerical data , Electronic Health Records/statistics & numerical data , Female , Humans , Male , Supervised Machine Learning/statistics & numerical data , Unsupervised Machine Learning/statistics & numerical data
12.
Addict Behav ; 65: 289-295, 2017 02.
Article in English | MEDLINE | ID: mdl-27568339

ABSTRACT

INTRODUCTION: Nonmedical use of prescription medications/drugs (NMUPD) is a serious public health threat, particularly in relation to the prescription opioid analgesics abuse epidemic. While attention to this problem has been growing, there remains an urgent need to develop novel strategies in the field of "digital epidemiology" to better identify, analyze and understand trends in NMUPD behavior. METHODS: We conducted surveillance of the popular microblogging site Twitter by collecting 11 million tweets filtered for three commonly abused prescription opioid analgesic drugs: Percocet® (acetaminophen/oxycodone), OxyContin® (oxycodone), and Oxycodone. Unsupervised machine learning was applied on the subset of tweets for each analgesic drug to discover underlying latent themes regarding risk behavior. A two-step process of obtaining themes and filtering out unwanted tweets was carried out in three subsequent rounds of machine learning. RESULTS: Using this methodology, 2.3M tweets were identified that contained content relevant to analgesic NMUPD. The underlying themes were identified for each drug and the most representative tweets of each theme were annotated for NMUPD behavioral risk factors. The primary themes identified provide evidence of high levels of social media discussion about polydrug abuse on Twitter. This included specific mention of various polydrug combinations including use of other classes of prescription drugs, and illicit drug abuse. CONCLUSIONS: This study presents a methodology to filter Twitter content for NMUPD behavior, while also identifying underlying themes with minimal human intervention. Results from the study track accurately with the inclusion/exclusion criteria used to isolate NMUPD-related risk behaviors of interest and also provide insight on NMUPD behavior that has a high level of social media engagement. Results suggest that this could be a viable methodology for use in big data substance abuse surveillance, data collection, and analysis in comparison to other studies that rely upon content analysis and human coding schemes.


Subject(s)
Opioid-Related Disorders/epidemiology , Prescription Drug Misuse/statistics & numerical data , Social Media/statistics & numerical data , Unsupervised Machine Learning/statistics & numerical data , Humans , Risk Factors
13.
Pac Symp Biocomput ; 21: 504-15, 2016.
Article in English | MEDLINE | ID: mdl-26776213

ABSTRACT

Online social media microblogs may be a valuable resource for timely identification of critical ad hoc health-related incidents or serious epidemic outbreaks. In this paper, we explore emotion classification of Twitter microblogs related to localized public health threats, and study whether the public mood can be effectively utilized for early discovery of, or alerting to, such events. We analyse user tweets around recent incidents of Ebola, finding differences in the expression of emotions in tweets posted before and after the incidents emerged. We also analyse differences in the nature of the tweets in the immediately affected area as compared to areas remote to the events. The results of this analysis suggest that emotions in social media microblogging data (from Twitter in particular) may be utilized effectively as a source of evidence for disease outbreak detection and monitoring.


Subject(s)
Emotions/classification , Public Health Surveillance/methods , Social Media/statistics & numerical data , Bayes Theorem , Computational Biology/methods , Computational Biology/statistics & numerical data , Disease Outbreaks/statistics & numerical data , Hemorrhagic Fever, Ebola/epidemiology , Hemorrhagic Fever, Ebola/psychology , Humans , Time Factors , Unsupervised Machine Learning/statistics & numerical data
14.
Comput Methods Programs Biomed ; 119(3): 163-80, 2015 May.
Article in English | MEDLINE | ID: mdl-25843807

ABSTRACT

Gene expression data analysis is based on the assumption that co-expressed genes imply co-regulated genes. This assumption is being reformulated because the co-expression of a group of genes may be the result of an independent activation with respect to the same experimental condition and not due to the same regulatory regime. For this reason, traditional techniques have recently been improved with the use of prior biological knowledge from open-access repositories together with gene expression data. Biclustering is an unsupervised machine learning technique that searches for patterns in gene expression data matrices. A scatter search-based biclustering algorithm that integrates biological information is proposed in this paper. In addition to the gene expression data matrix, the input of the algorithm is only a direct annotation file that relates each gene to a set of terms from a biological repository where genes are annotated. Two different biological measures, FracGO and SimNTO, are proposed to integrate this information by adding it to the fitness function to be optimized in the scatter search scheme. The measure FracGO is based on the biological enrichment and SimNTO is based on the overlap among GO annotations of pairs of genes. Experimental results evaluate the proposed algorithm for two datasets and show the algorithm performs better when biological knowledge is integrated. Moreover, the analysis and comparison between the two different biological measures is presented and it is concluded that the differences depend on both the data source and how the annotation file has been built when GO is used. It is also shown that the proposed algorithm obtains a greater number of enriched biclusters than other classical biclustering algorithms typically used as benchmarks, and an analysis of the overlap among biclusters reveals that the biclusters obtained show low overlap. The proposed methodology is a general-purpose algorithm which allows the integration of biological information from several sources and can be extended to other biclustering algorithms based on the optimization of a merit function.


Subject(s)
Algorithms , Gene Expression Profiling/statistics & numerical data , Molecular Sequence Annotation/statistics & numerical data , Unsupervised Machine Learning/statistics & numerical data , Cluster Analysis , Data Mining , Databases, Genetic/statistics & numerical data , Gene Ontology/statistics & numerical data , Genes, Fungal , Knowledge Bases , Yeasts/genetics