ABSTRACT
BACKGROUND: Under- or late identification of pulmonary embolism (PE), a thrombosis of 1 or more pulmonary arteries that seriously threatens patients' lives, is a major challenge confronting modern medicine. OBJECTIVE: We aimed to establish accurate and informative machine learning (ML) models to identify patients at high risk for PE as they are admitted to the hospital, before their initial clinical checkup, by using only the information in their medical records. METHODS: We collected demographics, comorbidities, and medications data for 2568 patients with PE and 52,598 control patients. We focused on data available prior to emergency department admission, as these are the most universally accessible data. We trained an ML random forest algorithm to detect PE at the earliest possible time during a patient's hospitalization, at the time of his or her admission. We developed and applied 2 ML-based methods specifically to address the data imbalance between PE and non-PE patients, which causes misdiagnosis of PE. RESULTS: The resulting models predicted PE based on age, sex, BMI, past clinical PE events, chronic lung disease, past thrombotic events, and usage of anticoagulants, obtaining an 80% geometric mean value for the PE and non-PE classification accuracies. Although on hospital admission only 4% (1942/46,639) of the patients had a diagnosis of PE, we identified 2 clustering schemes comprising subgroups with more than 61% (705/1120 in clustering scheme 1; 427/701 and 340/549 in clustering scheme 2) positive patients for PE. One subgroup in the first clustering scheme included 36% (705/1942) of all patients with PE who were characterized by a definite past PE diagnosis, a 6-fold higher prevalence of deep vein thrombosis, and a 3-fold higher prevalence of pneumonia, compared with patients of the other subgroups in this scheme.
In the second clustering scheme, 2 subgroups (1 of only men and 1 of only women) included patients who all had a past PE diagnosis and a relatively high prevalence of pneumonia, and a third subgroup included only those patients with a past diagnosis of pneumonia. CONCLUSIONS: This study established an ML tool for early diagnosis of PE almost immediately upon hospital admission. Despite the highly imbalanced scenario, which undermines accurate PE prediction, and despite using only information available from the patient's medical history, our models were both accurate and informative, enabling the identification of patients already at high risk for PE upon hospital admission, even before the initial clinical checkup was performed. The fact that we did not restrict our patients to those at high risk for PE according to previously published scales (eg, Wells or revised Geneva scores) enabled us to accurately assess the application of ML on raw medical data and identify new, previously unidentified risk factors for PE, such as previous pulmonary disease, in general populations.
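The geometric mean reported above balances the classification accuracies of the rare PE class and the abundant non-PE class, so a classifier cannot score well by ignoring the minority. A minimal sketch of this metric (the function name and interface are ours, not the paper's):

```python
import numpy as np

def geometric_mean_score(y_true, y_pred):
    """Geometric mean of the per-class accuracies (here, PE sensitivity
    and non-PE specificity), robust to class imbalance."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    per_class_acc = [np.mean(y_pred[y_true == c] == c) for c in np.unique(y_true)]
    return float(np.prod(per_class_acc) ** (1.0 / len(per_class_acc)))

# A classifier that labels everything non-PE scores 0 despite high plain accuracy.
print(geometric_mean_score([0, 0, 0, 0, 1, 1], [0, 0, 0, 0, 0, 0]))  # 0.0
```

Unlike ordinary accuracy, this score collapses to zero as soon as one class is entirely misclassified, which is why it suits the 4% PE prevalence described above.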
Subjects
Machine Learning , Pulmonary Embolism , Humans , Pulmonary Embolism/diagnosis , Male , Risk Factors , Female , Middle Aged , Aged , Early Diagnosis , Hospitalization/statistics & numerical data , Adult , Patient Admission/statistics & numerical data
ABSTRACT
Deep learning approaches are gradually being applied to electronic health record (EHR) data, but they fail to incorporate medical diagnosis codes and real-valued laboratory tests into a single input sequence for temporal modeling. Therefore, the modeling misses the existing medical interrelations among codes and lab test results that should be exploited to promote early disease detection. To find connections between past diagnoses, represented by medical codes, and real-valued laboratory tests, in order to exploit the full potential of the EHR in medical diagnosis, we present a novel method to embed the two sources of data into a recurrent neural network. Experimenting with a database of patients with Crohn's disease (CD), a type of inflammatory bowel disease, and their controls (~1:2.2), we show that the introduction of lab test results improves the network's predictive performance more than the introduction of past diagnoses but also, surprisingly, more than when both are combined. In addition, using bootstrapping, we generalize the analysis of the imbalanced database to a medical condition that simulates real-life prevalence of a high-risk CD group of first-degree relatives, with results that make our embedding method ready to screen this group in the population.
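The core idea of embedding diagnosis codes and real-valued lab tests into a single RNN input sequence can be sketched as follows. This is an illustrative simplification under assumed names; the vocabulary, embedding size, and normalization are hypothetical, not the paper's actual design:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical vocabulary mixing diagnosis codes and lab-test identifiers.
vocab = {"ICD:K50.9": 0, "ICD:R10.4": 1, "LAB:CRP": 2}
emb_dim = 8
emb = rng.normal(size=(len(vocab), emb_dim))  # embeddings (learned in practice)

def event_vector(code, lab_value=0.0):
    """Embed a medical code and append a (normalized) lab value, yielding
    one time-step input for a recurrent network."""
    return np.concatenate([emb[vocab[code]], [lab_value]])

# A patient's history becomes one sequence of (emb_dim + 1)-dim vectors.
seq = np.stack([event_vector("ICD:K50.9"),
                event_vector("LAB:CRP", lab_value=2.1)])
print(seq.shape)  # (2, 9)
```

The point is that codes and labs share one time axis, so the recurrent layer can model their interrelations rather than treating the two sources separately.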
Subjects
Electronic Health Records , Inflammatory Bowel Diseases , Humans , Neural Networks, Computer , Databases, Factual , Inflammatory Bowel Diseases/diagnosis
ABSTRACT
Convolutional neural networks (CNNs) have achieved superior accuracy in many visual-related tasks. However, the inference process through a CNN's intermediate layers is opaque, making it difficult to interpret such networks or develop trust in their operation. In this article, we introduce the SIGN method for modeling the network's hidden layer activity using probabilistic models. The activity patterns in layers of interest are modeled as Gaussian mixture models, and transition probabilities between clusters in consecutive modeled layers are estimated to identify paths of inference. For fully connected networks, the entire layer activity is clustered, and the resulting model is a hidden Markov model. For convolutional layers, spatial columns of activity are clustered, and a maximum likelihood model is developed for mining an explanatory inference graph. The graph describes the hierarchy of activity clusters most relevant for network prediction. We show that such inference graphs are useful for understanding the general inference process of a class, as well as explaining the (correct or incorrect) decisions the network makes about specific images. In addition, SIGN provides interesting observations regarding hidden layer activity in general, including the concentration of memorization in a single middle layer in fully connected networks, and the highly local nature of column activities in the top CNN layers.
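The layer-modeling step can be illustrated roughly as follows: fit a Gaussian mixture to each layer's activity and estimate cluster-to-cluster transition probabilities between consecutive layers. This sketch uses random stand-in activations and scikit-learn's `GaussianMixture`; it is not the paper's SIGN implementation:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
n_samples, n_clusters = 500, 4
act_l1 = rng.normal(size=(n_samples, 8))  # stand-in activations, layer i
act_l2 = rng.normal(size=(n_samples, 6))  # stand-in activations, layer i+1

# Model each layer's activity with a Gaussian mixture and assign clusters.
c1 = GaussianMixture(n_clusters, random_state=0).fit_predict(act_l1)
c2 = GaussianMixture(n_clusters, random_state=0).fit_predict(act_l2)

# Estimate transition probabilities between clusters of consecutive layers.
counts = np.zeros((n_clusters, n_clusters))
for a, b in zip(c1, c2):
    counts[a, b] += 1
row_sums = counts.sum(axis=1, keepdims=True)
trans = np.divide(counts, row_sums, out=np.zeros_like(counts), where=row_sums > 0)
```

Chaining such transition matrices across layers yields the paths of inference described above; for fully connected layers this pairing of mixtures and transitions is exactly the structure of a hidden Markov model.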
ABSTRACT
Responsiveness to levodopa varies greatly among patients with Parkinson's disease (PD). The factors that affect it are ill defined. The aim of the study was to identify factors predictive of long-term response to levodopa. The medical records of 296 patients with PD (mean age of onset, 62.2 ± 9.7 years) were screened for demographics, previous treatments, and clinical phenotypes. All patients were assessed with the Unified PD Rating Scale (UPDRS)-III before and 3 months after levodopa initiation. Regression and machine-learning analyses were used to determine factors that are associated with levodopa responsiveness and might identify patients who will benefit from treatment. The UPDRS-III score improved by ≥ 30% (good response) in 128 patients (43%). On regression analysis, female gender, young age at onset, and early use of dopamine agonists predicted a good response. Time to initiation of levodopa treatment had no effect on responsiveness except in patients older than 72 years, who were less responsive. Machine-learning analysis validated these factors and added several others: symptoms of rigidity and bradykinesia, disease onset in the legs and on the left side, and fewer white matter vascular ischemic changes, comorbidities, and pre-motor and non-motor symptoms. The main determinants of variations in levodopa responsiveness are gender, age, and clinical phenotype. Early use of dopamine agonists does not hamper levodopa responsiveness. In addition to validating the regression analysis results, machine-learning methods helped to determine the specific clinical phenotype of patients who may benefit from levodopa in terms of comorbidities and pre-motor and non-motor symptoms.
Subjects
Levodopa , Parkinson Disease , Antiparkinson Agents/therapeutic use , Dopamine Agonists/therapeutic use , Female , Humans , Levodopa/therapeutic use , Machine Learning , Parkinson Disease/complications
ABSTRACT
Lipid profiles in biological fluids from patients with Parkinson's disease (PD) are increasingly investigated in search of biomarkers. However, the lipid profiles in genetic PD remain to be determined, a gap of knowledge of particular interest in PD associated with mutant α-synuclein (SNCA), given the known relationship between this protein and lipids. The objective of this research is to identify serum lipid composition from SNCA A53T mutation carriers and to compare these alterations to those found in cells and transgenic mice carrying the same genetic mutation. We conducted an unbiased lipidomic analysis of 530 lipid species from 34 lipid classes in serum of 30 participants with SNCA mutation with and without PD and 30 healthy controls. The primary analysis was done between 22 PD patients with SNCA+ (SNCA+/PD+) and 30 controls using machine-learning algorithms and traditional statistics. We also analyzed the lipid composition of human clonal-cell lines and tissue from transgenic mice overexpressing the same SNCA mutation. We identified specific lipid classes that best discriminate between SNCA+/PD+ patients and healthy controls and found certain lipid species, mainly from the glycerophosphatidylcholine and triradylglycerol classes, that are most contributory to this discrimination. Most of these alterations were also present in human derived cells and transgenic mice carrying the same mutation. Our combination of lipidomic and machine learning analyses revealed alterations in glycerophosphatidylcholine and triradylglycerol in sera from PD patients as well as cells and tissues expressing mutant α-Syn. Further investigations are needed to establish the pathogenic significance of these α-Syn-associated lipid changes.
ABSTRACT
BACKGROUND: The role of the lipidome as a biomarker for Parkinson's disease (PD) is a relatively new field that currently only focuses on PD diagnosis. OBJECTIVE: To identify a relevant lipidome signature for PD severity markers. METHODS: Disease severity of 149 PD patients was assessed by the Unified Parkinson's Disease Rating Scale (UPDRS) and the Montreal Cognitive Assessment (MoCA). The lipid composition of whole blood samples was analyzed, consisting of 517 lipid species from 37 classes; these included all major classes of glycerophospholipids, sphingolipids, glycerolipids, and sterols. To handle the high number of lipids, the selection of lipid species and classes was consolidated via analysis of interrelations between lipidomics and disease severity prediction using the random forest machine-learning algorithm aided by conventional statistical methods. RESULTS: Specific lipid classes dihydrosphingomyelin (dhSM), plasmalogen phosphatidylethanolamine (PEp), glucosylceramide (GlcCer), dihydro globotriaosylceramide (dhGB3), and to a lesser degree dihydro GM3 ganglioside (dhGM3), as well as species dhSM(20:0), PEp(38:6), PEp(42:7), GlcCer(16:0), GlcCer(24:1), dhGM3(22:0), dhGM3(16:0), and dhGB3(16:0) contribute to PD severity prediction of UPDRS III score. These, together with age, age at onset, and disease duration, also contribute to prediction of UPDRS total score. We demonstrate that certain lipid classes and species interrelate differently with the degree of severity of motor symptoms between men and women, and that predicting intermediate disease stages is more accurate than predicting less or more severe stages. CONCLUSION: Using machine-learning algorithms and methodologies, we identified lipid signatures that enable prediction of motor severity in PD. Future studies should focus on identifying the biological mechanisms linking GlcCer, dhGB3, dhSM, and PEp with PD severity.
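The random-forest step of ranking lipid species and classes by their contribution to severity prediction can be sketched with scikit-learn's feature importances. The data below are synthetic stand-ins, not the study's lipidomics:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(4)
# Synthetic stand-in: 149 samples x 20 "lipid species"; severity driven by two.
X = rng.normal(size=(149, 20))
severity = 3 * X[:, 0] - 2 * X[:, 5] + rng.normal(scale=0.5, size=149)

rf = RandomForestRegressor(n_estimators=200, random_state=0).fit(X, severity)
ranked = np.argsort(rf.feature_importances_)[::-1]  # most contributory first
print(ranked[:2])  # the two signal-carrying "lipids"
```

In the study this ranking is what consolidates 517 lipid species down to the handful of classes and species (dhSM, PEp, GlcCer, dhGB3, dhGM3) reported as contributory.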
Subjects
Lipidomics , Parkinson Disease , Biomarkers , Female , Humans , Lipids , Machine Learning , Male , Parkinson Disease/diagnosis , Severity of Illness Index
ABSTRACT
Studies reveal that the false alarm rate (FAR) demonstrated by intensive care unit (ICU) vital signs monitors ranges from 0.72 to 0.99. We applied machine learning (ML) to ICU multi-sensor information to imitate a medical specialist in diagnosing patient condition. We hypothesized that applying this data-driven approach to medical monitors will help reduce the FAR even when data from sensors are missing. An expert-based rules algorithm identified and tagged seven clinical alarm scenarios in our dataset. We compared a random forest (RF) ML model, trained on the tagged data from which parameters (e.g., heart rate or blood pressure) were deliberately removed, against the full expert-based rules (FER), our ground truth, and against partial expert-based rules (PER) missing these parameters, in detecting ICU signals. When all alarm scenarios were examined, RF and FER were almost identical. However, in the absence of one to three parameters, RF maintained its values of the Youden index (0.94-0.97) and positive predictive value (PPV) (0.98-0.99), whereas PER lost its value (0.54-0.8 and 0.76-0.88, respectively). While the FAR for PER with missing parameters was 0.17-0.39, it was only 0.01-0.02 for RF. When scenarios were examined separately, RF showed clear superiority in almost all combinations of scenarios and numbers of missing parameters. When sensor data are missing, specialist performance worsens with the number of missing parameters, whereas the RF model attains high accuracy and low FAR due to its ability to fuse information from available sensors, compensating for missing parameters.
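The two evaluation metrics used above can be computed directly from a confusion matrix. A small sketch, assuming the FAR is the fraction of raised alarms that are false (consistent with the reported monitor FARs of 0.72-0.99):

```python
def youden_index(tp, fn, tn, fp):
    """Youden's J = sensitivity + specificity - 1."""
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return sensitivity + specificity - 1.0

def false_alarm_rate(fp, tp):
    """Fraction of raised alarms that are false (assumed definition)."""
    return fp / (fp + tp)

print(youden_index(tp=90, fn=10, tn=80, fp=20))  # ~0.7
print(false_alarm_rate(fp=2, tp=98))             # 0.02
```

Note that the FAR in this sense is the complement of the PPV, which is why the RF model's PPV of 0.98-0.99 corresponds to its FAR of 0.01-0.02.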
Subjects
Clinical Alarms/statistics & numerical data , Intensive Care Units , Machine Learning , Critical Care/statistics & numerical data , Decision Support Techniques , Expert Systems , False Positive Reactions , Humans , Knowledge Bases , Monitoring, Physiologic/statistics & numerical data , Pattern Recognition, Automated/statistics & numerical data , Retrospective Studies
ABSTRACT
Objective: Amyotrophic lateral sclerosis (ALS) disease state prediction usually assumes linear progression and uses a classifier evaluated by its accuracy. Since disease progression is not linear, and the accuracy measurement cannot tell large from small prediction errors, we dispense with the linearity assumption and apply ordinal classification that accounts for error severity. In addition, we identify the most influential variables in predicting and explaining the disease. Furthermore, in contrast to conventional modeling of the patient's total functionality, we also model separate patient functionalities (e.g., in walking or speaking). Methods: Using data from 3772 patients from the Pooled Resource Open-Access ALS Clinical Trials (PRO-ACT) database, we introduce and train ordinal classifiers to predict patients' disease state in their last clinic visit, while accounting differently for different error severities. We use feature-selection methods and the classifiers themselves to determine the most influential variables in predicting the disease from demographic, clinical, and laboratory data collected in either the first, last, or both clinic visits, and the Bayesian network classifier to identify interrelations among these variables and their relations with the disease state. We apply these methods to model each of the patient functionalities. Results: We show the error distribution in ALS state prediction and demonstrate that ordinal classifiers outperform classifiers that do not account for error severity. We identify clinical and lab test variables influential to prediction of different ALS functionalities and their interrelations, and specific value combinations of these variables that occur more frequently in patients with severe deterioration than in patients with mild deterioration and vice versa. Conclusions: Ordinal classification of ALS state is superior to conventional classification. 
Identification of influential ALS variables and their interrelations help explain disease mechanism. Modeling of patient functionalities separately allows relation of variables and their connections to different aspects of the disease as may be expressed in different body segments.
ABSTRACT
While young drivers (YDs) constitute ~10% of the driver population, their fatality rate in motorcycle accidents is up to three times higher. Thus, we are interested in predicting fatal motorcycle accidents (FMAs), and in identifying their key factors and possible causes. Accurate prediction of YD FMAs from data by risk minimization using the 0/1 loss function (i.e., the ordinary classification accuracy) cannot be guaranteed because these accidents are only ~1% of all YD motorcycle accidents, and classifiers tend to focus on the majority class of minor accidents at the expense of the minority class of fatal ones. Also, classifiers are usually uninformative (providing no information about the distribution of misclassifications), insensitive to error severity (making no distinction between misclassification of fatal accidents as severe or minor), and limited in identifying key factors. We propose to use an information measure (IM) that jointly maximizes accuracy and information and is sensitive to the error distribution and severity. Using a database of ~3600 motorcycle accidents, a Bayesian network classifier optimized by IM predicted FMAs better than classifiers maximizing accuracy or other predictive or information measures, and identified fatal accident key factors and causal relations.
Subjects
Accidents, Traffic/mortality , Motorcycles/statistics & numerical data , Adolescent , Adult , Bayes Theorem , Data Accuracy , Female , Humans , Male , Risk Assessment , Young Adult
ABSTRACT
The identification of catalytic residues is an essential step in functional characterization of enzymes. We present a purely structural approach to this problem, which is motivated by the difficulty evolution-based methods have in annotating structural genomics targets that have few or no homologs in the databases. Our approach combines a state-of-the-art support vector machine (SVM) classifier with novel structural features that augment structural clues by spatial averaging and Z scoring. Special attention is paid to the class imbalance problem that stems from the overwhelming number of non-catalytic residues in enzymes compared to catalytic residues. This problem is tackled by: (1) optimizing the classifier to maximize a performance criterion that considers both Type I and Type II errors in the classification of catalytic and non-catalytic residues; (2) under-sampling non-catalytic residues before SVM training; and (3) during SVM training, penalizing errors in learning catalytic residues more than errors in learning non-catalytic residues. Tested on four enzyme datasets, one specifically designed by us to mimic the structural genomics scenario and three previously evaluated datasets, our structure-based classifier is never inferior to similar structure-based classifiers and is comparable to classifiers that use both structural and evolutionary features. In addition to the evaluation of the performance of catalytic residue identification, we also present detailed case studies on three proteins. This analysis suggests that many false positive predictions may correspond to binding sites and other functional residues. A web server that implements the method, our own-designed database, and the source code of the programs are publicly available at http://www.cs.bgu.ac.il/~meshi/functionPrediction.
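Remedies (2) and (3) for the class imbalance, under-sampling the majority class and penalizing minority-class errors more heavily, can be sketched with scikit-learn's `SVC` on synthetic stand-in data (the sampling sizes and class weights are illustrative, not the paper's settings):

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(1)
# Synthetic stand-in: many "non-catalytic" (0) vs few "catalytic" (1) residues.
X_maj = rng.normal(0.0, 1.0, size=(950, 5))
X_min = rng.normal(2.0, 1.0, size=(50, 5))

# (2) Under-sample the majority class before training ...
keep = rng.choice(len(X_maj), size=200, replace=False)
X_train = np.vstack([X_maj[keep], X_min])
y_train = np.array([0] * 200 + [1] * 50)

# (3) ... and penalize errors on the minority class more heavily.
clf = SVC(kernel="rbf", class_weight={0: 1.0, 1: 4.0}).fit(X_train, y_train)
recall_min = (clf.predict(X_min) == 1).mean()
```

Both measures push the decision boundary away from the minority class, trading a few extra false positives for far fewer missed catalytic residues.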
Subjects
Artificial Intelligence , Enzymes/chemistry , Genomics/methods , Catalytic Domain , Databases, Protein , Protein Conformation
ABSTRACT
In this paper, we modify the fuzzy ARTMAP (FA) neural network (NN) using the Bayesian framework in order to improve its classification accuracy while simultaneously reducing its category proliferation. The proposed algorithm, called Bayesian ARTMAP (BA), preserves the FA advantages and also enhances its performance by the following: (1) representing a category using a multidimensional Gaussian distribution, (2) allowing a category to grow or shrink, (3) limiting a category hypervolume, (4) using Bayes' decision theory for learning and inference, and (5) employing the probabilistic association between every category and a class in order to predict the class. In addition, the BA estimates the class posterior probability and thereby enables the introduction of loss and classification according to the minimum expected loss. Based on these characteristics and using synthetic and 20 real-world databases, we show that the BA outperforms the FA, either trained for one epoch or until completion, with respect to classification accuracy, sensitivity to statistical overlapping, learning curves, expected loss, and category proliferation.
Subjects
Algorithms , Artificial Intelligence , Bayes Theorem , Fuzzy Logic , Neural Networks, Computer , Computer Simulation , Databases as Topic , Diagnosis, Computer-Assisted , Image Interpretation, Computer-Assisted , Normal Distribution , Pattern Recognition, Automated , Software , Software Validation
ABSTRACT
Signal segmentation and classification of fluorescence in situ hybridization (FISH) images are essential for the detection of cytogenetic abnormalities. Since current methods are limited to dot-like signal analysis, we propose a methodology for segmentation and classification of dot and non-dot-like signals. First, nuclei are segmented from their background and from each other in order to associate signals with specific isolated nuclei. Second, subsignals composing non-dot-like signals are detected and clustered into signals. Features are measured for the signals, and a subset of these features is selected to represent the signals to a multiclass classifier. Classification is accomplished using a naive Bayesian classifier (NBC) or a multilayer perceptron. When applied to a FISH image database, dot and non-dot-like signals were segmented almost perfectly and then classified with accuracy of approximately 80% by either of the classifiers.
Subjects
Artificial Intelligence , Cell Nucleus/genetics , Chromosome Aberrations , In Situ Hybridization, Fluorescence/methods , Pattern Recognition, Automated/methods , Sequence Analysis, DNA/methods , Algorithms , Chromosome Mapping/methods , Humans
ABSTRACT
Solving a multiclass classification task using a small imbalanced database of patterns of high dimension is difficult due to the curse of dimensionality and the bias of the training toward the majority classes. Such a problem has arisen while diagnosing genetic abnormalities by classifying a small database of fluorescence in situ hybridization signals of types having different frequencies of occurrence. We propose and experimentally study, using the cytogenetic domain, two solutions to the problem. The first is hierarchical decomposition of the classification task, where each hierarchy level is designed to tackle a simpler problem which is represented by classes that are approximately balanced. The second solution is balancing the data by up-sampling the minority classes accompanied by dimensionality reduction. Implemented by the naive Bayesian classifier or the multilayer perceptron neural network, both solutions have diminished the problem and contributed to accuracy improvement. In addition, the experiments suggest that coping with the smallness of the data is more beneficial than dealing with its imbalance.
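The second solution, up-sampling the minority classes accompanied by dimensionality reduction, can be sketched as follows on synthetic stand-in data (class sizes and the PCA dimension are illustrative, not the study's):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(2)
# Synthetic stand-in: small, imbalanced, high-dimensional multiclass data.
X = rng.normal(size=(120, 50))
y = np.array([0] * 80 + [1] * 30 + [2] * 10)

# Up-sample each minority class (with replacement) to the majority size ...
n_max = np.bincount(y).max()
idx = np.concatenate([rng.choice(np.flatnonzero(y == c), size=n_max, replace=True)
                      for c in np.unique(y)])
X_bal, y_bal = X[idx], y[idx]

# ... accompanied by dimensionality reduction against the curse of dimensionality.
X_red = PCA(n_components=10).fit_transform(X_bal)
print(X_bal.shape, X_red.shape)  # (240, 50) (240, 10)
```

After this step every class contributes equally to training while the reduced dimension keeps the sample-to-feature ratio manageable for the downstream classifier.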
Subjects
Artificial Intelligence , Chromosome Mapping/methods , Image Interpretation, Computer-Assisted/methods , In Situ Hybridization, Fluorescence/methods , Pattern Recognition, Automated/methods , Sequence Analysis, DNA/methods , Algorithms , Cluster Analysis , Cytogenetic Analysis , Databases, Factual , Humans , Information Storage and Retrieval , Microscopy, Fluorescence/methods
ABSTRACT
We propose and investigate the fuzzy ARTMAP neural network in offline and online classification of fluorescence in situ hybridization image signals enabling clinical diagnosis of numerical genetic abnormalities. We evaluate the classification task (detecting several abnormalities separately or simultaneously), classifier paradigm (monolithic or hierarchical), ordering strategy for the training patterns (averaging or voting), training mode (for one epoch, with validation or until completion), and model sensitivity to parameters. We find the fuzzy ARTMAP accurate in accomplishing both tasks, requiring only very few training epochs. Also, selecting a training ordering by voting is more precise than averaging over orderings. If trained for only one epoch, the fuzzy ARTMAP provides fast, yet stable and accurate learning as well as insensitivity to model complexity. Early stopping of training using a validation set reduces the fuzzy ARTMAP complexity as for other machine learning models but cannot improve accuracy beyond that achieved when training is completed. Compared to other machine learning models, the fuzzy ARTMAP does not lose but gains accuracy when overtrained, although it increases its number of categories. Learned incrementally, the fuzzy ARTMAP reaches its ultimate accuracy very fast, obtaining most of its data representation capability and accuracy using only a few examples. Finally, the fuzzy ARTMAP accuracy for this domain is comparable with those of the multilayer perceptron and support vector machine and superior to those of the naive Bayesian and linear classifiers.
Subjects
Chromosomes, Human, Pair 13/genetics , Down Syndrome/diagnosis , Down Syndrome/genetics , Image Interpretation, Computer-Assisted/methods , In Situ Hybridization, Fluorescence/methods , Trisomy/diagnosis , Trisomy/genetics , Artificial Intelligence , Fuzzy Logic , Genetic Testing/methods , Humans , Microscopy, Fluorescence/methods , Online Systems , Pattern Recognition, Automated/methods , Reproducibility of Results , Sensitivity and Specificity
ABSTRACT
Previous research has indicated the significance of accurate classification of fluorescence in situ hybridisation (FISH) signals for the detection of genetic abnormalities. Based on well-discriminating features and a trainable neural network (NN) classifier, a previous system enabled highly-accurate classification of valid signals and artefacts of two fluorophores. However, since this system employed several features that are considered independent, the naive Bayesian classifier (NBC) is suggested here as an alternative to the NN. The NBC independence assumption permits the decomposition of the high-dimensional likelihood of the model for the data into a product of one-dimensional probability densities. The naive independence assumption together with the Bayesian methodology allow the NBC to predict a posteriori probabilities of class membership using estimated class-conditional densities in a closed and simple form. Since the probability densities are the only parameters of the NBC, the misclassification rate of the model is determined exclusively by the quality of density estimation. Densities are evaluated by three methods: single Gaussian estimation (SGE; parametric method), Gaussian mixture model assuming spherical covariance matrices (GMM; semi-parametric method) and kernel density estimation (KDE; non-parametric method). For low-dimensional densities, the GMM generally outperforms the KDE, which tends to overfit the training set at the cost of reduced generalisation capability. However, the GMM loses some accuracy when modelling higher-dimensional densities due to the violation of the assumption of spherical covariance matrices when dependent features are added to the set. Compared with these two methods, the SGE and NN provide inferior and superior performance, respectively. However, the NBC avoids the intensive training and optimisation required for the NN, demanding extensive resources and experimentation. 
Therefore, when supporting these two classifiers, the system enables a trade-off between the NN performance and NBC simplicity of implementation.
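A naive Bayesian classifier built on estimated one-dimensional class-conditional densities, here with the non-parametric KDE option, can be sketched as below. The class name and interface are ours, and SciPy's `gaussian_kde` uses its default bandwidth rather than any tuning from the study:

```python
import numpy as np
from scipy.stats import gaussian_kde

class KDENaiveBayes:
    """Naive Bayesian classifier whose one-dimensional class-conditional
    densities are estimated non-parametrically by KDE; the independence
    assumption reduces the joint likelihood to a product of 1-D densities."""

    def fit(self, X, y):
        self.classes_ = np.unique(y)
        self.priors_ = {c: np.mean(y == c) for c in self.classes_}
        self.kdes_ = {c: [gaussian_kde(X[y == c][:, j]) for j in range(X.shape[1])]
                      for c in self.classes_}
        return self

    def predict(self, X):
        log_post = []
        for c in self.classes_:
            lp = np.log(self.priors_[c]) * np.ones(len(X))
            for j, kde in enumerate(self.kdes_[c]):
                lp += np.log(kde(X[:, j]) + 1e-300)  # guard against log(0)
            log_post.append(lp)
        return self.classes_[np.argmax(log_post, axis=0)]

# Toy two-class data with well-separated 1-D densities.
rng = np.random.default_rng(5)
X = np.vstack([rng.normal(0, 1, (60, 3)), rng.normal(3, 1, (60, 3))])
y = np.array([0] * 60 + [1] * 60)
nbc = KDENaiveBayes().fit(X, y)
```

Swapping `gaussian_kde` for a single Gaussian or a mixture model reproduces the SGE and GMM variants discussed above; the classifier's structure is otherwise unchanged, which is why density quality alone determines its misclassification rate.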