Results 1 - 15 of 15
1.
PLoS Comput Biol ; 18(12): e1010718, 2022 12.
Article in English | MEDLINE | ID: mdl-36520712

ABSTRACT

Applying computational statistics or machine learning methods to data is a key component of many scientific studies, in any field, but alone it might not be sufficient to generate robust and reliable outcomes and results. Before applying any discovery method, preprocessing steps are necessary to prepare the data for computational analysis. In this framework, data cleaning and feature engineering are key pillars of any scientific study involving data analysis, and they should be adequately designed and performed from the first phases of the project. We call a "feature" a variable describing a particular trait of a person or an observation, usually recorded as a column in a dataset. Even though pivotal, these data cleaning and feature engineering steps are sometimes done poorly or inefficiently, especially by beginners and inexperienced researchers. For this reason, we propose here our quick tips for data cleaning and feature engineering on how to carry out these important preprocessing steps correctly, avoiding common mistakes and pitfalls. Although we designed these guidelines with bioinformatics and health informatics scenarios in mind, we believe they can be applied more generally to any scientific area. We therefore target these guidelines at any researcher or practitioner wanting to perform data cleaning or feature engineering. We believe our simple recommendations can help researchers and scholars perform better computational analyses that can lead, in turn, to more solid outcomes and more reliable discoveries.
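As a minimal illustration of the two preprocessing pillars discussed in the abstract, the sketch below applies basic cleaning (category harmonisation, explicit missing-value handling) and feature engineering (log transform, a derived flag, one-hot encoding) to a small invented pandas DataFrame; the column names and values are hypothetical, not taken from the paper.

```python
import numpy as np
import pandas as pd

# Hypothetical patient dataset with typical raw-data problems:
# missing values, inconsistent categories, and a skewed numeric feature.
raw = pd.DataFrame({
    "age": [34, 51, np.nan, 29, 62],
    "sex": ["F", "f", "M", "M", None],
    "creatinine": [0.9, 1.1, 7.5, 0.8, 1.0],   # 7.5 is a plausible outlier
})

df = raw.copy()

# Data cleaning: harmonise categories and handle missing values explicitly.
df["sex"] = df["sex"].str.upper().fillna("UNKNOWN")
df["age"] = df["age"].fillna(df["age"].median())

# Feature engineering: derive new columns from existing ones.
df["log_creatinine"] = np.log1p(df["creatinine"])     # tame the skew
df["is_elderly"] = (df["age"] >= 60).astype(int)      # clinically motivated flag
df = pd.get_dummies(df, columns=["sex"])              # one-hot encode the category

print(df.columns.tolist())
```

Each step is deliberately explicit (no silent dropping of rows), which is the kind of documented, reproducible preprocessing the guidelines advocate.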


Subjects
Computational Biology , Machine Learning , Humans , Computational Biology/methods , Engineering
2.
Entropy (Basel) ; 23(1)2021 Jan 12.
Article in English | MEDLINE | ID: mdl-33445650

ABSTRACT

In this paper, we deal with the classical Statistical Learning Theory problem of bounding, with high probability, the true risk R(h) of a hypothesis h chosen from a set H of m hypotheses. The Union Bound (UB) allows one to state that P{L(R̂(h), δ·q_h) ≤ R(h) ≤ U(R̂(h), δ·p_h)} ≥ 1-δ, where R̂(h) is the empirical error, if it is possible to prove that P{R(h) ≥ L(R̂(h), δ)} ≥ 1-δ and P{R(h) ≤ U(R̂(h), δ)} ≥ 1-δ, when h, q_h, and p_h are chosen before seeing the data, such that q_h, p_h ∈ [0,1] and ∑_{h∈H}(q_h + p_h) = 1. If no a priori information is available, q_h and p_h are set to 1/(2m), i.e., equally distributed. This approach gives poor results since, as a matter of fact, a learning procedure targets just particular hypotheses, namely hypotheses with small empirical error, disregarding the others. In this work we set q_h and p_h in a distribution-dependent way, increasing the probability of being chosen for functions with small true risk. We call this proposal the Distribution-Dependent Weighted UB (DDWUB) and we retrieve sufficient conditions on the choice of q_h and p_h under which DDWUB outperforms or, in the worst case, degenerates into UB. Furthermore, theoretical and numerical results show the applicability, the validity, and the potential of DDWUB.
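The core idea of weighting the union bound can be illustrated numerically. The sketch below uses a one-sided Hoeffding bound and an illustrative exponential weighting of the confidence budget; this is not the DDWUB weighting from the paper, only a toy demonstration that concentrating the budget on small-empirical-error hypotheses tightens their bounds.

```python
import numpy as np

rng = np.random.default_rng(0)
n, m, delta = 200, 50, 0.05

# Hypothetical empirical errors of m hypotheses; a learning procedure
# cares mostly about the few with small empirical error.
emp_err = np.sort(rng.uniform(0.0, 0.5, m))

def upper_bound(r_hat, conf):
    # One-sided Hoeffding bound: R(h) <= R_hat(h) + sqrt(ln(1/conf) / (2n))
    return r_hat + np.sqrt(np.log(1.0 / conf) / (2 * n))

# Uniform allocation: every hypothesis gets delta/m of the budget.
uniform = upper_bound(emp_err, delta / m)

# Weighted allocation: give more budget to small-empirical-error hypotheses
# (an illustrative exponential weighting, not the paper's DDWUB weights).
w = np.exp(-10 * emp_err)
w /= w.sum()
weighted = upper_bound(emp_err, delta * w)

# For the best (smallest-empirical-error) hypothesis, the weighted bound is tighter.
print(uniform[0], weighted[0])
```

The price is looser bounds for large-error hypotheses, which a learning procedure will not select anyway; that trade is exactly what the distribution-dependent weighting exploits.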

3.
Entropy (Basel) ; 23(8)2021 Aug 14.
Article in English | MEDLINE | ID: mdl-34441187

ABSTRACT

In many decision-making scenarios, ranging from recreational activities to healthcare and policing, the use of artificial intelligence coupled with the ability to learn from historical data is becoming ubiquitous. This widespread adoption of automated systems is accompanied by increasing concerns regarding their ethical implications. Fundamental rights, such as those that require the preservation of privacy, forbid discrimination based on sensitive attributes (e.g., gender, ethnicity, political/sexual orientation), or require one to be provided an explanation for a decision, are daily undermined by the use of increasingly complex and less understandable, yet more accurate, learning algorithms. For this purpose, in this work, we work toward the development of systems able to ensure trustworthiness by delivering privacy, fairness, and explainability by design. In particular, we show that it is possible to simultaneously learn from data while preserving the privacy of the individuals thanks to the use of Homomorphic Encryption, ensure fairness by learning a fair representation from the data, and ensure explainable decisions with local and global explanations, without compromising the accuracy of the final models. We test our approach on a widespread but still controversial application, namely face recognition, using the recent FairFace dataset to prove the validity of our approach.

4.
Front Oncol ; 12: 845936, 2022.
Article in English | MEDLINE | ID: mdl-35756625

ABSTRACT

Neuroblastoma (NB) is the most common extracranial malignant tumor in children. Although the survival rate of NB has improved over the years, the outcome still remains poor for over 30% of cases. A more accurate risk stratification remains a key point in the study of NB, and the availability of novel prognostic biomarkers of "high-risk" disease at diagnosis could help improve patient stratification and outcome prediction. In this paper we show a biomarker discovery approach applied to the plasma of 172 NB patients. Plasma samples from a first cohort of NB patients and age-matched healthy controls were used for untargeted metabolomics analysis based on high-resolution mass spectrometry (HRMS). Differential expression analysis highlighted a number of metabolites annotated with a high degree of identification. Among them, 3-O-methyldopa (3-O-MD) was validated in a second cohort of NB patients using a targeted metabolite profiling approach, and its prognostic potential was analyzed by survival analysis on patients with 3 years of follow-up. High expression of 3-O-MD was associated with worse prognosis in the subset of patients with stage M tumors (log-rank p < 0.05) and, among them, it was confirmed as a prognostic factor able to stratify high-risk patients older than 18 months. 3-O-MD might thus be considered a novel prognostic biomarker of NB, eligible to be included at diagnosis among catecholamine metabolite panels in prospective clinical studies. Further studies are warranted to exploit other potential biomarkers highlighted using our approach.
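The survival analysis step (log-rank comparison of a high- vs a low-marker group) can be sketched as follows. The follow-up times below are invented, and the hand-rolled log-rank test is the standard textbook construction, not the paper's code.

```python
import numpy as np
from scipy import stats

def logrank_p(time_a, event_a, time_b, event_b):
    """Two-group log-rank test; returns the p-value (chi-square, 1 df)."""
    times = np.concatenate([time_a, time_b])
    events = np.concatenate([event_a, event_b]).astype(bool)
    in_a = np.concatenate([np.ones(len(time_a), bool), np.zeros(len(time_b), bool)])
    o_minus_e, var = 0.0, 0.0
    for t in np.unique(times[events]):                 # distinct event times
        at_risk = times >= t
        n, n_a = at_risk.sum(), (at_risk & in_a).sum()
        d = (events & (times == t)).sum()              # events at time t
        d_a = (events & (times == t) & in_a).sum()
        o_minus_e += d_a - d * n_a / n                 # observed minus expected in A
        if n > 1:
            var += d * (n_a / n) * (1 - n_a / n) * (n - d) / (n - 1)
    return 1 - stats.chi2.cdf(o_minus_e ** 2 / var, df=1)

# Hypothetical follow-up data: months to event and event indicator (1 = death)
# for a "high 3-O-MD" and a "low 3-O-MD" group.
high_t, high_e = np.array([3, 5, 7, 9, 12]), np.array([1, 1, 1, 1, 1])
low_t, low_e = np.array([30, 32, 34, 36, 36]), np.array([1, 1, 1, 0, 0])
p = logrank_p(high_t, high_e, low_t, low_e)
print(f"log-rank p = {p:.4f}")
```

With clearly separated survival curves like these, the test rejects the null of equal hazard, mirroring the "log-rank p < 0.05" finding reported for stage M patients.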

5.
BioData Min ; 14(1): 12, 2021 Feb 03.
Article in English | MEDLINE | ID: mdl-33536030

ABSTRACT

BACKGROUND: Sepsis is a life-threatening clinical condition that occurs when the patient's body has an excessive reaction to an infection, and it should be treated within one hour. Due to the urgency of sepsis, doctors and physicians often do not have enough time to perform laboratory tests and analyses to help them forecast the consequences of the sepsis episode. In this context, machine learning can provide a fast computational prediction of sepsis severity, patient survival, and sequential organ failure by analyzing only the electronic health records of the patients. Machine learning can also be employed to understand which features in the medical records are most predictive of sepsis severity, patient survival, and sequential organ failure, in a fast and non-invasive way. DATASET AND METHODS: In this study, we analyzed a dataset of electronic health records of 364 patients collected between 2014 and 2016. The medical record of each patient has 29 clinical features, and includes a binary value for survival, a binary value for septic shock, and a numerical value for the sequential organ failure assessment (SOFA) score. We used each of these three factors separately as an independent target, and employed several machine learning methods to predict it (binary classifiers for survival and septic shock, and regression analysis for the SOFA score). Afterwards, we used a data mining approach to identify the most important dataset features in relation to each of the three targets separately, and compared these results with the results achieved through a standard biostatistics approach. RESULTS AND CONCLUSIONS: Our results showed that machine learning can be employed efficiently to predict septic shock, SOFA score, and survival of patients diagnosed with sepsis from their electronic health record data. Regarding clinical feature ranking, our results showed that Random Forests feature selection identified several unexpected symptoms and clinical components as relevant for septic shock, SOFA score, and survival. These discoveries can help doctors and physicians understand and predict septic shock. We made the analyzed dataset and our developed software code publicly available online.
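The modelling setup described in the abstract (three targets treated independently, plus feature ranking) can be sketched with scikit-learn. All values below are synthetic stand-ins for the EHR features, with the cohort size and feature count taken from the abstract; the model choices are illustrative, not the paper's exact pipeline.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(42)
n, p = 364, 29                      # cohort size and feature count from the paper

# Synthetic stand-in for the EHR features (the real data are not reproduced here).
X = rng.normal(size=(n, p))
survival = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.5, size=n) > 0).astype(int)
sofa = 5 + 2 * X[:, 2] + rng.normal(scale=1.0, size=n)   # numeric SOFA-like target

# Binary classification target (survival) ...
clf = RandomForestClassifier(n_estimators=200, random_state=0)
acc = cross_val_score(clf, X, survival, cv=5).mean()

# ... and regression target (SOFA-like score), treated independently as in the study.
reg = RandomForestRegressor(n_estimators=200, random_state=0)
r2 = cross_val_score(reg, X, sofa, cv=5, scoring="r2").mean()

# Feature ranking: impurity-based importances from the fitted classifier.
clf.fit(X, survival)
top = np.argsort(clf.feature_importances_)[::-1][:3]
print(f"accuracy={acc:.2f}, R2={r2:.2f}, top features={top.tolist()}")
```

Running each target through its own estimator, then ranking features per target, reproduces the "disjoint targets, then ranking" structure the abstract describes.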

6.
Health Informatics J ; 27(1): 1460458220984205, 2021.
Article in English | MEDLINE | ID: mdl-33504243

ABSTRACT

Liver cancer kills approximately 800,000 people annually worldwide, and its most common subtype is hepatocellular carcinoma (HCC), which usually affects people with cirrhosis. Predicting the survival of patients with HCC remains an important challenge, especially because the technologies needed for this purpose are not available in all hospitals. In this context, machine learning applied to medical records can be a fast, low-cost tool to predict survival and detect the most predictive features in health records. In this study, we analyzed the medical data of 165 patients with HCC: we employed computational intelligence to predict their survival, and to detect the most relevant clinical factors able to discriminate surviving from deceased cases. Afterwards, we compared our data mining results with those obtained through statistical tests and scientific literature findings. Our analysis revealed that blood levels of alkaline phosphatase (ALP), alpha-fetoprotein (AFP), and hemoglobin are the most effective prognostic factors in this dataset. We found literature supporting the association of these three factors with hepatoma, even though only AFP has been used in a prognostic index. Our results suggest that ALP and hemoglobin can be candidates for future HCC prognostic indexes, and that physicians could focus on ALP, AFP, and hemoglobin when studying HCC records.
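The comparison between a data-mining ranking and a univariate biostatistics view can be sketched as follows. The "lab values" are simulated (hypothetical distributions, not the study's data); only the analysis pattern, Random Forest importances versus a Mann-Whitney U test per feature, mirrors the study.

```python
import numpy as np
from scipy.stats import mannwhitneyu
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(7)
n = 165                                        # cohort size from the paper

# Simulated "lab" features plus pure-noise features (all values invented).
alp = rng.lognormal(mean=4.5, sigma=0.3, size=n)
afp = rng.lognormal(mean=2.0, sigma=1.0, size=n)
hgb = rng.normal(13, 1.5, size=n)
noise = rng.normal(size=(n, 5))
score = 0.01 * alp + 0.1 * afp - 0.5 * hgb + rng.normal(scale=1.5, size=n)
died = (score > np.median(score)).astype(int)  # synthetic outcome label

X = np.column_stack([alp, afp, hgb, noise])
names = ["ALP", "AFP", "hemoglobin"] + [f"noise{i}" for i in range(5)]

# Data-mining view: Random Forest feature importances.
rf = RandomForestClassifier(n_estimators=300, random_state=0).fit(X, died)
rank = [names[i] for i in np.argsort(rf.feature_importances_)[::-1]]

# Biostatistics view: univariate Mann-Whitney U test per feature.
pvals = {name: mannwhitneyu(X[died == 1, j], X[died == 0, j]).pvalue
         for j, name in enumerate(names)}

print(rank[:3], round(pvals["hemoglobin"], 4))
```

When the two views agree on the top features, as the study reports for ALP, AFP, and hemoglobin, the ranking is more credible than either method alone.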


Subjects
Hepatocellular Carcinoma , Liver Neoplasms , Alkaline Phosphatase , Artificial Intelligence , Hepatocellular Carcinoma/diagnosis , Humans , alpha-Fetoproteins
7.
IEEE/ACM Trans Comput Biol Bioinform ; 18(6): 2759-2765, 2021.
Article in English | MEDLINE | ID: mdl-33259306

ABSTRACT

Myocardial infarctions and heart failure are the cause of more than 17 million deaths annually worldwide. ST-segment elevation myocardial infarctions (STEMI) require timely treatment, because delays of minutes have serious clinical impacts. Machine learning can provide alternative ways to predict heart failure and identify genes involved in heart failure. To these ends, we applied a Random Forests classifier enhanced with feature elimination to microarray gene expression data of 111 patients diagnosed with STEMI, and measured the classification performance through standard metrics such as the Matthews correlation coefficient (MCC) and the area under the receiver operating characteristic curve (ROC AUC). Afterwards, we used the same approach to rank all genes by importance and to detect the genes most strongly associated with heart failure. We validated this ranking by literature review and gene set enrichment analysis. Our classifier achieved MCC = +0.87 and ROC AUC = 0.918 in predicting heart failure, and our analysis identified KLHL22, WDR11, OR4Q3, GPATCH3, and FAH as the top five protein-coding genes related to heart failure. Our results confirm the effectiveness of machine learning feature elimination in predicting heart failure from gene expression, and the top genes found by our approach can help biologists and cardiologists further their understanding of heart failure.


Subjects
Computational Biology/methods , Heart Failure/genetics , Machine Learning , Statistical Models , Transcriptome/genetics , Algorithms , Decision Trees , Humans
8.
Open Res Eur ; 1: 110, 2021.
Article in English | MEDLINE | ID: mdl-37645142

ABSTRACT

BACKGROUND: The air traffic management (ATM) system has historically coped with a global increase in traffic demand, ultimately leading to increased operational complexity. When dealing with the impact of this increasing complexity on system safety, it is crucial to automatically analyse losses of separation (LoSs) using tools able to extract meaningful and actionable information from safety reports. Current research in this field mainly exploits natural language processing (NLP) to categorise the reports, with the limitations that the considered categories need to be manually annotated by experts and that general taxonomies are seldom exploited. METHODS: To address these gaps, the authors propose to perform exploratory data analysis on safety reports, combining state-of-the-art techniques like topic modelling and clustering, and then to develop an algorithm able to extract the Toolkit for ATM Occurrence Investigation (TOKAI) taxonomy factors from the free-text safety reports based on syntactic analysis. TOKAI is an investigation tool developed by EUROCONTROL, and its taxonomy is intended to become a standard and harmonised approach to future investigations. RESULTS: Leveraging the LoS events reported in the public databases of the Comisión de Estudio y Análisis de Notificaciones de Incidentes de Tránsito Aéreo and the United Kingdom Airprox Board, the authors show how their proposal is able to automatically extract meaningful and actionable information from safety reports, as well as to classify their content according to the TOKAI taxonomy. The quality of the approach is also indirectly validated by checking the connection between the identified factors and the main contributor of the incidents. CONCLUSIONS: The authors' results are a promising first step toward the full automation of a general analysis of LoS reports, supported by results on real-world data coming from two different sources. In the future, the authors' proposal could be extended to other taxonomies or tailored to identify factors to be included in the safety taxonomies.

9.
IEEE Trans Neural Netw Learn Syst ; 29(10): 4660-4671, 2018 10.
Article in English | MEDLINE | ID: mdl-29990207

ABSTRACT

When dealing with kernel methods, one has to decide which kernel and which values of the hyperparameters to use. Resampling techniques can address this issue, but these procedures are time-consuming. This problem is particularly challenging when dealing with structured data, in particular with graphs, since several kernels for graph data have been proposed in the literature, but no clear relationship among them in terms of learning properties has been defined. In these cases, exhaustive search seems to be the only reasonable approach. Recently, the global Rademacher complexity (RC) and the local Rademacher complexity (LRC), two powerful measures of the complexity of a hypothesis space, have been shown to be suited for studying the properties of kernels. In particular, the LRC is able to bound the generalization error of a hypothesis chosen in a space by disregarding those hypotheses that will not be taken into account by any learning procedure because of their high error. In this paper, we show a new approach to efficiently bound the RC of the space induced by a kernel, since its exact computation is an NP-hard problem. We then show, for the first time, that the RC can be used to estimate the accuracy and expressivity of different graph kernels under different parameter configurations. Our claims are supported by experimental results on several real-world graph datasets.
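The quantity being bounded, the empirical Rademacher complexity, can be estimated by Monte Carlo for a finite hypothesis set. The sketch below uses a toy set of threshold classifiers on scalar inputs, purely to illustrate the definition the paper works with; it is not the paper's efficient bounding method.

```python
import numpy as np

rng = np.random.default_rng(0)
n, n_mc = 100, 2000

# Toy data and a small finite hypothesis set: threshold classifiers on a scalar.
x = rng.uniform(-1, 1, n)
thresholds = np.linspace(-1, 1, 21)
H = np.sign(x[None, :] - thresholds[:, None])   # each row: one hypothesis's outputs

# Monte Carlo estimate of the empirical Rademacher complexity:
#   R_hat = E_sigma [ sup_h (1/n) * sum_i sigma_i * h(x_i) ]
sup_vals = []
for _ in range(n_mc):
    sigma = rng.choice([-1.0, 1.0], size=n)     # Rademacher signs
    sup_vals.append(np.max(H @ sigma) / n)      # sup over the hypothesis set
rad = float(np.mean(sup_vals))
print(f"empirical Rademacher complexity ~ {rad:.3f}")
```

For richer hypothesis spaces (such as those induced by graph kernels) the sup is not enumerable, which is why an efficient bound, rather than direct estimation, is needed.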

10.
Neural Netw ; 82: 62-75, 2016 Oct.
Article in English | MEDLINE | ID: mdl-27474843

ABSTRACT

We define in this work a new localized version of a Vapnik-Chervonenkis (VC) complexity, namely the Local VC-Entropy, and, building on this new complexity, we derive a new generalization bound for binary classifiers. The Local VC-Entropy-based bound improves on Vapnik's original results because it is able to discard those functions that, most likely, will not be selected during the learning phase. The result is achieved by applying the localization principle to the original global complexity measure, in the same spirit as the Local Rademacher Complexity. By exploiting and improving a recently developed geometrical framework, we show that it is also possible to relate the Local VC-Entropy to the Local Rademacher Complexity by finding an admissible range for one given the other. In addition, the Local VC-Entropy allows one to reduce the computational requirements that arise when dealing with the Local Rademacher Complexity in binary classification problems.


Subjects
Entropy , Machine Learning , Machine Learning/trends
11.
IEEE Trans Cybern ; 45(9): 1913-26, 2015 Sep.
Article in English | MEDLINE | ID: mdl-25347893

ABSTRACT

The purpose of this paper is to obtain a fully empirical stability-based bound on the generalization ability of a learning procedure, thus circumventing some limitations of the structural risk minimization framework. We show that assuming a desirable property of a learning algorithm is sufficient to make data-dependency explicit for stability, which, instead, is usually bounded only in an algorithm-dependent way. In addition, we prove that a well-known and widespread classifier, the support vector machine (SVM), satisfies this condition. The obtained bound is then exploited for model selection purposes in SVM classification and tested on a series of real-world benchmark datasets, demonstrating, in practice, the effectiveness of our approach.

12.
Neural Netw ; 65: 115-25, 2015 May.
Article in English | MEDLINE | ID: mdl-25734890

ABSTRACT

We derive in this paper a new Local Rademacher Complexity risk bound on the generalization ability of a model, which is able to take advantage of the availability of unlabeled samples. Moreover, this new bound improves state-of-the-art results even when no unlabeled samples are available.


Subjects
Artificial Intelligence , Statistical Models
13.
IEEE Trans Neural Netw Learn Syst ; 25(12): 2202-11, 2014 Dec.
Article in English | MEDLINE | ID: mdl-25420243

ABSTRACT

In this paper, we derive a deep connection between the Vapnik-Chervonenkis (VC) entropy and the Rademacher complexity. For this purpose, we first refine some previously known relationships between the two notions of complexity and then derive new results, which allow computing an admissible range for the Rademacher complexity, given a value of the VC-entropy, and vice versa. The approach adopted in this paper is new and relies on the careful analysis of the combinatorial nature of the problem. The obtained results improve the state of the art on this research topic.

14.
Neural Netw ; 44: 107-11, 2013 Aug.
Article in English | MEDLINE | ID: mdl-23587720

ABSTRACT

The problem of assessing the performance of a classifier, in the finite-sample setting, was addressed by Vapnik in his seminal work by using data-independent measures of complexity. Recently, several authors have addressed the same problem by proposing data-dependent measures, which tighten previous results by taking into account the actual data distribution. In this framework, we derive some data-dependent bounds on the generalization ability of a classifier by exploiting the Rademacher Complexity and recent concentration results: in addition to being appealing for practical purposes, as they exploit empirical quantities only, these bounds improve previously known results.


Subjects
Artificial Intelligence , Statistics as Topic/trends , Statistics as Topic/methods
15.
IEEE Trans Neural Netw Learn Syst ; 23(9): 1390-406, 2012 Sep.
Article in English | MEDLINE | ID: mdl-24807923

ABSTRACT

In-sample approaches to model selection and error estimation of support vector machines (SVMs) are not as widespread as out-of-sample methods, where part of the data is removed from the training set for validation and testing purposes, mainly because their practical application is not straightforward and the latter provide, in many cases, satisfactory results. In this paper, we survey some recent and not-so-recent results of the data-dependent structural risk minimization framework and propose a proper reformulation of the SVM learning algorithm, so that the in-sample approach can be effectively applied. The experiments, performed on both simulated and real-world datasets, show that our in-sample approach compares favorably to out-of-sample methods, especially in cases where the latter provide questionable results. In particular, when the number of samples is small compared to their dimensionality, as in the classification of microarray data, our proposal can outperform conventional out-of-sample approaches such as cross validation, leave-one-out, and bootstrap methods.
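The out-of-sample estimators the paper compares against (cross validation, leave-one-out, bootstrap) can be sketched in a small-sample, high-dimensional setting reminiscent of microarray data. The data are synthetic, and the simple out-of-bag bootstrap below is an illustrative variant, not the exact 632-style estimator.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import LeaveOneOut, cross_val_score
from sklearn.svm import SVC

# Small-sample, high-dimensional setting: 40 samples, 1000 features.
X, y = make_classification(n_samples=40, n_features=1000, n_informative=15,
                           random_state=0)
svm = SVC(kernel="linear", C=1.0)

# k-fold cross validation and leave-one-out error estimates.
cv_err = 1 - cross_val_score(svm, X, y, cv=5).mean()
loo_err = 1 - cross_val_score(svm, X, y, cv=LeaveOneOut()).mean()

# Simple bootstrap estimate: train on a resample, test on out-of-bag points.
rng = np.random.default_rng(0)
boot_errs = []
for _ in range(30):
    idx = rng.integers(0, len(y), len(y))                # sample with replacement
    oob = np.setdiff1d(np.arange(len(y)), idx)           # out-of-bag indices
    if len(oob) == 0 or len(np.unique(y[idx])) < 2:
        continue                                         # skip degenerate resamples
    svm.fit(X[idx], y[idx])
    boot_errs.append(np.mean(svm.predict(X[oob]) != y[oob]))
boot_err = float(np.mean(boot_errs))

print(f"5-fold={cv_err:.2f}, LOO={loo_err:.2f}, bootstrap={boot_err:.2f}")
```

With so few samples per fold, these estimates can disagree noticeably from run to run, which is the instability that motivates the paper's in-sample alternative.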
