Results 1 - 20 of 45
1.
PLoS Comput Biol ; 19(1): e1010778, 2023 01.
Article in English | MEDLINE | ID: mdl-36602952

ABSTRACT

Medical imaging is a great asset for modern medicine, since it allows physicians to spatially interrogate a disease site, resulting in precise interventions for diagnosis and treatment, and to observe particular aspects of patients' conditions that would otherwise go unnoticed. Computational analysis of medical images, moreover, can allow the discovery of disease patterns and correlations among cohorts of patients with the same disease, thus suggesting common causes or providing useful information for better therapies and cures. Machine learning and deep learning applied to medical images, in particular, have produced new, unprecedented results that can pave the way to advanced frontiers of medical discovery. While computational analysis of medical images has become easier, however, it has also become easier to make mistakes or generate inflated or misleading results, hindering reproducibility and deployment. In this article, we provide ten quick tips to perform computational analysis of medical images while avoiding common mistakes and pitfalls that we have noticed in multiple studies in the past. We believe that our ten guidelines, if put into practice, can help the computational medical imaging community perform better scientific research that eventually has a positive impact on the lives of patients worldwide.


Subject(s)
Machine Learning, Humans, Reproducibility of Results
2.
PLoS Comput Biol ; 19(7): e1011224, 2023 Jul.
Article in English | MEDLINE | ID: mdl-37410704

ABSTRACT

Data are the most important elements of bioinformatics: computational analysis of bioinformatics data, in fact, can help researchers infer new knowledge about biology, chemistry, biophysics, and sometimes even medicine, influencing treatments and therapies for patients. Bioinformatics and high-throughput biological data coming from different sources can be even more helpful, because each of these different data chunks can provide alternative, complementary information about a specific biological phenomenon, similar to multiple photos of the same subject taken from different angles. In this context, the integration of bioinformatics and high-throughput biological data plays a pivotal role in running a successful bioinformatics study. In the last decades, data originating from proteomics, metabolomics, metagenomics, phenomics, transcriptomics, and epigenomics have been collectively labelled omics data, and the integration of these omics data has gained importance in all biological areas. Although omics data integration is useful and relevant, its heterogeneity makes it easy to make mistakes during the integration phases. We therefore present ten quick tips to perform omics data integration correctly, avoiding common mistakes we have experienced or noticed in published studies in the past. Although we designed our ten guidelines for beginners, using a simple language that (we hope) can be understood by anyone, we believe our recommendations should be taken into account by all bioinformaticians performing omics data integration, including experts.
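A minimal sketch of the most basic integration step the abstract describes: aligning two omics tables on a shared sample identifier so that both "photos" describe the same individuals. File names and column labels here are hypothetical, not from the paper.

```python
import pandas as pd

# Each table: rows = samples, columns = features, indexed by a shared sample ID.
transcriptomics = pd.read_csv("transcriptomics.csv", index_col="sample_id")
proteomics = pd.read_csv("proteomics.csv", index_col="sample_id")

# Inner join keeps only samples measured in both experiments, so the two
# data layers really describe the same individuals.
integrated = transcriptomics.join(proteomics, how="inner",
                                  lsuffix="_rna", rsuffix="_prot")
print(f"{integrated.shape[0]} samples with both omics layers")
```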


Subject(s)
Genomics, Multiomics, Humans, Proteomics, Computational Biology, Metabolomics
3.
PLoS Comput Biol ; 19(7): e1011272, 2023 07.
Article in English | MEDLINE | ID: mdl-37471333

ABSTRACT

Some scientific studies involve huge amounts of bioinformatics data that cannot be analyzed on the personal computers researchers usually employ for day-to-day activities, and instead require effective computational infrastructures that work in a distributed way. For this purpose, distributed computing systems have become useful tools to analyze large amounts of bioinformatics data and to generate relevant results in virtual environments, where software can run for hours or even days without tying up a researcher's personal computer or laptop. Even though distributed computing resources have become pivotal in many bioinformatics laboratories, researchers and students often use them in the wrong way, making mistakes that can cause the distributed computers to underperform or even generate wrong outcomes. In this context, we present ten quick tips for using Apache Spark distributed computing systems for bioinformatics analyses: ten simple guidelines that, if taken into account, can help users avoid common mistakes and run their bioinformatics analyses smoothly. Although we designed our recommendations for beginners and students, they should be followed by experts too. We think our quick tips can help anyone use Apache Spark distributed computing systems more efficiently and ultimately generate better, more reliable scientific results.
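A minimal PySpark sketch of the kind of distributed analysis the abstract refers to: counting records per chromosome in a large tabular file that would not fit comfortably in a laptop's memory. The input file name and column are hypothetical.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("variant-counts").getOrCreate()

# Spark reads and partitions the file across the cluster's executors.
variants = spark.read.csv("variants.tsv", sep="\t", header=True)

# Transformations are lazy; the computation runs only when .show() is called.
variants.groupBy("chromosome").count().orderBy("chromosome").show()

spark.stop()
```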


Subject(s)
Computational Biology, Software, Humans, Computational Biology/methods, Computers, Computer Communication Networks
4.
PLoS Comput Biol ; 19(12): e1011700, 2023 Dec.
Article in English | MEDLINE | ID: mdl-38127800

ABSTRACT

Fuzzy logic is a useful tool to describe and represent biological or medical scenarios, where states and outcomes are often not completely true or completely false, but rather partially true or partially false. Despite its usefulness and spread, fuzzy logic modeling can easily be done in the wrong way, especially by beginners and inexperienced researchers, who might overlook important aspects or make common mistakes. Malpractices and pitfalls, in turn, can lead to wrong or overoptimistic, inflated results, with negative consequences for the biomedical research community trying to comprehend a particular phenomenon, or even for patients suffering from the investigated disease. To avoid common mistakes, we present here a list of quick tips for fuzzy logic modeling of any biomedical scenario: guidelines which should be taken into account by any fuzzy logic practitioner, including experts. We believe our best practices can have a strong impact on the scientific community, allowing researchers who follow them to obtain better, more reliable results and outcomes in biomedical contexts.
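A minimal sketch of the partial-truth idea at the heart of fuzzy logic: a triangular membership function mapping a measurement to a degree of truth in [0, 1]. The "elevated glucose" range used here is purely illustrative, not from the article.

```python
def triangular_membership(x: float, low: float, peak: float, high: float) -> float:
    """Degree to which x belongs to a fuzzy set with a triangular shape."""
    if x <= low or x >= high:
        return 0.0
    if x <= peak:
        return (x - low) / (peak - low)
    return (high - x) / (high - peak)

# Example: how "elevated" is a blood glucose of 115 mg/dL?
print(triangular_membership(115, low=100, peak=126, high=200))  # ~0.58
```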


Subject(s)
Fuzzy Logic, Medicine, Humans
5.
Bioinformatics ; 38(6): 1761-1763, 2022 03 04.
Article in English | MEDLINE | ID: mdl-34935889

ABSTRACT

SUMMARY: Having multiple datasets is a key aspect of robust bioinformatics analyses, because it allows researchers to confirm discoveries on multiple cohorts. For this purpose, Gene Expression Omnibus (GEO) can be a useful database, since it provides hundreds of thousands of microarray gene expression datasets freely available for download and use. Despite this wide availability, collecting prognostic datasets of a specific cancer type from GEO can be a long, time-consuming, and energy-consuming activity for any bioinformatician, who needs to do it manually by first performing a search on the GEO website and then checking all the datasets found, one by one. To solve this problem, we present geoCancerPrognosticDatasetsRetriever, a Perl 5 application which reads a cancer type and a list of microarray platforms, searches GEO for prognostic gene expression datasets of that cancer type based on those platforms, and returns the GEO accession codes of those datasets, if found. Our bioinformatics tool can generate in a few minutes a list of cancer prognostic datasets that would otherwise require numerous hours of manual work. geoCancerPrognosticDatasetsRetriever can handily retrieve multiple prognostic gene expression datasets of any cancer type, laying the foundations for numerous bioinformatics studies and meta-analyses that can have a strong impact on oncology research. AVAILABILITY AND IMPLEMENTATION: geoCancerPrognosticDatasetsRetriever is freely available under the GPLv2 license on the Comprehensive Perl Archive Network (CPAN) at https://metacpan.org/pod/App::geoCancerPrognosticDatasetsRetriever and on GitHub at https://github.com/AbbasAlameer/geoCancerPrognosticDatasetsRetriever. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
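A sketch of the kind of manual GEO query that such a tool automates, issued here through NCBI's public E-utilities API. The search term is purely illustrative, and this is not the tool's internal implementation.

```python
import requests

ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"
params = {
    "db": "gds",  # the GEO DataSets database
    "term": 'breast cancer AND survival AND gse[Entry Type]',
    "retmode": "json",
    "retmax": 20,
}
response = requests.get(ESEARCH, params=params, timeout=30)
ids = response.json()["esearchresult"]["idlist"]
print(f"Found {len(ids)} candidate GEO series: {ids}")
```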


Subject(s)
Gene Expression Profiling, Neoplasms, Humans, Gene Expression Profiling/methods, Prognosis, Computational Biology/methods, Gene Expression, Neoplasms/genetics
6.
PLoS Comput Biol ; 18(8): e1010348, 2022 08.
Article in English | MEDLINE | ID: mdl-35951505

ABSTRACT

Pathway enrichment analysis (PEA) is a computational biology method that identifies biological functions overrepresented in a group of genes beyond what would be expected by chance and ranks these functions by relevance. The relative abundance of genes pertinent to specific pathways is measured through statistical methods, and the associated functional pathways are retrieved from online bioinformatics databases. In the last decade, along with the spread of the internet, the higher availability of computational resources has made PEA software tools easy for bioinformatics practitioners worldwide to access and use. Although it became easier to use these tools, it also became easier to make mistakes that could generate inflated or misleading results, especially for beginners and inexperienced computational biologists. With this article, we propose nine quick tips to avoid common mistakes and to carry out a complete, sound, thorough PEA, which can produce relevant and robust results. We describe our nine guidelines in a simple way, so that they can be understood and used by anyone, including students and beginners. Some tips explain what to do before starting a PEA, others suggest how to correctly generate meaningful results, and the final guidelines indicate useful steps to properly interpret PEA results. Our nine tips can help users perform better pathway enrichment analyses and eventually contribute to a better understanding of current biology.
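A minimal sketch of the overrepresentation statistic at the core of classic PEA: the hypergeometric test. All counts below are purely illustrative.

```python
from scipy.stats import hypergeom

M = 20000  # genes in the background (the "universe")
K = 150    # background genes annotated to the pathway of interest
N = 300    # genes in the study list
k = 12     # study genes that fall in the pathway

# P(X >= k): probability of seeing at least k pathway genes by chance alone.
p_value = hypergeom.sf(k - 1, M, K, N)
print(f"enrichment p-value = {p_value:.3e}")
```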


Subject(s)
Computational Biology, Software, Computational Biology/methods, Databases, Factual, Humans
7.
PLoS Comput Biol ; 18(8): e1010395, 2022 08.
Article in English | MEDLINE | ID: mdl-36006874

ABSTRACT

Special sessions are important parts of scientific meetings and conferences: they gather researchers and students interested in a specific topic and can strongly contribute to the success of the conference itself. Moreover, they can be a first step for trainees and students toward organizing a scientific event. Organizing a special session, however, can be difficult for beginners and students. Here, we provide ten simple rules to follow to organize a special session at a scientific conference.


Subject(s)
Researchers, Students, Humans
8.
PLoS Comput Biol ; 18(12): e1010718, 2022 12.
Article in English | MEDLINE | ID: mdl-36520712

ABSTRACT

Applying computational statistics or machine learning methods to data is a key component of many scientific studies, in any field, but alone might not be sufficient to generate robust and reliable outcomes. Before applying any discovery method, preprocessing steps are necessary to prepare the data for the computational analysis. In this framework, data cleaning and feature engineering are key pillars of any scientific study involving data analysis, and they should be adequately designed and performed from the first phases of the project. We call a "feature" a variable describing a particular trait of a person or an observation, usually recorded as a column in a dataset. Even though they are pivotal, these data cleaning and feature engineering steps are sometimes done poorly or inefficiently, especially by beginners and inexperienced researchers. For this reason, we propose here our quick tips on how to carry out these important preprocessing steps correctly, avoiding common mistakes and pitfalls. Although we designed these guidelines with bioinformatics and health informatics scenarios in mind, we believe they can be applied more generally to any scientific area. We therefore target these guidelines at any researcher or practitioner wanting to perform data cleaning or feature engineering. We believe our simple recommendations can help researchers and scholars perform better computational analyses that can lead, in turn, to more solid outcomes and more reliable discoveries.
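A minimal pandas sketch of the two preprocessing stages the abstract describes: basic cleaning followed by simple feature engineering. The file name and column names are hypothetical.

```python
import pandas as pd

df = pd.read_csv("patients.csv")

# Cleaning: drop duplicate records and impute missing numeric values.
df = df.drop_duplicates()
df["age"] = df["age"].fillna(df["age"].median())

# Feature engineering: derive a new feature and encode a categorical one.
df["bmi"] = df["weight_kg"] / (df["height_m"] ** 2)
df = pd.get_dummies(df, columns=["smoking_status"], drop_first=True)

print(df.head())
```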


Subject(s)
Computational Biology, Machine Learning, Humans, Computational Biology/methods, Engineering
9.
J Biomed Inform ; 144: 104426, 2023 08.
Article in English | MEDLINE | ID: mdl-37352899

ABSTRACT

Although assessing binary classifications is a common task in scientific research, no consensus on a single statistic summarizing the confusion matrix has been reached so far. In recent studies, we demonstrated the advantages of the Matthews correlation coefficient (MCC) over other popular rates such as cross-entropy error, F1 score, accuracy, balanced accuracy, bookmaker informedness, diagnostic odds ratio, Brier score, and Cohen's kappa. In this study, we compared the MCC to two other statistics: the prevalence threshold (PT), frequently used in obstetrics and gynecology, and the Fowlkes-Mallows index, a metric employed in fuzzy logic and drug discovery. Through the investigation of the mutual relations among the three metrics and the study of some relevant use cases, we show that, when positive data elements and negative data elements have the same importance, the Matthews correlation coefficient is once again more informative than its two competitors.
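A worked sketch of the three statistics the abstract compares, computed directly from a confusion matrix. The counts are purely illustrative, and the PT expression follows the commonly cited definition PT = (sqrt(TPR*FPR) - FPR) / (TPR - FPR).

```python
from math import sqrt

TP, FN, TN, FP = 45, 5, 30, 20

tpr = TP / (TP + FN)  # sensitivity / recall
fpr = FP / (FP + TN)  # 1 - specificity
ppv = TP / (TP + FP)  # precision

mcc = (TP * TN - FP * FN) / sqrt((TP + FP) * (TP + FN) * (TN + FP) * (TN + FN))
fowlkes_mallows = sqrt(ppv * tpr)
prevalence_threshold = (sqrt(tpr * fpr) - fpr) / (tpr - fpr)

print(f"MCC = {mcc:.3f}, FM = {fowlkes_mallows:.3f}, PT = {prevalence_threshold:.3f}")
```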


Subject(s)
Algorithms, Fuzzy Logic, Prevalence, Drug Discovery, Entropy
10.
BMC Genomics ; 21(1): 6, 2020 Jan 02.
Article in English | MEDLINE | ID: mdl-31898477

ABSTRACT

BACKGROUND: To evaluate binary classifications and their confusion matrices, scientific researchers can employ several statistical rates, according to the goal of the experiment they are investigating. Despite being a crucial issue in machine learning, no widespread consensus has been reached on a unified elective measure yet. Accuracy and F1 score computed on confusion matrices have been (and still are) among the most popular metrics in binary classification tasks. However, these statistical measures can dangerously show overoptimistic, inflated results, especially on imbalanced datasets. RESULTS: The Matthews correlation coefficient (MCC), instead, is a more reliable statistical rate which produces a high score only if the prediction obtained good results in all four confusion matrix categories (true positives, false negatives, true negatives, and false positives), proportionally both to the size of positive elements and to the size of negative elements in the dataset. CONCLUSIONS: In this article, we show how MCC produces a more informative and truthful score in evaluating binary classifications than accuracy and F1 score, by first explaining its mathematical properties and then demonstrating the advantages of MCC in six synthetic use cases and in a real genomics scenario. We believe that the Matthews correlation coefficient should be preferred to accuracy and F1 score by all scientific communities when evaluating binary classification tasks.
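A minimal sketch of the abstract's central point: on an imbalanced dataset, accuracy and F1 can look reassuring while the MCC exposes a near-useless classifier. The toy labels below are purely illustrative.

```python
from sklearn.metrics import accuracy_score, f1_score, matthews_corrcoef

# 90 positives, 10 negatives; the "classifier" simply predicts positive for all.
y_true = [1] * 90 + [0] * 10
y_pred = [1] * 100

print(f"accuracy = {accuracy_score(y_true, y_pred):.2f}")    # 0.90
print(f"F1       = {f1_score(y_true, y_pred):.2f}")          # ~0.95
print(f"MCC      = {matthews_corrcoef(y_true, y_pred):.2f}") # 0.00
```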


Subject(s)
Correlation of Data, Data Interpretation, Statistical, Machine Learning/statistics & numerical data, Algorithms, Computational Biology/statistics & numerical data
11.
BMC Med Inform Decis Mak ; 20(1): 16, 2020 02 03.
Article in English | MEDLINE | ID: mdl-32013925

ABSTRACT

BACKGROUND: Cardiovascular diseases kill approximately 17 million people globally every year, mainly manifesting as myocardial infarctions and heart failures. Heart failure (HF) occurs when the heart cannot pump enough blood to meet the needs of the body. Available electronic medical records of patients quantify symptoms, body features, and clinical laboratory test values, which can be used to perform biostatistical analyses aimed at highlighting patterns and correlations otherwise undetectable by medical doctors. Machine learning, in particular, can predict patients' survival from their data and can identify the most important features among those included in their medical records. METHODS: In this paper, we analyze a dataset of 299 patients with heart failure collected in 2015. We apply several machine learning classifiers both to predict the patients' survival and to rank the features corresponding to the most important risk factors. We also perform an alternative feature ranking analysis by employing traditional biostatistics tests, and compare these results with those provided by the machine learning algorithms. Since both feature ranking approaches clearly identify serum creatinine and ejection fraction as the two most relevant features, we then build the machine learning survival prediction models on these two factors alone. RESULTS: Our results with these two-feature models show not only that serum creatinine and ejection fraction are sufficient to predict the survival of heart failure patients from medical records, but also that using these two features alone can lead to more accurate predictions than using the original dataset features in their entirety. We also carry out an analysis including the follow-up month of each patient: even in this case, serum creatinine and ejection fraction are the most predictive clinical features of the dataset and are sufficient to predict patients' survival. CONCLUSIONS: This discovery has the potential to impact clinical practice, becoming a new supporting tool for physicians predicting whether a heart failure patient will survive. Indeed, medical doctors aiming to understand if a patient will survive after heart failure may focus mainly on serum creatinine and ejection fraction.
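A hedged sketch of the two-feature approach the abstract reports: predicting survival from serum creatinine and ejection fraction alone with a random forest. The CSV file and column names are assumptions for illustration, not the study's actual code.

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import matthews_corrcoef
from sklearn.model_selection import train_test_split

df = pd.read_csv("heart_failure_records.csv")
X = df[["serum_creatinine", "ejection_fraction"]]  # the two top-ranked features
y = df["death_event"]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=42)

model = RandomForestClassifier(n_estimators=500, random_state=42)
model.fit(X_train, y_train)
print(f"test MCC = {matthews_corrcoef(y_test, model.predict(X_test)):.3f}")
```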


Subject(s)
Creatinine/blood, Heart Failure/physiopathology, Machine Learning, Stroke Volume/physiology, Survival Analysis, Adult, Aged, Aged, 80 and over, Algorithms, Databases, Factual, Electronic Health Records/statistics & numerical data, Female, Forecasting, Humans, Male, Middle Aged, Risk Factors
12.
BMC Bioinformatics ; 16 Suppl 6: S4, 2015.
Article in English | MEDLINE | ID: mdl-25916950

ABSTRACT

BACKGROUND: Gene function annotations, which are associations between a gene and a term of a controlled vocabulary describing the gene's functional features, are of paramount importance in modern biology. Datasets of these annotations, such as the ones provided by the Gene Ontology Consortium, are used to design novel biological experiments and interpret their results. Despite their importance, these sources of information have some known issues. They are incomplete, since biological knowledge is far from definitive and rapidly evolves, and some erroneous annotations may be present. Since the curation process of novel annotations is a costly procedure, in both economic and time terms, computational tools that can reliably predict likely annotations, and thus quicken the discovery of new gene annotations, are very useful. METHODS: We used a set of computational algorithms and weighting schemes to infer novel gene annotations from a set of known ones. We used the latent semantic analysis approach, implementing two popular algorithms (Latent Semantic Indexing and Probabilistic Latent Semantic Analysis), and we propose a novel method, the Semantic IMproved Latent Semantic Analysis, which adds a clustering step on the set of considered genes. Furthermore, we propose improving these algorithms by weighting the annotations in the input set. RESULTS: We tested our methods and their weighted variants on the Gene Ontology annotation sets of the genes of three model organisms (Bos taurus, Danio rerio, and Drosophila melanogaster). The methods showed their ability to predict novel gene annotations, and the weighting procedures led to a valuable improvement, although the obtained results vary according to the size of the input annotation set and the considered algorithm. CONCLUSIONS: Of the three considered methods, the Semantic IMproved Latent Semantic Analysis provides the best results. In particular, when coupled with a proper weighting policy, it is able to predict a significant number of novel annotations, proving to be a helpful tool for supporting scientists in the curation of gene functional annotations.
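A minimal sketch of the Latent Semantic Indexing step the abstract builds on: a truncated SVD of the gene-by-term annotation matrix, whose low-rank reconstruction scores candidate (gene, term) pairs as likely annotations. The matrix below is illustrative, not from the paper.

```python
import numpy as np

# Rows = genes, columns = GO terms; 1 means the annotation is known.
A = np.array([[1, 1, 0, 0],
              [1, 1, 1, 0],
              [0, 0, 1, 1],
              [0, 0, 1, 1]], dtype=float)

U, s, Vt = np.linalg.svd(A, full_matrices=False)
k = 2  # number of latent "semantic" dimensions to keep
A_hat = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

# Entries that are 0 in A but score highly in A_hat are predicted annotations.
print(np.round(A_hat, 2))
```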


Subject(s)
Algorithms, Computational Biology/methods, Drosophila Proteins/genetics, Drosophila melanogaster/genetics, Gene Ontology, Molecular Sequence Annotation, Animals, Cluster Analysis
13.
PeerJ Comput Sci ; 10: e1896, 2024.
Article in English | MEDLINE | ID: mdl-38435625

ABSTRACT

Diabetes is a metabolic disorder that affects more than 420 million people worldwide and is caused by the presence of high levels of sugar in the blood for long periods. Diabetes can have serious long-term health consequences, such as cardiovascular diseases, strokes, chronic kidney disease, foot ulcers, and retinopathy, among others. Even if common, this disease is difficult to spot, because it often comes with no symptoms. Especially for type 2 diabetes, which occurs mainly in adults, knowing how long a patient has had diabetes can strongly affect the treatment they receive. This information, although pivotal, might be absent: for some patients, the year when they received the diabetes diagnosis might be well known, but the year of disease onset might be unknown. In this context, machine learning applied to electronic health records can be an effective tool to predict the past duration of diabetes for a patient. In this study, we applied regression analysis based on several computational intelligence methods to a dataset of electronic health records of 73 patients with type 1 diabetes, described by 20 variables, and another dataset of records of 400 patients with type 2 diabetes, described by 49 variables. Among the algorithms applied, Random Forests outperformed the others and efficiently predicted diabetes duration for both cohorts, with the regression performance measured through the coefficient of determination R². Afterwards, we applied the same method for feature ranking and detected the factors of the clinical records most correlated with past diabetes duration: age, insulin intake, and body mass index. Our discoveries can have a profound impact on clinical practice: when information about the duration of a patient's diabetes is missing, medical doctors can use our tool and focus on age, insulin intake, and body mass index to infer this important aspect. Regarding limitations, we were unfortunately unable to find an additional dataset of EHRs of patients with diabetes having the same variables as the two analyzed here, so we could not verify our findings on a validation cohort.
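A hedged sketch of the analysis the abstract describes: Random Forest regression of past diabetes duration from EHR variables, evaluated through R², followed by feature ranking. The file and column names are hypothetical.

```python
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

df = pd.read_csv("diabetes_ehr.csv")
X = df.drop(columns=["diabetes_duration_years"])
y = df["diabetes_duration_years"]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3,
                                                    random_state=42)
model = RandomForestRegressor(n_estimators=500, random_state=42)
model.fit(X_train, y_train)
print(f"R² = {r2_score(y_test, model.predict(X_test)):.3f}")

# Feature ranking: the fitted model exposes impurity-based importances.
ranking = sorted(zip(X.columns, model.feature_importances_),
                 key=lambda t: t[1], reverse=True)
print(ranking[:3])  # in the study: age, insulin intake, body mass index
```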

14.
BioData Min ; 17(1): 28, 2024 Sep 03.
Article in English | MEDLINE | ID: mdl-39227987

ABSTRACT

Pangenomics is a relatively new scientific field which investigates the union of all the genomes of a clade. The word pan means everything in ancient Greek; the term pangenomics originally referred to the genomes of bacteria and was later extended to human genomes as well. Modern bioinformatics offers several tools to analyze pangenomics data, paving the way to an emerging field that we can call computational pangenomics. The computational power currently available to the bioinformatics community has made computational pangenomic analyses easy to perform, but this higher accessibility also increases the chances of making mistakes and producing misleading or inflated results, especially for beginners. To address this problem, we present here a few quick tips for efficient and correct computational pangenomic analyses, with a focus on bacterial pangenomics, describing common mistakes to avoid and experienced best practices to follow in this field. We believe our recommendations can help readers perform more robust and sound pangenomic analyses and generate more reliable results.
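A minimal sketch of the central data structure of bacterial pangenomics: a gene presence/absence matrix, from which core genes (shared by all genomes) and accessory genes (present in only some) are counted. The matrix here is a toy illustration; pangenome pipelines produce real ones from assembled genomes.

```python
import pandas as pd

# Rows = gene clusters, columns = genomes; 1 = gene present in that genome.
pa = pd.DataFrame({"genome_A": [1, 1, 1, 0],
                   "genome_B": [1, 1, 0, 1],
                   "genome_C": [1, 1, 0, 0]},
                  index=["geneW", "geneX", "geneY", "geneZ"])

n_genomes = pa.shape[1]
core = pa.index[pa.sum(axis=1) == n_genomes]      # present in every genome
accessory = pa.index[pa.sum(axis=1) < n_genomes]  # present in only some

print(f"core: {list(core)}, accessory: {list(accessory)}")
```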

15.
PeerJ Comput Sci ; 10: e2256, 2024.
Article in English | MEDLINE | ID: mdl-39314688

ABSTRACT

Electroencephalography (EEG) is a medical engineering technique for recording the electrical activity of the human brain. Brain signals derived from an EEG device can be processed and analyzed on computers using digital signal processing, computational statistics, and machine learning techniques, which can lead to scientifically relevant results and outcomes about how the brain works. In the last decades, the spread of EEG devices and the higher availability of EEG data, computational resources, and software packages for electroencephalography analysis have made EEG signal processing easier and faster to perform for any researcher worldwide. This increased ease of carrying out computational analyses of EEG data, however, has made it easier to make mistakes as well. These mistakes, if unnoticed or handled incorrectly, can in turn lead to wrong results or misleading outcomes, with worrisome consequences for patients and for the advancement of knowledge about the human brain. To tackle this problem, we present here our ten quick tips to perform electroencephalography signal processing analyses while avoiding common mistakes: a short list of guidelines designed for beginners on what to do, how to do it, and what not to do when analyzing EEG data with a computer. We believe that following our quick recommendations can lead to better, more reliable, and more robust results and outcomes in clinical neuroscientific research.
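A minimal sketch of a routine EEG processing step of the kind the abstract refers to: band-pass filtering a raw trace to the alpha band (8-12 Hz). The signal here is synthetic, and the sampling rate is an assumption.

```python
import numpy as np
from scipy.signal import butter, filtfilt

fs = 250  # sampling frequency in Hz
t = np.arange(0, 10, 1 / fs)
# Synthetic trace: 10 Hz alpha rhythm buried in 50 Hz line noise and slow drift.
raw = np.sin(2 * np.pi * 10 * t) + 0.5 * np.sin(2 * np.pi * 50 * t) + 0.3 * t

# 4th-order Butterworth band-pass; filtfilt applies it with zero phase shift.
b, a = butter(4, [8, 12], btype="bandpass", fs=fs)
alpha = filtfilt(b, a, raw)
print(alpha[:5])
```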

16.
J Healthc Inform Res ; 8(1): 1-18, 2024 Mar.
Article in English | MEDLINE | ID: mdl-38273986

ABSTRACT

Glioblastoma multiforme (GM) is a malignant tumor of the central nervous system considered to be highly aggressive and often carrying a very poor survival prognosis. An accurate prognosis is therefore pivotal for deciding on a good treatment plan for patients. In this context, computational intelligence applied to electronic health record (EHR) data of patients diagnosed with this disease can be useful to predict patients' survival time. In this study, we evaluated different machine learning models to predict survival time in patients suffering from glioblastoma and further investigated which features were the most predictive of survival time. We applied our computational methods to three independent open datasets of EHRs of patients with glioblastoma: the Shieh dataset of 84 patients, the Berendsen dataset of 647 patients, and the Lammer dataset of 60 patients. Our survival time prediction techniques obtained, as best results, a concordance index (C-index) of 0.583 in the Shieh dataset, 0.776 in the Berendsen dataset, and 0.64 in the Lammer dataset. Since the original studies regarding these three datasets did not provide insights about the clinical features most predictive of survival time, we investigated feature importance across the datasets. To this end, we utilized Random Survival Forests, a decision-tree-based algorithm able to model non-linear interactions between features, which might better capture the highly complex clinical and genetic status of these patients. Our discoveries can impact clinical practice, aiding clinicians and patients alike in deciding which therapy plan is best suited for their unique clinical status.
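A worked sketch of the concordance index (C-index) the abstract uses to score survival predictions: among comparable patient pairs, the fraction where the model assigns higher risk to the patient who experienced the event earlier. The toy data below are illustrative.

```python
import numpy as np

def c_index(time, event, risk):
    """time: observed times; event: 1 if death observed; risk: predicted risk."""
    concordant, comparable = 0.0, 0
    n = len(time)
    for i in range(n):
        for j in range(n):
            # A pair is comparable if i had the event before j's observed time.
            if event[i] == 1 and time[i] < time[j]:
                comparable += 1
                if risk[i] > risk[j]:
                    concordant += 1
                elif risk[i] == risk[j]:
                    concordant += 0.5  # ties count as half-concordant
    return concordant / comparable

time = np.array([5, 8, 12, 20])
event = np.array([1, 1, 0, 1])
risk = np.array([0.9, 0.6, 0.4, 0.2])
print(f"C-index = {c_index(time, event, risk):.2f}")  # 1.00: fully concordant
```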

17.
PeerJ Comput Sci ; 10: e2295, 2024.
Article in English | MEDLINE | ID: mdl-39314696

ABSTRACT

The electrocardiogram (ECG) is a powerful tool to measure the electrical activity of the heart, and the analysis of its data can be useful to assess a patient's health. In particular, the computational analysis of electrocardiogram data, also called ECG signal processing, can reveal specific patterns or heart cycle trends which would otherwise go unnoticed by medical experts. When performing ECG signal processing, however, it is easy to make mistakes and generate inflated, overoptimistic, or misleading results, which can lead to wrong diagnoses or prognoses and, in turn, even contribute to bad medical decisions, damaging the health of the patient. Therefore, to avoid common mistakes and bad practices, we present here ten easy guidelines to follow when analyzing electrocardiogram data computationally. Our ten recommendations, written in a simple way, can be useful to anyone performing a computational study based on ECG data and eventually lead to better, more robust medical results.
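A minimal sketch of a classic ECG processing step: locating R peaks with a simple peak detector and estimating heart rate from them. The signal here is synthetic, and the height and refractory-distance thresholds are assumptions that would need tuning on real recordings.

```python
import numpy as np
from scipy.signal import find_peaks

fs = 360  # sampling frequency in Hz
t = np.arange(0, 10, 1 / fs)
# Crude synthetic ECG: one sharp "R peak" per second (60 bpm) plus noise.
ecg = np.zeros_like(t)
ecg[::fs] = 1.0
ecg += 0.05 * np.random.default_rng(0).standard_normal(len(t))

# Require a minimum height and a refractory distance of 0.4 s between peaks.
peaks, _ = find_peaks(ecg, height=0.5, distance=int(0.4 * fs))
heart_rate = 60 * fs / np.median(np.diff(peaks))
print(f"{len(peaks)} R peaks, estimated heart rate ≈ {heart_rate:.0f} bpm")
```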

18.
PLOS Digit Health ; 3(3): e0000459, 2024 Mar.
Article in English | MEDLINE | ID: mdl-38489347

ABSTRACT

BACKGROUND: Systemic inflammatory response syndrome (SIRS) and sepsis are the most common causes of in-hospital death. However, the characteristics associated with improvement in patient condition during the ICU stay have not been fully elucidated for each population, nor have the possible differences between the two. GOAL: The aim of this study is to highlight the differences between the prognostic clinical features for the survival of patients diagnosed with SIRS and those of patients diagnosed with sepsis, using a multivariable predictive modeling approach with a reduced set of easily available measurements collected at admission to the intensive care unit (ICU). METHODS: Data were collected from 1,257 patients (816 non-sepsis SIRS and 441 sepsis) admitted to the ICU. We compared the performance of five machine learning models in predicting patient survival. The Matthews correlation coefficient (MCC) was used to evaluate model performance and feature importance, applying Monte Carlo stratified cross-validation. RESULTS: Extreme Gradient Boosting (MCC = 0.489) and Logistic Regression (MCC = 0.533) achieved the highest results for the SIRS and sepsis cohorts, respectively. In order of importance, APACHE II, mean platelet volume (MPV), eosinophil counts (EoC), and C-reactive protein (CRP) showed higher importance for predicting sepsis patient survival, whereas SOFA, APACHE II, platelet counts (PLTC), and CRP obtained higher importance in the SIRS cohort. CONCLUSION: By using complete blood count parameters as predictors of ICU patient survival, machine learning models can accurately predict the survival of SIRS and sepsis ICU patients. Interestingly, feature importance highlights the role of CRP and APACHE II in both the SIRS and sepsis populations. In addition, MPV and EoC are shown to be important features for the sepsis population only, whereas SOFA and PLTC have higher importance for SIRS patients.
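A hedged sketch of the evaluation scheme the abstract names, Monte Carlo stratified cross-validation scored with the MCC. The synthetic data stand in for the ICU admission features and survival labels, which are not reproduced here.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import matthews_corrcoef
from sklearn.model_selection import StratifiedShuffleSplit

# Synthetic stand-in for the cohort: 1,257 patients, imbalanced classes.
X, y = make_classification(n_samples=1257, weights=[0.65], random_state=0)

# Monte Carlo stratified CV: many random stratified train/test splits.
splitter = StratifiedShuffleSplit(n_splits=100, test_size=0.2, random_state=0)
scores = []
for train_idx, test_idx in splitter.split(X, y):
    model = LogisticRegression(max_iter=1000).fit(X[train_idx], y[train_idx])
    scores.append(matthews_corrcoef(y[test_idx], model.predict(X[test_idx])))

print(f"MCC = {np.mean(scores):.3f} ± {np.std(scores):.3f} over 100 splits")
```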

19.
BioData Min ; 16(1): 4, 2023 Feb 17.
Article in English | MEDLINE | ID: mdl-36800973

ABSTRACT

Binary classification is a common task for which machine learning and computational statistics are used, and the area under the receiver operating characteristic curve (ROC AUC) has become the standard metric to evaluate binary classifications in most scientific fields. The ROC curve has the true positive rate (also called sensitivity or recall) on the y axis and the false positive rate on the x axis, and the ROC AUC can range from 0 (worst result) to 1 (perfect result). The ROC AUC, however, has several flaws and drawbacks. This score is generated including predictions that obtained insufficient sensitivity and specificity, and moreover it says nothing about the positive predictive value (also known as precision) or the negative predictive value (NPV) obtained by the classifier, therefore potentially generating inflated, overoptimistic results. Since it is common to report the ROC AUC alone, without precision and negative predictive value, a researcher might erroneously conclude that their classification was successful. Furthermore, a given point in ROC space does not identify a single confusion matrix, nor a group of matrices sharing the same MCC value. Indeed, a given (sensitivity, specificity) pair can cover a broad MCC range, which casts doubt on the reliability of the ROC AUC as a performance measure. In contrast, the Matthews correlation coefficient (MCC) generates a high score in its [-1, +1] interval only if the classifier scored a high value for all four basic rates of the confusion matrix: sensitivity, specificity, precision, and negative predictive value. A high MCC (for example, MCC ≥ 0.9), moreover, always corresponds to a high ROC AUC, but not vice versa. In this short study, we explain why the Matthews correlation coefficient should replace the ROC AUC as the standard statistic in all scientific studies involving a binary classification, in all scientific fields.
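A worked sketch of the abstract's claim that one (sensitivity, specificity) point in ROC space corresponds to many different MCC values: fixing sensitivity = 0.9 and specificity = 0.8 while varying class prevalence changes the MCC markedly. The rates and prevalences below are illustrative.

```python
from math import sqrt

def mcc_from_rates(sens, spec, prevalence, n=10000):
    # Reconstruct the confusion matrix implied by the rates and prevalence.
    TP = sens * prevalence * n
    FN = (1 - sens) * prevalence * n
    TN = spec * (1 - prevalence) * n
    FP = (1 - spec) * (1 - prevalence) * n
    return (TP * TN - FP * FN) / sqrt(
        (TP + FP) * (TP + FN) * (TN + FP) * (TN + FN))

for prevalence in (0.5, 0.1, 0.01):
    print(f"prevalence {prevalence:>4}: "
          f"MCC = {mcc_from_rates(0.9, 0.8, prevalence):.3f}")
# Same ROC point, yet MCC drops from ~0.70 to ~0.17 as prevalence shrinks.
```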

20.
BioData Min ; 16(1): 6, 2023 Feb 23.
Article in English | MEDLINE | ID: mdl-36823520

ABSTRACT

Bioinformatics has become a key aspect of the biomedical research programmes of many hospitals' scientific centres, and the establishment of bioinformatics facilities within hospitals has become common practice worldwide. Bioinformaticians working in these facilities provide computational biology support to medical doctors and principal investigators who deal daily with patient data to analyze. These bioinformatics analysts, although pivotal, usually do not receive formal training for this job. We therefore propose these ten simple rules to guide these bioinformaticians in their work: ten pieces of advice on how to provide bioinformatics support to medical doctors in hospitals. We believe these simple rules can help bioinformatics facility analysts produce better scientific results and work in a serene and fruitful environment.
