Results 1 - 20 of 62
1.
Cancer Sci ; 114(1): 281-294, 2023 Jan.
Article in English | MEDLINE | ID: mdl-36114746

ABSTRACT

Emerging evidence suggests that the prognosis of patients with lung adenocarcinoma can be determined from germline variants and transcript levels in nontumoral lung tissue. Gene expression data from noninvolved lung tissue of 483 lung adenocarcinoma patients were tested for correlation with overall survival using multivariable Cox proportional hazard and multivariate machine learning models. For genes whose transcript levels are associated with survival, we used genotype data from 414 patients to identify germline variants acting as cis-expression quantitative trait loci (eQTLs). Associations of eQTL variant genotypes with gene expression and survival were tested. Levels of four transcripts were inversely associated with survival by Cox analysis (CLCF1, hazard ratio [HR] = 1.53; CNTNAP1, HR = 2.17; DUSP14, HR = 1.78; and MT1F, HR = 1.40). Machine learning analysis identified a signature of transcripts associated with lung adenocarcinoma outcome that largely overlapped with the transcripts identified by Cox analysis, including the three most significant genes (CLCF1, CNTNAP1, and DUSP14). Pathway analysis indicated that the signature is enriched for ECM components. We identified 32 cis-eQTLs for CNTNAP1, including 6 with an inverse correlation and 26 with a direct correlation between the number of minor alleles and transcript levels. Of these, all but one were prognostic: the six with an inverse correlation were associated with better prognosis (HR < 1) while the others were associated with worse prognosis. Our findings provide supportive evidence that genetic predisposition to lung adenocarcinoma outcome is a feature already present in patients' noninvolved lung tissue.


Subjects
Adenocarcinoma of Lung , Lung Neoplasms , Humans , Genetic Predisposition to Disease , Adenocarcinoma of Lung/genetics , Lung/pathology , Genotype , Lung Neoplasms/genetics , Lung Neoplasms/pathology , Prognosis , Single Nucleotide Polymorphism
2.
J Biomed Inform ; 144: 104426, 2023 08.
Article in English | MEDLINE | ID: mdl-37352899

ABSTRACT

Although assessing binary classifications is a common task in scientific research, no consensus on a single statistic summarizing the confusion matrix has been reached so far. In recent studies, we demonstrated the advantages of the Matthews correlation coefficient (MCC) over other popular rates such as cross-entropy error, F1 score, accuracy, balanced accuracy, bookmaker informedness, diagnostic odds ratio, Brier score, and Cohen's kappa. In this study, we compared the MCC to two other statistics: prevalence threshold (PT), frequently used in obstetrics and gynecology, and the Fowlkes-Mallows index, a metric employed in fuzzy logic and drug discovery. Through the investigation of the mutual relations among the three metrics and the study of some relevant use cases, we show that, when positive data elements and negative data elements have the same importance, the Matthews correlation coefficient can be more informative than its two competitors, even this time.


Subjects
Algorithms , Fuzzy Logic , Prevalence , Drug Discovery , Entropy
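
The two statistics compared with the MCC in the abstract above can be computed directly from a confusion matrix. The following is a minimal sketch using the standard textbook definitions of prevalence threshold and Fowlkes-Mallows index; the function names and the example counts are hypothetical and are not taken from the paper.

```python
import math

def prevalence_threshold(tp, fn, tn, fp):
    """PT = sqrt(FPR) / (sqrt(TPR) + sqrt(FPR)); undefined if TPR and FPR are both 0."""
    tpr = tp / (tp + fn)  # sensitivity (recall)
    fpr = fp / (fp + tn)  # 1 - specificity
    return math.sqrt(fpr) / (math.sqrt(tpr) + math.sqrt(fpr))

def fowlkes_mallows(tp, fn, tn, fp):
    """FM = sqrt(precision * recall); tn is accepted but never used by this index."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return math.sqrt(precision * recall)

# Hypothetical imbalanced confusion matrix: 95 positives, 5 negatives.
counts = dict(tp=90, fn=5, tn=1, fp=4)
pt = prevalence_threshold(**counts)
fm = fowlkes_mallows(**counts)
print(f"PT = {pt:.3f}, FM = {fm:.3f}")
```

Note that the Fowlkes-Mallows index ignores the true negatives entirely, which is one way to see why it can stay high on a matrix where almost every negative is misclassified.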
3.
BMC Med Inform Decis Mak ; 22(Suppl 6): 300, 2022 11 18.
Article in English | MEDLINE | ID: mdl-36401328

ABSTRACT

BACKGROUND: The SI-CURA project (Soluzioni Innovative per la gestione del paziente e il follow up terapeutico della Colite UlceRosA) is an Italian initiative aimed at the development of artificial intelligence solutions to discriminate pathologies of different nature, including inflammatory bowel disease (IBD), namely Ulcerative Colitis (UC) and Crohn's disease (CD), based on endoscopic imaging of patients (P) and healthy controls (N). METHODS: In this study we develop a deep learning (DL) prototype to identify disease patterns through three binary classification tasks, namely (1) discriminating positive (pathological) samples from negative (healthy) samples (P vs N); (2) discrimination between Ulcerative Colitis and Crohn's Disease samples (UC vs CD) and, (3) discrimination between Ulcerative Colitis and negative (healthy) samples (UC vs N). RESULTS: The model derived from our approach achieves a high performance of Matthews correlation coefficient (MCC) > 0.9 on the test set for P versus N and UC versus N, and MCC > 0.6 on the test set for UC versus CD. CONCLUSION: Our DL model effectively discriminates between pathological and negative samples, as well as between IBD subgroups, providing further evidence of its potential as a decision support tool for endoscopy-based diagnosis.


Subjects
Ulcerative Colitis , Crohn Disease , Inflammatory Bowel Diseases , Humans , Ulcerative Colitis/diagnostic imaging , Ulcerative Colitis/pathology , Crohn Disease/diagnostic imaging , Crohn Disease/pathology , Artificial Intelligence , Endoscopy
4.
Int J Mol Sci ; 22(16), 2021 Aug 16.
Article in English | MEDLINE | ID: mdl-34445517

ABSTRACT

We introduce here a novel machine learning (ML) framework to address the issue of the quantitative assessment of the immune content in neuroblastoma (NB) specimens. First, the EUNet, a U-Net with an EfficientNet encoder, is trained to detect lymphocytes on tissue digital slides stained with the CD3 T-cell marker. The training set consists of 3782 images extracted from an original collection of 54 whole slide images (WSIs), manually annotated for a total of 73,751 lymphocytes. Resampling strategies, data augmentation, and transfer learning approaches are adopted to warrant reproducibility and to reduce the risk of overfitting and selection bias. Topological data analysis (TDA) is then used to define activation maps from different layers of the neural network at different stages of the training process, described by persistence diagrams (PD) and Betti curves. TDA is further integrated with the uniform manifold approximation and projection (UMAP) dimensionality reduction and the hierarchical density-based spatial clustering of applications with noise (HDBSCAN) algorithm for clustering, by the deep features, the relevant subgroups and structures, across different levels of the neural network. Finally, the recent TwoNN approach is leveraged to study the variation of the intrinsic dimensionality of the U-Net model. As the main task, the proposed pipeline is employed to evaluate the density of lymphocytes over the whole tissue area of the WSIs. The model achieves good results, with a mean absolute error of 3.1 on the test set, showing significant agreement between densities estimated by our EUNet model and by trained pathologists, thus indicating the potential of a promising new strategy in the quantification of the immune content in NB specimens. Moreover, the UMAP algorithm unveiled interesting patterns compatible with pathological characteristics, also highlighting novel insights into the dynamics of the intrinsic dataset dimensionality at different stages of the training process. All the experiments were run on the Microsoft Azure cloud platform.


Subjects
Computer-Assisted Image Interpretation/methods , Neuroblastoma/immunology , Cloud Computing , Deep Learning , Female , Humans , Lymphocytes/metabolism , Male , Computer Neural Networks , Neuroblastoma/diagnostic imaging
5.
BMC Genomics ; 21(1): 6, 2020 Jan 02.
Article in English | MEDLINE | ID: mdl-31898477

ABSTRACT

BACKGROUND: To evaluate binary classifications and their confusion matrices, scientific researchers can employ several statistical rates, according to the goal of the experiment they are investigating. Although this is a crucial issue in machine learning, no widespread consensus has yet been reached on a single measure of choice. Accuracy and F1 score computed on confusion matrices have been (and still are) among the most popular metrics adopted in binary classification tasks. However, these statistical measures can dangerously show overoptimistic inflated results, especially on imbalanced datasets. RESULTS: The Matthews correlation coefficient (MCC), instead, is a more reliable statistical rate which produces a high score only if the prediction obtained good results in all of the four confusion matrix categories (true positives, false negatives, true negatives, and false positives), proportionally both to the size of positive elements and the size of negative elements in the dataset. CONCLUSIONS: In this article, we show how MCC produces a more informative and truthful score in evaluating binary classifications than accuracy and F1 score, by first explaining the mathematical properties, and then the advantages of MCC in six synthetic use cases and in a real genomics scenario. We believe that the Matthews correlation coefficient should be preferred to accuracy and F1 score in evaluating binary classification tasks by all scientific communities.


Subjects
Correlation of Data , Statistical Data Interpretation , Machine Learning/statistics & numerical data , Algorithms , Computational Biology/statistics & numerical data
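
The contrast drawn in the abstract above between accuracy, F1 score, and MCC on imbalanced data can be illustrated with a few lines of code. This is a minimal sketch using the standard formulas; the function name and the example counts are hypothetical and do not reproduce any of the paper's use cases.

```python
import math

def confusion_metrics(tp, fn, tn, fp):
    """Return (accuracy, F1 score, MCC) for one binary confusion matrix."""
    total = tp + fn + tn + fp
    accuracy = (tp + tn) / total
    f1_denominator = 2 * tp + fp + fn
    f1 = 2 * tp / f1_denominator if f1_denominator else 0.0
    mcc_denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # MCC is undefined when a whole row or column of the matrix is empty;
    # returning 0.0 in that case is a common convention, assumed here.
    mcc = (tp * tn - fp * fn) / mcc_denominator if mcc_denominator else 0.0
    return accuracy, f1, mcc

# Hypothetical imbalanced dataset: 95 positives and 5 negatives, with a
# classifier that labels almost every sample as positive.
accuracy, f1, mcc = confusion_metrics(tp=90, fn=5, tn=1, fp=4)
print(f"accuracy = {accuracy:.2f}, F1 = {f1:.2f}, MCC = {mcc:.2f}")
```

On this example accuracy and F1 are both above 0.9 while the MCC stays close to zero, because only one of the five negative samples is classified correctly.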
6.
PLoS Comput Biol ; 15(3): e1006269, 2019 03.
Article in English | MEDLINE | ID: mdl-30917113

ABSTRACT

Artificial Intelligence is exponentially increasing its impact on healthcare. As deep learning is mastering computer vision tasks, its application to digital pathology is natural, with the promise of aiding in routine reporting and standardizing results across trials. Deep learning features inferred from digital pathology scans can improve validity and robustness of current clinico-pathological features, up to identifying novel histological patterns, e.g., from tumor infiltrating lymphocytes. In this study, we examine the issue of evaluating accuracy of predictive models from deep learning features in digital pathology, as a hallmark of reproducibility. We introduce the DAPPER framework for validation based on a rigorous Data Analysis Plan derived from the FDA's MAQC project, designed to analyze causes of variability in predictive biomarkers. We apply the framework on models that identify tissue of origin on 787 Whole Slide Images from the Genotype-Tissue Expression (GTEx) project. We test three different deep learning architectures (VGG, ResNet, Inception) as feature extractors and three classifiers (a fully connected multilayer, Support Vector Machine and Random Forests) and work with four datasets (5, 10, 20 or 30 classes), for a total of 53,000 tiles at 512 × 512 resolution. We analyze accuracy and feature stability of the machine learning classifiers, also demonstrating the need for diagnostic tests (e.g., random labels) to identify selection bias and risks for reproducibility. Further, we use the deep features from the VGG model from GTEx on the KIMIA24 dataset for identification of slide of origin (24 classes) to train a classifier on 1,060 annotated tiles and validate it on 265 unseen ones. The DAPPER software, including its deep learning pipeline and the Histological Imaging-Newsy Tiles (HINT) benchmark dataset derived from GTEx, is released as a basis for standardization and validation initiatives in AI for digital pathology.


Subjects
Algorithms , Artificial Intelligence , Histological Techniques/methods , Computer-Assisted Image Interpretation/methods , Software , Humans , Reproducibility of Results
7.
BMC Med Inform Decis Mak ; 20(1): 16, 2020 02 03.
Article in English | MEDLINE | ID: mdl-32013925

ABSTRACT

BACKGROUND: Cardiovascular diseases kill approximately 17 million people globally every year, and they mainly present as myocardial infarctions and heart failures. Heart failure (HF) occurs when the heart cannot pump enough blood to meet the needs of the body. Available electronic medical records of patients quantify symptoms, body features, and clinical laboratory test values, which can be used to perform biostatistics analysis aimed at highlighting patterns and correlations otherwise undetectable by medical doctors. Machine learning, in particular, can predict patients' survival from their data and can identify the most important features among those included in their medical records. METHODS: In this paper, we analyze a dataset of 299 patients with heart failure collected in 2015. We apply several machine learning classifiers to both predict the patients' survival and rank the features corresponding to the most important risk factors. We also perform an alternative feature ranking analysis by employing traditional biostatistics tests, and compare these results with those provided by the machine learning algorithms. Since both feature ranking approaches clearly identify serum creatinine and ejection fraction as the two most relevant features, we then build the machine learning survival prediction models on these two factors alone. RESULTS: Our results of these two-feature models show not only that serum creatinine and ejection fraction are sufficient to predict survival of heart failure patients from medical records, but also that using these two features alone can lead to more accurate predictions than using the original dataset features in their entirety. We also carry out an analysis including the follow-up month of each patient: even in this case, serum creatinine and ejection fraction are the most predictive clinical features of the dataset, and are sufficient to predict patients' survival.
CONCLUSIONS: This discovery has the potential to impact on clinical practice, becoming a new supporting tool for physicians when predicting if a heart failure patient will survive or not. Indeed, medical doctors aiming at understanding if a patient will survive after heart failure may focus mainly on serum creatinine and ejection fraction.


Subjects
Creatinine/blood , Heart Failure/physiopathology , Machine Learning , Stroke Volume/physiology , Survival Analysis , Adult , Aged , Aged 80 and over , Algorithms , Factual Databases , Electronic Health Records/statistics & numerical data , Female , Forecasting , Humans , Male , Middle Aged , Risk Factors
8.
BMC Bioinformatics ; 19(Suppl 2): 49, 2018 03 08.
Article in English | MEDLINE | ID: mdl-29536822

ABSTRACT

BACKGROUND: Convolutional Neural Networks can be effectively used only when data are endowed with an intrinsic concept of neighbourhood in the input space, as is the case for pixels in images. We introduce here Ph-CNN, a novel deep learning architecture for the classification of metagenomics data based on Convolutional Neural Networks, with the patristic distance defined on the phylogenetic tree being used as the proximity measure. The patristic distance between variables is used together with a sparsified version of MultiDimensional Scaling to embed the phylogenetic tree in a Euclidean space. RESULTS: Ph-CNN is tested with a domain adaptation approach on synthetic data and on a metagenomics collection of gut microbiota of 38 healthy subjects and 222 Inflammatory Bowel Disease patients, divided into 6 subclasses. Classification performance is promising when compared to classical algorithms like Support Vector Machines and Random Forest and a baseline fully connected neural network, e.g. the Multi-Layer Perceptron. CONCLUSION: Ph-CNN represents a novel deep learning approach for the classification of metagenomics data. Operatively, the algorithm has been implemented as a custom Keras layer taking care of passing to the following convolutional layer not only the data but also the ranked list of neighbourhood of each sample, thus mimicking the case of image data, transparently to the user.


Subjects
Metagenomics , Computer Neural Networks , Phylogeny , Algorithms , Data Analysis , Genetic Databases , Humans , Inflammatory Bowel Diseases/genetics , Principal Component Analysis , Reproducibility of Results , Support Vector Machine
9.
Article in English | MEDLINE | ID: mdl-30628533

ABSTRACT

We introduce here ML4Tox, a framework offering Deep Learning and Support Vector Machine models to predict agonist, antagonist, and binding activities of chemical compounds, in this case for the estrogen receptor ligand-binding domain. The ML4Tox models have been developed with a 10 × 5-fold cross-validation schema on the training portion of the CERAPP ToxCast dataset, formed by 1677 chemicals, each described by 777 molecular features. On the CERAPP "All Literature" evaluation set (agonist: 6319 compounds; antagonist 6539; binding 7283), ML4Tox significantly improved sensitivity over published results on all three tasks, with agonist: 0.78 vs 0.56; antagonist: 0.69 vs 0.11; binding: 0.66 vs 0.26.


Subjects
Computer Simulation , Endocrine Disruptors/toxicity , Environmental Pollutants/toxicity , Machine Learning , Toxicity Tests/methods , Protein Binding , Quantitative Structure-Activity Relationship , Estrogen Receptors , Support Vector Machine
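
The "10 × 5-fold cross-validation schema" mentioned in the abstract above is usually read as 5-fold cross-validation repeated 10 times with a fresh shuffle each time. The sketch below generates such splits under that reading; the function name, the seed, and the use of plain (non-stratified) shuffling are assumptions, since the abstract does not describe how the folds were built.

```python
import random

def repeated_kfold(n_samples, n_repeats=10, n_folds=5, seed=0):
    """Yield (repeat, fold, train_indices, test_indices) for repeated k-fold CV."""
    rng = random.Random(seed)
    for repeat in range(n_repeats):
        indices = list(range(n_samples))
        rng.shuffle(indices)  # new random partition for every repetition
        folds = [indices[i::n_folds] for i in range(n_folds)]
        for fold, test in enumerate(folds):
            train = [i for j, other in enumerate(folds) if j != fold for i in other]
            yield repeat, fold, train, test

# 1677 is the number of training chemicals reported in the abstract.
splits = list(repeated_kfold(n_samples=1677))
print(len(splits), "train/test splits")
```

Each of the resulting splits would be used to fit a model on the training indices and score it on the held-out fold, with the scores then summarized across all repetitions.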
10.
BMC Bioinformatics ; 17(1): 542, 2016 Dec 20.
Article in English | MEDLINE | ID: mdl-27998275

ABSTRACT

BACKGROUND: Networks are popular and powerful tools to describe and model biological processes. Many computational methods have been developed to infer biological networks from literature, high-throughput experiments, and combinations of both. Additionally, a wide range of tools has been developed to map experimental data onto reference biological networks, in order to extract meaningful modules. Many of these methods assess results' significance against null distributions of randomized networks. However, these standard unconstrained randomizations do not preserve the functional characterization of the nodes in the reference networks (i.e. their degrees and connection signs), hence including potential biases in the assessment. RESULTS: Building on our previous work about rewiring bipartite networks, we propose a method for rewiring any type of unweighted network. In particular we formally demonstrate that the problem of rewiring a signed and directed network preserving its functional connectivity (F-rewiring) reduces to the problem of rewiring two induced bipartite networks. Additionally, we reformulate the lower bound on the number of iterations of the switching-algorithm to make it suitable for the F-rewiring of networks of any size. Finally, we present BiRewire3, an open-source Bioconductor package enabling the F-rewiring of any type of unweighted network. We illustrate its application to a case study about the identification of modules from gene expression data mapped on protein interaction networks, and a second one focused on building logic models from more complex signed-directed reference signaling networks and phosphoproteomic data. CONCLUSIONS: BiRewire3 is freely available at https://www.bioconductor.org/packages/BiRewire/ , and it should have a broad application as it allows an efficient and analytically derived statistical assessment of results from any network biology tool.


Subjects
Computational Biology/methods , Biological Models , Algorithms , Statistical Data Interpretation , Gene Regulatory Networks , Humans , Protein Interaction Maps , Random Allocation , Software
11.
Bioinformatics ; 30(17): i617-23, 2014 Sep 01.
Article in English | MEDLINE | ID: mdl-25161255

ABSTRACT

MOTIVATION: Studying combinatorial patterns in cancer genomic datasets has recently emerged as a tool for identifying novel cancer driver networks. Approaches have been devised to quantify, for example, the tendency of a set of genes to be mutated in a 'mutually exclusive' manner. The significance of the proposed metrics is usually evaluated by computing P-values under appropriate null models. To this end, a Monte Carlo method (the switching-algorithm) is used to sample simulated datasets under a null model that preserves patient- and gene-wise mutation rates. In this method, a genomic dataset is represented as a bipartite network, to which Markov chain updates (switching-steps) are applied. These steps modify the network topology, and a minimal number of them must be executed to draw simulated datasets independently under the null model. This number has previously been deduced empirically to be a linear function of the total number of variants, making this process computationally expensive. RESULTS: We present a novel approximate lower bound for the number of switching-steps, derived analytically. Additionally, we have developed the R package BiRewire, including new efficient implementations of the switching-algorithm. We illustrate the performance of BiRewire by applying it to large real cancer genomics datasets. We report vast reductions in time requirement, with respect to existing implementations/bounds and equivalent P-value computations. Thus, we propose BiRewire to study statistical properties in genomic datasets, and other data that can be modeled as bipartite networks. AVAILABILITY AND IMPLEMENTATION: BiRewire is available on BioConductor at http://www.bioconductor.org/packages/2.13/bioc/html/BiRewire.html. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.


Subjects
Genomics/methods , Algorithms , Humans , Markov Chains , Monte Carlo Method , Neoplasms/genetics , Random Allocation , Software
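
The switching-step described in the abstract above can be sketched as a degree-preserving edge swap on a bipartite network. The code below is an illustration in Python, not the BiRewire R implementation; the function names, the toy edge set, and the value of `n_steps` are hypothetical, and the analytical lower bound from the paper is not reproduced because the abstract does not state it.

```python
import random
from collections import Counter

def switching_step(edges, rng):
    """Attempt one switching-step on a set of (row, column) edges, in place.

    Returns True if the swap was applied and False if it was rejected.
    """
    (a, b), (c, d) = rng.sample(sorted(edges), 2)
    # Swapping (a, b), (c, d) into (a, d), (c, b) must create two new edges.
    if a == c or b == d or (a, d) in edges or (c, b) in edges:
        return False
    edges.difference_update({(a, b), (c, d)})
    edges.update({(a, d), (c, b)})
    return True

def degrees(edges):
    """Return per-row and per-column degree counts of the bipartite network."""
    rows = Counter(row for row, _ in edges)
    columns = Counter(column for _, column in edges)
    return rows, columns

# Hypothetical toy dataset: an edge (patient, gene) means the gene is mutated.
edges = {("P1", "G1"), ("P1", "G2"), ("P2", "G2"),
         ("P2", "G3"), ("P3", "G3"), ("P3", "G4")}
before = degrees(edges)
rng = random.Random(0)
n_steps = 100  # placeholder; the paper derives a lower bound for this number
applied = sum(switching_step(edges, rng) for _ in range(n_steps))
print(applied, "swaps applied; degrees preserved:", degrees(edges) == before)
```

Because every accepted swap removes one edge from each of two rows and two columns and adds one back to each, patient-wise and gene-wise mutation counts are unchanged, which is the property the null model relies on.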
12.
Bioinformatics ; 29(3): 407-8, 2013 Feb 01.
Article in English | MEDLINE | ID: mdl-23242262

ABSTRACT

UNLABELLED: We introduce a novel implementation in ANSI C of the MINE family of algorithms for computing maximal information-based measures of dependence between two variables in large datasets, with the aim of a low memory footprint and ease of integration within bioinformatics pipelines. We provide the libraries minerva (with the R interface) and minepy for Python, MATLAB, Octave and C++. The C solution reduces the large memory requirement of the original Java implementation, has good upscaling properties and offers a native parallelization for the R interface. Low memory requirements are demonstrated on the MINE benchmarks as well as on large (n = 1340) microarray and Illumina GAII RNA-seq transcriptomics datasets. AVAILABILITY AND IMPLEMENTATION: Source code and binaries are freely available for download under GPL3 licence at http://minepy.sourceforge.net for minepy and through the CRAN repository http://cran.r-project.org for the R package minerva. All software is multiplatform (MS Windows, Linux and OSX).


Subjects
Software , Algorithms , Computational Biology , Data Mining , Gene Expression Profiling , Metagenome
13.
Sci Rep ; 14(1): 2847, 2024 02 03.
Article in English | MEDLINE | ID: mdl-38310171

ABSTRACT

Autosomal dominant polycystic kidney disease (ADPKD) is a monogenic, rare disease, characterized by the formation of multiple cysts that grow out of the renal tubules. Despite intensive attempts to develop new drugs or repurpose existing ones, there is currently no definitive cure for ADPKD. This is primarily due to the complex and variable pathogenesis of the disease and the lack of models that can faithfully reproduce the human phenotype. Therefore, the development of models that allow automated detection of cysts' growth directly on human kidney tissue is a crucial step in the search for efficient therapeutic solutions. Artificial Intelligence methods, and deep learning algorithms in particular, can provide powerful and effective solutions to such tasks, and indeed various architectures have been proposed in the literature in recent years. Here, we comparatively review state-of-the-art deep learning segmentation models, using as a testbed a set of sequential RGB immunofluorescence images from 4 in vitro experiments with 32 engineered polycystic kidney tubules. To gain a deeper understanding of the detection process, we implemented both pixel-wise and cyst-wise performance metrics to evaluate the algorithms. Overall, two models stand out as the best performing, namely UNet++ and UACANet: the latter uses a self-attention mechanism introducing some explainability aspects that can be further exploited in future developments, thus making it the most promising algorithm to build upon towards a more refined cyst-detection platform. The UACANet model achieves a cyst-wise Intersection over Union of 0.83, 0.91 for Recall, and 0.92 for Precision when applied to detect large-size cysts. On all-size cysts, UACANet averages 0.624 pixel-wise Intersection over Union. The code to reproduce all results is freely available in a public GitHub repository.


Subjects
Cysts , Autosomal Dominant Polycystic Kidney , Humans , Autosomal Dominant Polycystic Kidney/pathology , Artificial Intelligence , Kidney/diagnostic imaging , Kidney/pathology , Kidney Tubules , Cysts/diagnostic imaging , Cysts/pathology
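
The pixel-wise metrics named in the abstract above (Intersection over Union, Precision, Recall) can be computed by comparing a predicted binary mask with a ground-truth mask. The following is a minimal sketch with hypothetical function names and toy masks; it is not the authors' published code, and it does not cover the cyst-wise variant, which would additionally require matching predicted and annotated cysts in a way the abstract does not specify.

```python
def pixel_metrics(predicted, truth):
    """Return (IoU, precision, recall) for two equally sized 2D masks of 0/1."""
    tp = fp = fn = 0
    for predicted_row, truth_row in zip(predicted, truth):
        for p, t in zip(predicted_row, truth_row):
            if p and t:
                tp += 1  # pixel marked as cyst in both masks
            elif p:
                fp += 1  # predicted cyst pixel that is background
            elif t:
                fn += 1  # cyst pixel missed by the prediction
    union = tp + fp + fn
    iou = tp / union if union else 1.0  # two empty masks agree perfectly
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return iou, precision, recall

# Hypothetical 3 x 3 masks (1 = cyst pixel, 0 = background).
truth = [[0, 1, 1],
         [0, 1, 1],
         [0, 0, 0]]
predicted = [[0, 0, 1],
             [0, 1, 1],
             [0, 1, 0]]
iou, precision, recall = pixel_metrics(predicted, truth)
print(f"IoU = {iou:.2f}, precision = {precision:.2f}, recall = {recall:.2f}")
```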
14.
PLoS One ; 19(3): e0300127, 2024.
Article in English | MEDLINE | ID: mdl-38483951

ABSTRACT

BACKGROUND: The burden of Parkinson Disease (PD) represents a key public health issue and it is essential to develop innovative and cost-effective approaches to promote sustainable diagnostic and therapeutic interventions. From this perspective the adoption of a P3 (predictive, preventive and personalized) medicine approach seems to be pivotal. The NeuroArtP3 (NET-2018-12366666) is a four-year multi-site project co-funded by the Italian Ministry of Health, bringing together clinical and computational centers operating in the field of neurology, including PD. OBJECTIVE: The core objectives of the project are: i) to harmonize the collection of data across the participating centers, ii) to structure standardized disease-specific datasets and iii) to advance knowledge on disease trajectories through machine learning analysis. METHODS: The 4-year study combines two consecutive research components: i) a multi-center retrospective observational phase; ii) a multi-center prospective observational phase. The retrospective phase aims at collecting data of the patients admitted at the participating clinical centers, whereas the prospective phase aims at collecting the same variables as the retrospective study in newly diagnosed patients who will be enrolled at the same centers. RESULTS: The participating clinical centers are the Provincial Health Services (APSS) of Trento (Italy) as the center responsible for the PD study and the IRCCS San Martino Hospital of Genoa (Italy) as the promoter center of the NeuroArtP3 project. The computational centers responsible for data analysis are the Bruno Kessler Foundation of Trento (Italy) with TrentinoSalute4.0 - Competence Center for Digital Health of the Province of Trento (Italy) and the LISCOMPlab University of Genoa (Italy).
CONCLUSIONS: The work behind this observational study protocol shows that it is possible and viable to systematize data collection procedures in order to feed research and to advance the implementation of a P3 approach into clinical practice through the use of AI models.


Subjects
Artificial Intelligence , Parkinson Disease , Humans , Retrospective Studies , Prospective Studies , Parkinson Disease/diagnosis , Public Health , Observational Studies as Topic , Multicenter Studies as Topic
15.
BioData Min ; 16(1): 4, 2023 Feb 17.
Article in English | MEDLINE | ID: mdl-36800973

ABSTRACT

Binary classification is a common task for which machine learning and computational statistics are used, and the area under the receiver operating characteristic curve (ROC AUC) has become the common standard metric to evaluate binary classifications in most scientific fields. The ROC curve has true positive rate (also called sensitivity or recall) on the y axis and false positive rate on the x axis, and the ROC AUC can range from 0 (worst result) to 1 (perfect result). The ROC AUC, however, has several flaws and drawbacks. This score is generated including predictions that obtained insufficient sensitivity and specificity, and moreover it does not say anything about positive predictive value (also known as precision) nor negative predictive value (NPV) obtained by the classifier, therefore potentially generating inflated overoptimistic results. Since it is common to include ROC AUC alone without precision and negative predictive value, a researcher might erroneously conclude that their classification was successful. Furthermore, a given point in the ROC space does not identify a single confusion matrix nor a group of matrices sharing the same MCC value. Indeed, a given (sensitivity, specificity) pair can cover a broad MCC range, which casts doubts on the reliability of ROC AUC as a performance measure. In contrast, the Matthews correlation coefficient (MCC) generates a high score in its [-1, +1] interval only if the classifier scored a high value for all the four basic rates of the confusion matrix: sensitivity, specificity, precision, and negative predictive value. A high MCC (for example, MCC = 0.9), moreover, always corresponds to a high ROC AUC, and not vice versa. In this short study, we explain why the Matthews correlation coefficient should replace the ROC AUC as standard statistic in all the scientific studies involving a binary classification, in all scientific fields.
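
The statement in the abstract above that a single (sensitivity, specificity) pair can cover a broad MCC range can be checked numerically by varying the share of positive samples. The sketch below rebuilds a confusion matrix from the two rates and an assumed prevalence; the function name and the chosen rates and prevalences are hypothetical illustrations, not values from the paper.

```python
import math

def mcc_from_rates(sensitivity, specificity, prevalence, n=10_000):
    """MCC of the confusion matrix implied by the given rates on n samples."""
    positives = prevalence * n
    negatives = n - positives
    tp = sensitivity * positives
    fn = positives - tp
    tn = specificity * negatives
    fp = negatives - tn
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denominator if denominator else 0.0

# Same point in ROC space (sensitivity = specificity = 0.9), different prevalence.
for prevalence in (0.5, 0.1, 0.01):
    value = mcc_from_rates(0.9, 0.9, prevalence)
    print(f"prevalence = {prevalence}: MCC = {value:.3f}")
```

With balanced classes the MCC is 0.8, while with 1% positives the same sensitivity and specificity give an MCC below 0.3, because precision collapses when false positives outnumber true positives.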

16.
BioData Min ; 16(1): 6, 2023 Feb 23.
Article in English | MEDLINE | ID: mdl-36823520

ABSTRACT

Bioinformatics has become a key aspect of the biomedical research programmes of many hospitals' scientific centres, and the establishment of bioinformatics facilities within hospitals has become a common practice worldwide. Bioinformaticians working in these facilities provide computational biology support to medical doctors and principal investigators who are daily dealing with data of patients to analyze. These bioinformatics analysts, although pivotal, usually do not receive formal training for this job. We therefore propose these ten simple rules to guide these bioinformaticians in their work: ten pieces of advice on how to provide bioinformatics support to medical doctors in hospitals. We believe these simple rules can help bioinformatics facility analysts in producing better scientific results and work in a serene and fruitful environment.

17.
BioData Min ; 16(1): 7, 2023 Mar 04.
Article in English | MEDLINE | ID: mdl-36870971

ABSTRACT

Neuroblastoma is a childhood neurological tumor which affects hundreds of thousands of children worldwide, and information about its prognosis can be pivotal for patients, their families, and clinicians. One of the main goals in the related bioinformatics analyses is to provide stable genetic signatures able to include genes whose expression levels can be effective to predict the prognosis of the patients. In this study, we collected the prognostic signatures for neuroblastoma published in the biomedical literature, and noticed that the most frequent genes present among them were three: AHCY, DPYSL3, and NME1. We therefore investigated the prognostic power of these three genes by performing a survival analysis and a binary classification on multiple gene expression datasets of different groups of patients diagnosed with neuroblastoma. Finally, we discussed the main studies in the literature associating these three genes with neuroblastoma. Our results, in each of these three steps of validation, confirm the prognostic capability of AHCY, DPYSL3, and NME1, and highlight their key role in neuroblastoma prognosis. Our results can have an impact on neuroblastoma genetics research: biologists and medical researchers can pay more attention to the regulation and expression of these three genes in patients having neuroblastoma, and therefore can develop better cures and treatments which can save patients' lives.

18.
Comput Biol Med ; 152: 106373, 2023 01.
Article in English | MEDLINE | ID: mdl-36462367

ABSTRACT

Systemic lupus erythematosus and primary Sjogren's syndrome are complex systemic autoimmune diseases that are often misdiagnosed. In this article, we demonstrate the potential of machine learning to perform differential diagnosis of these similar pathologies using gene expression and methylation data from 651 individuals. Furthermore, we analyzed the impact of the heterogeneity of these diseases on the performance of the predictive models, discovering that patients assigned to a specific molecular cluster are misclassified more often and affect the overall performance of the predictive models. In addition, we found that the samples characterized by a high interferon activity are the ones predicted with more accuracy, followed by the samples with high inflammatory activity. Finally, we identified a group of biomarkers that improve the predictions compared to using the whole data and we validated them with external studies from other tissues and technological platforms.


Subjects
Lupus Erythematosus, Systemic , Sjogren's Syndrome , Humans , Sjogren's Syndrome/diagnosis , Sjogren's Syndrome/genetics , Diagnosis, Differential , Multiomics , Lupus Erythematosus, Systemic/diagnosis , Lupus Erythematosus, Systemic/genetics , Machine Learning
19.
BioData Min ; 16(1): 33, 2023 Nov 25.
Article in English | MEDLINE | ID: mdl-38001537

ABSTRACT

BACKGROUND: Discrimination between patients affected by inflammatory bowel diseases and healthy controls on the basis of endoscopic imaging is a challenging problem for machine learning models. This task is used here as the testbed for a novel deep learning classification pipeline, powered by a set of solutions enhancing characterising elements such as reproducibility, interpretability, reduced computational workload, bias-free modeling and careful image preprocessing. RESULTS: First, an automatic preprocessing procedure is devised to remove artifacts from clinical data; the resulting images are then fed to an aggregated per-patient model that mimics the clinicians' decision process. The predictions are based on multiple snapshots obtained through resampling, and the risk of misleading outcomes is reduced by removing the low-confidence predictions. Each patient's outcome is explained by returning the images the prediction is based upon, supporting clinicians in verifying diagnoses without the need to evaluate the full set of endoscopic images. As a major theoretical contribution, quantization is employed to reduce the complexity and the computational cost of the model, allowing its deployment on low-power devices with an almost negligible 3% performance degradation. This quantization procedure is relevant not only for per-patient models but also for assessing the feasibility of providing real-time support to clinicians, even in low-resource environments. The pipeline is demonstrated on a private dataset of endoscopic images of 758 IBD patients and 601 healthy controls, achieving a top Matthews correlation coefficient of 0.9 on the test set. CONCLUSION: We highlighted how a comprehensive preprocessing pipeline plays a crucial role in identifying and removing artifacts from data, solving one of the principal challenges encountered when working with clinical data. Furthermore, we constructively showed how it is possible to emulate the clinicians' decision process and how doing so offers significant advantages, particularly in terms of explainability and trust within the healthcare context. Last but not least, we showed that quantization can be a useful tool to reduce time and resource consumption with an acceptable degradation of model performance. The quantization study proposed in this work points to the potential development of real-time quantized algorithms as valuable tools to support clinicians during endoscopy procedures.
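
For readers unfamiliar with the metric reported above, the Matthews correlation coefficient can be computed from the four cells of a binary confusion matrix. The sketch below uses the standard textbook definition; the function name is illustrative, it is not taken from the study's pipeline, and returning 0.0 when the denominator is zero is an assumed convention.

```python
import math


def matthews_corrcoef(tp: int, tn: int, fp: int, fn: int) -> float:
    """Matthews correlation coefficient from binary confusion-matrix counts.

    tp, tn, fp, fn: true positives, true negatives, false positives,
    false negatives. The result lies in [-1, 1]; 1 means perfect agreement,
    0 means no better than chance, -1 means total disagreement.
    """
    numerator = tp * tn - fp * fn
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # Assumed convention: if any marginal total is zero the metric is
    # undefined, so return 0.0 rather than dividing by zero.
    if denominator == 0:
        return 0.0
    return numerator / denominator
```

Because every cell of the confusion matrix enters the formula, the coefficient stays informative when the two classes differ in size, as they do in the 758-patient versus 601-control dataset described in the abstract.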

20.
Sci Total Environ ; 905: 167095, 2023 Dec 20.
Article in English | MEDLINE | ID: mdl-37748607

ABSTRACT

The ongoing and future climate-change-driven expansion of aeroallergen-producing plant species constitutes a major human health problem across Europe and elsewhere. There is an urgent need to produce accurate, temporally dynamic maps at the continental level, especially in the context of climate uncertainty. This study aimed to restore missing daily ragweed pollen data sets for Europe and to produce phenological maps of ragweed pollen, resulting in the most complete and detailed high-resolution ragweed pollen concentration maps to date. To achieve this, we developed two statistical procedures, a Gaussian method (GM) and deep learning (DL), for restoring missing daily ragweed pollen data sets, based on the plant's reproductive and growth (phenological, pollen production and frost-related) characteristics. DL model performance was consistently better than that of the GM approach for estimating seasonal pollen integrals. These are the first published modelled maps using altitude correction and flowering phenology to recover missing pollen information. We created a web page (http://euragweedpollen.gmf.u-szeged.hu/) that includes the daily ragweed pollen concentration data sets of the stations examined and their restored daily data, and that allows users to upload newly measured or recovered daily data. Generation of these maps provides a means to track pollen impacts in the context of climatic shifts, identify geographical regions with high pollen exposure, determine areas of future vulnerability, apply spatially explicit mitigation measures and prioritize management interventions.


Subjects
Allergens , Ambrosia , Humans , Europe , Pollen