Results 1 - 20 of 185
1.
J Bioinform Comput Biol ; 22(2): 2471001, 2024 Apr.
Article in English | MEDLINE | ID: mdl-38779779

ABSTRACT

ChatGPT, a recently developed product by OpenAI, is successfully leaving its mark as a multi-purpose natural-language-based chatbot. In this paper, we analyze its potential in the field of computational biology. A major share of the work done by computational biologists these days involves coding up bioinformatics algorithms, analyzing data, creating pipelining scripts, and even machine learning modeling and feature extraction. This paper focuses on the potential influence (both positive and negative) of ChatGPT in the mentioned aspects with illustrative examples from different perspectives. Compared to other fields of computer science, computational biology has (1) fewer coding resources, (2) more sensitivity and bias issues (it deals with medical data), and (3) a greater need for coding assistance (people from diverse backgrounds come to this field). Keeping such issues in mind, we cover use cases such as code writing, reviewing, debugging, converting, refactoring, and pipelining using ChatGPT from the perspective of computational biologists.


Subject(s)
Algorithms, Computational Biology, Computational Biology/methods, Software, Programming Languages, Humans, Natural Language Processing, Machine Learning
2.
Environ Res ; 250: 118523, 2024 Jun 01.
Article in English | MEDLINE | ID: mdl-38382664

ABSTRACT

BACKGROUND: Most previous research on the environmental epidemiology of childhood atopic eczema, rhinitis and wheeze is limited in the scope of risk factors studied. Our study adopted a machine learning approach to explore the role of the exposome starting already in the preconception phase. METHODS: We performed a combined analysis of two multi-ethnic Asian birth cohorts, the Growing Up in Singapore Towards healthy Outcomes (GUSTO) and the Singapore PREconception Study of long Term maternal and child Outcomes (S-PRESTO) cohorts. Interviewer-administered questionnaires were used to collect information on demography, lifestyle and childhood atopic eczema, rhinitis and wheeze development. Data training was performed using XGBoost, genetic algorithm and logistic regression models, and the top variables with the highest importance were identified. Additive explanation values were identified and inputted into a final multiple logistic regression model. Generalised structural equation modelling with maternal and child blood micronutrients, metabolites and cytokines was performed to explain possible mechanisms. RESULTS: The final study population included 1151 mother-child pairs. Our findings suggest that these childhood diseases are likely programmed in utero by the preconception and pregnancy exposomes through inflammatory pathways. We identified preconception alcohol consumption and maternal depressive symptoms during pregnancy as key modifiable maternal environmental exposures that increased eczema and rhinitis risk. Our mechanistic model suggested that higher maternal blood neopterin and child blood dimethylglycine protected against early childhood wheeze. After birth, early infection was a key driver of atopic eczema and rhinitis development. CONCLUSION: Preconception and antenatal exposomes can programme atopic eczema, rhinitis and wheeze development in utero. 
Reducing maternal alcohol consumption during preconception and supporting maternal mental health during pregnancy may prevent atopic eczema and rhinitis by promoting an optimal antenatal environment. Our findings suggest a need to include preconception environmental exposures in future research to counter the earliest precursors of disease development in children.


Subject(s)
Atopic Dermatitis, Exposome, Machine Learning, Respiratory Sounds, Rhinitis, Humans, Atopic Dermatitis/epidemiology, Female, Rhinitis/epidemiology, Male, Preschool Child, Singapore/epidemiology, Pregnancy, Maternal Exposure, Child, Adult, Prenatal Exposure Delayed Effects/epidemiology, Infant, Cohort Studies
3.
Brief Bioinform ; 24(4)2023 07 20.
Article in English | MEDLINE | ID: mdl-37419612

ABSTRACT

Missing values (MVs) can adversely impact data analysis and machine-learning model development. We propose a novel mixed-model method for missing value imputation (MVI). This method, ProJect (short for Protein inJection), is a powerful and meaningful improvement over existing MVI methods such as Bayesian principal component analysis (PCA), probabilistic PCA, local least squares and quantile regression imputation of left-censored data. We rigorously tested ProJect on various high-throughput data types, including genomics and mass spectrometry (MS)-based proteomics. Specifically, we utilized renal cancer (RC) data acquired using DIA-SWATH, ovarian cancer (OC) data acquired using DIA-MS, and bladder (BladderBatch) and glioblastoma (GBM) microarray gene expression datasets. Our results demonstrate that ProJect consistently performs better than other referenced MVI methods. It achieves the lowest normalized root mean square error (on average, scoring 45.92% less error in RC_C, 27.37% in RC_full, 29.22% in OC, 23.65% in BladderBatch and 20.20% in GBM relative to the closest competing method) and the lowest Procrustes sum of squared error (Procrustes SS) (exhibiting 79.71% less error in RC_C, 38.36% in RC_full, 18.13% in OC, 74.74% in BladderBatch and 30.79% in GBM compared to the next best method). ProJect also leads with the highest correlation coefficient among all types of MV combinations (0.64% higher in RC_C, 0.24% in RC_full, 0.55% in OC, 0.39% in BladderBatch and 0.27% in GBM versus the second-best performing method). ProJect's key strength is its ability to handle different types of MVs commonly found in real-world data. Unlike most MVI methods that are designed to handle only one type of MV, ProJect employs a decision-making algorithm that first determines if an MV is missing at random or missing not at random. It then employs targeted imputation strategies for each MV type, resulting in more accurate and reliable imputation outcomes.
An R implementation of ProJect is available at https://github.com/miaomiao6606/ProJect.
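
The error metric quoted above can be illustrated with a minimal sketch. It is not taken from ProJect: the function names `nrmse` and `relative_error_reduction` are hypothetical, and normalizing the RMSE by the standard deviation of the true values is an assumption, since the abstract does not state which normalization the authors use.

```python
import math
from statistics import pstdev


def nrmse(true_values, imputed_values):
    """RMSE over the masked entries, divided by the population standard
    deviation of the true values (assumed normalization, see lead-in)."""
    if len(true_values) != len(imputed_values) or not true_values:
        raise ValueError("inputs must be non-empty and of equal length")
    mse = sum((t - i) ** 2 for t, i in zip(true_values, imputed_values)) / len(true_values)
    spread = pstdev(true_values)
    if spread == 0:
        raise ValueError("true values have zero variance; NRMSE is undefined")
    return math.sqrt(mse) / spread


def relative_error_reduction(nrmse_new, nrmse_reference):
    """Percent by which a new method's NRMSE is lower than a reference method's."""
    return 100.0 * (nrmse_reference - nrmse_new) / nrmse_reference


# Hypothetical values: entries that were masked on purpose, then imputed.
truth = [1.0, 2.0, 3.0, 4.0]
imputed = [2.0, 3.0, 4.0, 5.0]
print(nrmse(truth, imputed))
```

A statement such as "45.92% less error relative to the closest competing method" would correspond to `relative_error_reduction` applied to the two methods' NRMSE values, assuming that is how the percentages were derived.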


Subject(s)
Algorithms, Genomics, Bayes Theorem, Oligonucleotide Array Sequence Analysis/methods, Mass Spectrometry/methods
4.
Mol Cell ; 83(14): 2595-2611.e11, 2023 07 20.
Article in English | MEDLINE | ID: mdl-37421941

ABSTRACT

RNA-binding proteins (RBPs) control RNA metabolism to orchestrate gene expression and, when dysfunctional, underlie human diseases. Proteome-wide discovery efforts predict thousands of RBP candidates, many of which lack canonical RNA-binding domains (RBDs). Here, we present a hybrid ensemble RBP classifier (HydRA), which leverages information from both intermolecular protein interactions and internal protein sequence patterns to predict RNA-binding capacity with unparalleled specificity and sensitivity using support vector machines (SVMs), convolutional neural networks (CNNs), and Transformer-based protein language models. Occlusion mapping by HydRA robustly detects known RBDs and predicts hundreds of uncharacterized RNA-binding associated domains. Enhanced CLIP (eCLIP) for HydRA-predicted RBP candidates reveals transcriptome-wide RNA targets and confirms RNA-binding activity for HydRA-predicted RNA-binding associated domains. HydRA accelerates construction of a comprehensive RBP catalog and expands the diversity of RNA-binding associated domains.


Subject(s)
Deep Learning, Hydra, Animals, Humans, RNA/metabolism, Protein Binding, Binding Sites/genetics, Hydra/genetics, Hydra/metabolism
5.
Drug Discov Today ; 28(9): 103661, 2023 09.
Article in English | MEDLINE | ID: mdl-37301250

ABSTRACT

In data-processing pipelines, upstream steps can influence downstream processes because of their sequential nature. Among these data-processing steps, batch effect (BE) correction (BEC) and missing value imputation (MVI) are crucial for ensuring data suitability for advanced modeling and reducing the likelihood of false discoveries. Although BEC-MVI interactions are not well studied, they are ultimately interdependent. Batch sensitization can improve the quality of MVI. Conversely, accounting for missingness also improves proper BE estimation in BEC. Here, we discuss how BEC and MVI are interconnected and interdependent. We show how batch sensitization can improve any MVI and bring attention to the idea of BE-associated missing values (BEAMs). Finally, we discuss how batch-class imbalance problems can be mitigated by borrowing ideas from machine learning.
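
The idea that batch information can improve imputation can be sketched as follows. This is only an illustration under stated assumptions: simple mean imputation stands in for whichever MVI method is actually used, and the function name `impute_feature` and the toy data are hypothetical, not taken from the paper.

```python
from collections import defaultdict


def impute_feature(values, batches, batch_aware=True):
    """Fill missing entries (None) of one feature.

    batch_aware=True uses the mean of observed values in the same batch;
    otherwise, or if that batch has no observed value, the global mean is used.
    """
    observed = [v for v in values if v is not None]
    if not observed:
        raise ValueError("feature has no observed values")
    global_mean = sum(observed) / len(observed)

    # Collect the observed values of each batch.
    per_batch = defaultdict(list)
    for value, batch in zip(values, batches):
        if value is not None:
            per_batch[batch].append(value)

    result = []
    for value, batch in zip(values, batches):
        if value is not None:
            result.append(value)
        elif batch_aware and per_batch[batch]:
            result.append(sum(per_batch[batch]) / len(per_batch[batch]))
        else:
            result.append(global_mean)
    return result


# Toy feature with a strong batch shift between batches "A" and "B".
values = [1.0, None, 3.0, 11.0, None, 13.0]
batches = ["A", "A", "A", "B", "B", "B"]
print(impute_feature(values, batches, batch_aware=True))   # [1.0, 2.0, 3.0, 11.0, 12.0, 13.0]
print(impute_feature(values, batches, batch_aware=False))  # [1.0, 7.0, 3.0, 11.0, 7.0, 13.0]
```

In this toy case the batch-agnostic fill (7.0) lies far from every observed value in either batch, which is the kind of distortion that can then bias a downstream batch effect estimate.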


Subject(s)
Electronic Data Processing
6.
PLoS Comput Biol ; 19(3): e1010961, 2023 03.
Article in English | MEDLINE | ID: mdl-36930671

ABSTRACT

In mass spectrometry (MS)-based proteomics, protein inference from identified peptides (protein fragments) is a critical step. We present ProInfer (Protein Inference), a novel protein assembly method that takes advantage of information in biological networks. ProInfer assists recovery of proteins supported only by ambiguous peptides (peptides that map to more than one candidate protein) and enhances the statistical confidence for proteins supported by both unique and ambiguous peptides. Consequently, ProInfer rescues weakly supported proteins, thereby improving proteome coverage. Evaluated across THP1 cell line, lung cancer and RAW267.4 datasets, ProInfer always infers the largest number of true positives in comparison to the mainstream protein inference tools Fido, EPIFANY and PIA. ProInfer is also adept at retrieving differentially expressed proteins, signifying its usefulness for functional analysis and phenotype profiling. Source code for ProInfer is available at https://github.com/PennHui2016/ProInfer.
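
The distinction between unique and ambiguous peptide support can be made concrete with a short sketch. It only implements the definition given in the abstract (an ambiguous peptide maps to more than one candidate protein); it is not ProInfer's inference procedure, and the function name and the peptide and protein identifiers are hypothetical.

```python
def classify_protein_support(peptide_to_proteins):
    """Group candidate proteins by the kind of peptide evidence they have.

    peptide_to_proteins maps each identified peptide to the candidate
    proteins it could come from.
    """
    unique_hits, ambiguous_hits = set(), set()
    for proteins in peptide_to_proteins.values():
        candidates = set(proteins)
        if len(candidates) == 1:
            unique_hits |= candidates      # peptide maps to exactly one protein
        elif len(candidates) > 1:
            ambiguous_hits |= candidates   # peptide maps to several proteins
    return {
        "unique_only": unique_hits - ambiguous_hits,
        "unique_and_ambiguous": unique_hits & ambiguous_hits,
        "ambiguous_only": ambiguous_hits - unique_hits,
    }


# Made-up identifiers for illustration.
evidence = {
    "PEP_A": ["P1"],
    "PEP_B": ["P1", "P2"],
    "PEP_C": ["P2", "P3"],
    "PEP_D": ["P4"],
}
print(classify_protein_support(evidence))
```

Here `P2` and `P3` fall in the `ambiguous_only` group, the weakly supported class that the abstract says ProInfer aims to recover using network information.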


Subject(s)
Algorithms, Peptides, Peptides/chemistry, Proteome/analysis, Mass Spectrometry, Proteomics/methods, Protein Databases, Software
7.
J Bioinform Comput Biol ; 21(2): 2371001, 2023 04.
Article in English | MEDLINE | ID: mdl-36938598

ABSTRACT

Despite an exponential increase in publications on clinical prediction models over recent years, the number of models deployed in clinical practice remains fairly limited. In this paper, we identify common obstacles that impede effective deployment of prediction models in healthcare, and investigate their underlying causes. We observe a key underlying cause behind most obstacles - the improper development and evaluation of prediction models. Inherent heterogeneities in clinical data complicate the development and evaluation of clinical prediction models. Many of these heterogeneities in clinical data are unreported because they are deemed to be irrelevant, or due to privacy concerns. We provide real-life examples where failure to handle heterogeneities in clinical data, or sources of biases, led to the development of erroneous models. The purpose of this paper is to familiarize modeling practitioners with common sources of biases and heterogeneities in clinical data, both of which have to be dealt with to ensure proper development and evaluation of clinical prediction models. Proper model development and evaluation, together with complete and thorough reporting, are important prerequisites for a prediction model to be effectively deployed in healthcare.


Subject(s)
Clinical Decision Rules, Delivery of Health Care, Statistical Models
8.
J Bioinform Comput Biol ; 21(1): 2350005, 2023 02.
Article in English | MEDLINE | ID: mdl-36891972

ABSTRACT

Some prediction methods use probability to rank their predictions, while other prediction methods do not rank their predictions and instead use p-values to support them. This disparity renders direct cross-comparison of these two kinds of methods difficult. In particular, approaches such as the Bayes Factor upper Bound (BFB) for p-value conversion may not make correct assumptions for this kind of cross-comparison. Here, using a well-established case study on renal cancer proteomics and in the context of missing protein prediction, we demonstrate how to compare these two kinds of prediction methods using two different strategies. The first strategy is based on false discovery rate (FDR) estimation, which does not make the same naïve assumptions as BFB conversions. The second strategy is a powerful approach which we colloquially call "home ground testing". Both strategies perform better than BFB conversions. Thus, we recommend comparing prediction methods by standardization to a common performance benchmark such as a global FDR. Where this is not possible, we recommend reciprocal "home ground testing".
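
For orientation, the BFB conversion that the abstract argues against can be sketched as below. The sketch assumes the commonly cited form of the bound, 1 / (-e p ln p) for p < 1/e, and equal prior odds by default; the abstract does not spell out the exact conversion used in the paper, and the function names are hypothetical.

```python
import math


def bayes_factor_upper_bound(p):
    """Upper bound on the Bayes factor in favor of the alternative for a p-value.

    Assumed form: 1 / (-e * p * ln p) for p < 1/e, and 1 otherwise.
    """
    if not 0.0 < p <= 1.0:
        raise ValueError("p must lie in (0, 1]")
    if p >= 1.0 / math.e:
        return 1.0
    return 1.0 / (-math.e * p * math.log(p))


def posterior_upper_bound(p, prior_odds=1.0):
    """Largest posterior probability of the alternative implied by the bound."""
    posterior_odds = prior_odds * bayes_factor_upper_bound(p)
    return posterior_odds / (1.0 + posterior_odds)


for p in (0.05, 0.01, 0.001):
    print(p, round(bayes_factor_upper_bound(p), 2), round(posterior_upper_bound(p), 3))
```

Under these assumptions, p = 0.05 corresponds to a bound of roughly 2.46, or a posterior probability of at most about 0.71. Because the result is an upper bound that also depends on the assumed prior odds, treating it as if it were a method's own probability score is a strong assumption.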


Subject(s)
Proteins, Proteomics, Bayes Theorem, Probability
9.
J Bioinform Comput Biol ; 20(6): 2271001, 2022 12.
Article in English | MEDLINE | ID: mdl-36514873

ABSTRACT

Clinical prediction models are widely used to predict adverse outcomes in patients, and are often employed to guide clinical decision-making. Clinical data typically consist of patients who received different treatments. Many prediction modeling studies fail to account for differences in patient treatment appropriately, which results in the development of prediction models that show poor accuracy and generalizability. In this paper, we list the most common methods used to handle patient treatments and discuss certain caveats associated with each method. We believe that proper handling of differences in patient treatment is crucial for the development of accurate and generalizable models. As different treatment strategies are employed for different diseases, the best approach to properly handle differences in patient treatment is specific to each individual situation. We use the Ma-Spore acute lymphoblastic leukemia data set as a case study to demonstrate the complexities associated with differences in patient treatment, and offer suggestions on incorporating treatment information during evaluation of prediction models. In clinical data, patients are typically treated on a case by case basis, with unique cases occurring more frequently than expected. Hence, there are many subtleties to consider during the analysis and evaluation of clinical prediction models.

10.
Nucleic Acids Res ; 50(21): e122, 2022 11 28.
Article in English | MEDLINE | ID: mdl-36124665

ABSTRACT

Tree- and linear-shaped cell differentiation trajectories have been widely observed in developmental biology and can also be inferred through computational methods from single-cell RNA-sequencing datasets. However, trajectories with complicated topologies such as loops, disparate lineages and bifurcating hierarchy remain difficult to infer accurately. Here, we introduce a density-based trajectory inference method capable of constructing diverse shapes of topological patterns including the most intriguing bifurcations. The novelty of our method lies in one step that exploits overlapping probability distributions to identify transition states of cells for determining connectability between cell clusters, and another step that infers a stable trajectory through a base-topology guided iterative fitting. Our method precisely re-constructed various benchmark reference trajectories. As a case study to demonstrate practical usefulness, our method was tested on single-cell RNA sequencing profiles of blood cells of SARS-CoV-2-infected patients. We not only re-discovered the linear trajectory bridging the transition from IgM plasmablast cells to developing neutrophils, but also found a previously undiscovered lineage which can be rigorously supported by differentially expressed gene analysis.


Subject(s)
COVID-19, Single-Cell Analysis, Humans, Single-Cell Analysis/methods, SARS-CoV-2, COVID-19/genetics, Cell Differentiation/genetics
11.
Sci Rep ; 12(1): 11358, 2022 07 05.
Article in English | MEDLINE | ID: mdl-35790756

ABSTRACT

Despite technological advances in proteomics, incomplete coverage and inconsistency issues persist, resulting in "data holes". These data holes cause the missing protein problem (MPP), where relevant proteins are persistently unobserved, or sporadically observed across samples, hindering biomarker discovery and proper functional characterization. Network-based approaches can provide powerful solutions for resolving these issues. Functional Class Scoring (FCS) is one such method that uses protein complex information to recover missing proteins with weak support. However, FCS has not been evaluated on more recent proteomic technologies with higher coverage, and there is no clear way to evaluate its performance. To address these issues, we devised a more rigorous evaluation schema based on cross-verification between technical replicates and evaluated its performance on data acquired under recent Data-Independent Acquisition (DIA) technologies (viz. SWATH). Although cross-replicate examination reveals some inconsistencies amongst same-class samples, tissue-differentiating signal is nonetheless strongly conserved, confirming that FCS selects for biologically meaningful networks. We also report that predicted missing proteins are statistically significant based on FCS p values. Despite limited cross-replicate verification rates, the predicted missing proteins as a whole have higher peptide support than non-predicted proteins. FCS also predicts missing proteins that are often lost due to weak specific peptide support.


Subject(s)
Biomedical Research, Proteomics, Peptides, Proteins/chemistry, Proteomics/methods, Research Design
12.
Trends Biotechnol ; 40(9): 1029-1040, 2022 09.
Article in English | MEDLINE | ID: mdl-35282901

ABSTRACT

Batch effects (BEs) are technical biases that may confound analysis of high-throughput biotechnological data. BEs are complex and effective mitigation is highly context-dependent. In particular, the advent of high-resolution technologies such as single-cell RNA sequencing presents new challenges. We first cover how BE modeling differs between traditional datasets and the new data landscape. We also discuss new approaches for measuring and mitigating BEs, including whether a BE is significant enough to warrant correction. Even with the advent of machine learning and artificial intelligence, the increased complexity of next-generation biotechnological data means increased complexities in BE management. We forecast that BEs will not only remain relevant in the age of big data but will become even more important.


Subject(s)
Artificial Intelligence, Big Data, Machine Learning
13.
BMC Bioinformatics ; 23(1): 90, 2022 Mar 14.
Article in English | MEDLINE | ID: mdl-35287576

ABSTRACT

BACKGROUND: Current protein family modeling methods such as profile Hidden Markov Models (pHMMs), k-mer based methods, and deep learning-based methods do not provide very accurate protein function prediction for proteins in the twilight zone, due to low sequence similarity to reference proteins with known functions. RESULTS: We present a novel method, EnsembleFam, aiming at better function prediction for proteins in the twilight zone. EnsembleFam extracts the core characteristics of a protein family using similarity and dissimilarity features calculated from sequence homology relations. EnsembleFam trains three separate Support Vector Machine (SVM) classifiers for each family using these features, and an ensemble prediction is made to classify novel proteins into these families. Extensive experiments are conducted using the Clusters of Orthologous Groups (COG) dataset and the G Protein-Coupled Receptor (GPCR) dataset. EnsembleFam not only outperforms state-of-the-art methods on the overall dataset but also provides a much more accurate prediction for twilight zone proteins. CONCLUSIONS: EnsembleFam, a machine learning method to model protein families, can be used to better identify members with very low sequence homology. Using EnsembleFam, protein functions can be predicted from sequence information alone with better accuracy than state-of-the-art methods.


Subject(s)
Proteins, Support Vector Machine, Humans, Proteins/metabolism
14.
Data Brief ; 41: 107919, 2022 Apr.
Article in English | MEDLINE | ID: mdl-35198691

ABSTRACT

We present four datasets on proteomics profiling of HeLa and SiHa cell lines associated with the research described in the paper "PROTREC: A probability-based approach for recovering missing proteins based on biological networks" [1]. Proteins in each cell line were acquired by two different data acquisition methods. The first was Data Dependent Acquisition-Parallel Accumulation Serial Fragmentation (DDA-PASEF) and the second was Parallel Accumulation-Serial Fragmentation combined with data-independent acquisition (diaPASEF) [2], [3]. Protein assembly was performed following a search against the Swiss-Prot Human database using Peaks Studio for DDA datasets and Spectronaut for DIA datasets. The assembled result contains identified PSMs, peptides and proteins that are above threshold for each HeLa and SiHa sample. Coverage-wise, approximately 6,090 and 7,298 proteins were quantified by DDA-PASEF for the HeLa and SiHa samples, respectively, while 13,339 and 8,773 proteins were quantified by diaPASEF for the HeLa and SiHa samples, respectively. Consistency-wise, diaPASEF has fewer missing values (∼2%) compared to its DDA counterparts (∼5-7%). The mass spectrometry proteomics data have been deposited to the ProteomeXchange Consortium (http://proteomecentral.proteomexchange.org) via the iProX partner repository [4] with the dataset identifier PXD029773.

15.
J Proteomics ; 250: 104392, 2022 01 06.
Article in English | MEDLINE | ID: mdl-34626823

ABSTRACT

A novel network-based approach for predicting missing proteins (MPs) is proposed here. This approach, PROTREC (short for PROtein RECovery), dominates existing network-based methods - such as Functional Class Scoring (FCS), Hypergeometric Enrichment (HE), and Gene Set Enrichment Analysis (GSEA) - across a variety of proteomics datasets derived from different proteomics data acquisition paradigms: Higher PROTREC scores are much more closely correlated with higher recovery rates of MPs across sample replicates. The PROTREC score, unlike methods reporting p-values, can be directly interpreted as the probability that an unreported protein in a proteomic screen is actually present in the sample being screened. SIGNIFICANCE: Mass spectrometry (MS) has developed rapidly in recent years; however, an obvious proportion of proteins is still undetected, leading to missing protein problems. A few existing protein recovery methods are based on biological networks, but the performance is not satisfactory. We propose a new protein recovery method, PROTREC, a Bayesian-inspired approach based on biological networks, which shows exceptional performance across multiple validation strategies. It does not rely on peptide information, so it avoids the ambiguity issue that most protein assembly methods face.


Subject(s)
Proteins, Proteomics, Mass Spectrometry, Probability, Proteins/chemistry, Proteomics/methods
16.
Drug Discov Today ; 27(3): 678-685, 2022 03.
Article in English | MEDLINE | ID: mdl-34743902

ABSTRACT

Machine learning (ML) models have been increasingly adopted in drug development for faster identification of potential targets. Cross-validation techniques are commonly used to evaluate these models. However, the reliability of such validation methods can be affected by the presence of data doppelgängers. Data doppelgängers occur when independently derived data are very similar to each other, causing models to perform well regardless of how they are trained (i.e., the doppelgänger effect). Despite the abundance of data doppelgängers in biomedical data and their inflationary effects, they remain uncharacterized. We show their prevalence in biomedical data, demonstrate how doppelgängers arise, and provide proof of their confounding effects. To mitigate the doppelgänger effect, we recommend identifying data doppelgängers before the training-validation split.
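
The recommendation to identify data doppelgängers before the training-validation split can be sketched as a screening step. The abstract does not say how doppelgängers are detected, so this sketch assumes one plausible criterion, a high pairwise Pearson correlation between sample profiles; the threshold of 0.95, the function names and the toy data are all hypothetical.

```python
import math
from itertools import combinations


def pearson(x, y):
    """Pearson correlation between two equal-length numeric profiles."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    norm_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    if norm_x == 0 or norm_y == 0:
        raise ValueError("correlation is undefined for a constant profile")
    return cov / (norm_x * norm_y)


def find_candidate_doppelgangers(samples, threshold=0.95):
    """Return (id_a, id_b, r) for every sample pair whose correlation
    reaches the (assumed) threshold. samples maps sample id -> profile."""
    flagged = []
    for (id_a, a), (id_b, b) in combinations(samples.items(), 2):
        r = pearson(a, b)
        if r >= threshold:
            flagged.append((id_a, id_b, r))
    return flagged


# Toy profiles: s1 and s2 are near-duplicates, s3 is unrelated.
samples = {
    "s1": [1.0, 2.0, 3.0, 4.0],
    "s2": [1.1, 2.0, 3.2, 3.9],
    "s3": [4.0, 1.0, 3.0, 2.0],
}
for id_a, id_b, r in find_candidate_doppelgangers(samples):
    print(id_a, id_b, round(r, 3))
```

One possible use is to keep each flagged pair on the same side of the split (or to drop one member), so that a near-duplicate of a training sample does not end up inflating validation performance.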


Subject(s)
Machine Learning, Reproducibility of Results
17.
BMC Bioinformatics ; 22(1): 250, 2021 May 15.
Article in English | MEDLINE | ID: mdl-33992077

ABSTRACT

BACKGROUND: A pair of genes is defined as synthetically lethal if defects in both cause the death of the cell but a defect in only one of the two is compatible with cell viability. Ideally, if A and B are two synthetic lethal genes, inhibiting B should kill cancer cells with a defect in A, and should have no effect on normal cells. Thus, synthetic lethality can be exploited for highly selective cancer therapies, which need to exploit differences between normal and cancer cells. RESULTS: In this paper, we present a new method for predicting synthetic lethal (SL) gene pairs. As neighbouring genes in the genome have highly correlated profiles of copy number variations (CNAs), our method clusters proximal genes with a similar CNA profile, then predicts mutually exclusive group pairs, and finally identifies the SL gene pairs within each group pair. For mutual-exclusion testing we use a graph-based method which takes into account the mutation frequencies of different subjects and genes. We use two different methods for selecting the pair of SL genes: the first is based on the gene essentiality measured in various conditions by means of the "Gene Activity Ranking Profile" (GARP) score; the second leverages the annotations of genes to biological pathways. CONCLUSIONS: This method is unique among current SL prediction approaches; it reduces false-positive SL predictions compared to previous methods, and it allows establishing explicit collateral lethality relationships of gene pairs within mutually exclusive group pairs.


Subject(s)
DNA Copy Number Variations, Lethal Genes, DNA
18.
Bioinformatics ; 37(3): 289-295, 2021 04 20.
Article in English | MEDLINE | ID: mdl-32761066

ABSTRACT

MOTIVATION: Existing genome assembly evaluation metrics provide only limited insight into specific aspects of genome assembly quality, and sometimes even disagree with each other. For better integrative comparison between assemblies, we propose a new genome assembly evaluation metric, Pairwise Distance Reconstruction (PDR). It derives from a common concern in genetic studies, and takes completeness, contiguity, and correctness into consideration. We also propose an approximation implementation to accelerate PDR computation. RESULTS: Our results on publicly available datasets affirm PDR's ability to integratively assess the quality of a genome assembly. In fact, this is guaranteed by its definition. The results also indicate that the error introduced by approximation is extremely small and thus negligible. AVAILABILITY AND IMPLEMENTATION: https://github.com/XLuyu/PDR. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.


Subject(s)
Genome, High-Throughput Nucleotide Sequencing, Software, DNA Sequence Analysis
19.
Bioinformatics ; 37(6): 750-758, 2021 05 05.
Article in English | MEDLINE | ID: mdl-33063094

ABSTRACT

MOTIVATION: Infection with strains of different subtypes and the subsequent crossover reading between the two strands of genomic RNAs by host cells' reverse transcriptase are the main causes of the vast HIV-1 sequence diversity. Such inter-subtype genomic recombinants can become circulating recombinant forms (CRFs) after widespread transmissions in a population. Complete prediction of all the subtype sources of a CRF strain is a complicated machine learning problem. It is also difficult to understand whether a strain is an emerging new subtype and if so, how to accurately identify the new components of the genetic source. RESULTS: We introduce a multi-label learning algorithm for the complete prediction of multiple sources of a CRF sequence as well as the prediction of its chronological number. The prediction is strengthened by a voting of various multi-label learning methods to avoid biased decisions. In our steps, frequency and position features of the sequences are both extracted to capture signature patterns of pure subtypes and CRFs. The method was applied to 7185 HIV-1 sequences, comprising 5530 pure subtype sequences and 1655 CRF sequences. Results have demonstrated that the method can achieve very high accuracy (reaching 99%) in the prediction of the complete set of labels of HIV-1 recombinant forms. A few wrong predictions are actually incomplete predictions, very close to the complete set of genuine labels. AVAILABILITY AND IMPLEMENTATION: https://github.com/Runbin-tang/The-source-of-HIV-CRFs-prediction. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
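
The voting step mentioned in the results can be illustrated with a label-wise majority vote over several multi-label predictors. This is a generic sketch, not the authors' algorithm: the abstract does not describe the voting rule, so a simple majority threshold is assumed, and the function name and example label sets are hypothetical.

```python
from collections import Counter


def vote_labels(predictions, min_votes=None):
    """Combine multi-label predictions from several base learners.

    predictions is a list of label sets, one per learner. A label is kept
    when at least min_votes learners predict it (default: strict majority).
    """
    if not predictions:
        raise ValueError("at least one prediction is required")
    if min_votes is None:
        min_votes = len(predictions) // 2 + 1
    # Count each label once per learner.
    counts = Counter(label for predicted in predictions for label in set(predicted))
    return {label for label, votes in counts.items() if votes >= min_votes}


# Three hypothetical learners predicting the source subtypes of one sequence.
predictions = [{"A", "G"}, {"A", "G"}, {"A", "B"}]
print(sorted(vote_labels(predictions)))  # ['A', 'G']
```

With this rule, a label proposed by a single learner (here `B`) is dropped, which is one way a vote can reduce the influence of any one biased predictor.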


Subject(s)
HIV Infections, HIV-1, Genetic Variation, HIV Infections/genetics, HIV-1/genetics, Humans, Molecular Epidemiology, Phylogeny
20.
Patterns (N Y) ; 1(8): 100129, 2020 Nov 13.
Article in English | MEDLINE | ID: mdl-33294870

ABSTRACT

We discuss the validation of machine learning models, which is standard practice in determining model efficacy and generalizability. We argue that internal validation approaches, such as cross-validation and bootstrap, cannot guarantee the quality of a machine learning model due to potentially biased training data and the complexity of the validation procedure itself. To better evaluate the generalization ability of a learned model, we suggest using data sources from elsewhere as validation datasets, namely external validation. Because external validation has attracted little research attention, and well-structured, comprehensive studies are especially lacking, we discuss the necessity for external validation and propose two extensions of the external validation approach that may help reveal the true domain-relevant model from a candidate set. Moreover, we suggest a procedure to check whether a set of validation datasets is valid and introduce statistical reference points for detecting external data problems.
