Results 1 - 20 of 121
1.
BMC Bioinformatics ; 25(1): 155, 2024 Apr 20.
Article in English | MEDLINE | ID: mdl-38641616

ABSTRACT

BACKGROUND: Classification of binary data arises naturally in many clinical applications, such as patient risk stratification through ICD codes. One of the key practical challenges in data classification using machine learning is to avoid overfitting. Overfitting in supervised learning primarily occurs when a model learns random variations from noisy labels in training data rather than the underlying patterns. While traditional methods such as regularization and early stopping have demonstrated effectiveness in interpolation tasks, addressing overfitting in the classification of binary data, in which predictions always amount to extrapolation, demands extrapolation-enhanced strategies. One such approach is hybrid mechanistic/data-driven modeling, which integrates prior knowledge about input features into the learning process, enhancing the model's ability to extrapolate. RESULTS: We present NoiseCut, a Python package for noise-tolerant classification of binary data by employing a hybrid modeling approach that leverages solutions of defined max-cut problems. In a comparative analysis conducted on synthetically generated binary datasets, NoiseCut exhibits better overfitting prevention compared to the early stopping technique employed by different supervised machine learning algorithms. The noise tolerance of NoiseCut stems from a dropout strategy that leverages prior knowledge of input features and is further enhanced by the integration of max-cut problems into the learning process. CONCLUSIONS: NoiseCut is a Python package for the implementation of hybrid modeling for the classification of binary data. It facilitates the integration of mechanistic knowledge about the input features into learning from data in a structured manner and proves to be a valuable classification tool when the available training data is noisy and/or limited in size. This advantage is especially prominent in medical and biomedical applications where data scarcity and noise are common challenges. The codebase, illustrations, and documentation for NoiseCut are accessible for download at https://pypi.org/project/noisecut/. The implementation detailed in this paper corresponds to the version 0.2.1 release of the software.
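
A minimal sketch of the early-stopping baseline that this abstract compares NoiseCut against (not the NoiseCut API itself; the synthetic binary features, the label-noise rate, and the scikit-learn classifier are assumptions for illustration):

import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(500, 20))             # binary input features
y_clean = (X[:, :5].sum(axis=1) > 2).astype(int)   # hypothetical ground-truth rule
flip = rng.random(500) < 0.15                      # 15% label noise
y = np.where(flip, 1 - y_clean, y_clean)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

# Early stopping: hold out a validation fraction and stop adding trees when the
# validation score stops improving, which limits fitting to the noisy labels.
clf = GradientBoostingClassifier(n_estimators=500, validation_fraction=0.2,
                                 n_iter_no_change=10, random_state=0)
clf.fit(X_tr, y_tr)
print("test accuracy:", accuracy_score(y_te, clf.predict(X_te)))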


Subject(s)
Algorithms, Software, Humans, Supervised Machine Learning, Machine Learning
2.
J Sci Food Agric ; 104(10): 6018-6034, 2024 Aug 15.
Article in English | MEDLINE | ID: mdl-38483173

ABSTRACT

BACKGROUND: Accurate recognition and early warning of plant diseases and pests are a prerequisite for their intelligent prevention and control. Because affected plants show similar phenotypes once diseases and pests occur, and because of interference from the external environment, traditional deep learning models often face the overfitting problem in phenotype recognition of plant diseases and pests, which leads not only to slow network convergence but also to low recognition accuracy. RESULTS: Motivated by these problems, the present study proposes a deep learning model, EResNet-support vector machine (SVM), to alleviate overfitting in the recognition and classification of plant diseases and pests. First, the feature extraction capability of the model is improved by adding feature extraction layers to the convolutional neural network. Second, order-reduced modules are embedded and a sparse activation function is introduced to reduce model complexity and alleviate overfitting. Finally, a classifier fusing an SVM with fully connected layers is introduced to transform the original non-linear classification problem into a linear classification problem in high-dimensional space, further alleviating overfitting and improving the recognition accuracy for plant diseases and pests. Ablation experiments further demonstrate that the fused structure can effectively alleviate overfitting and improve recognition accuracy. Experimental recognition results for typical plant diseases and pests show that the proposed EResNet-SVM model has 99.30% test accuracy for eight conditions (seven plant diseases and one normal), which is 5.90% higher than the original ResNet18. Compared with the classic AlexNet, GoogLeNet, Xception, SqueezeNet and DenseNet201 models, the accuracy of the EResNet-SVM model improves by 5.10%, 7%, 8.10%, 6.20% and 1.90%, respectively. The testing accuracy of the EResNet-SVM model for 6 insect pests is 100%, which is 3.90% higher than that of the original ResNet18 model. CONCLUSION: This research provides not only useful references for alleviating the overfitting problem in deep learning, but also theoretical and technical support for the intelligent detection and control of plant diseases and pests. © 2024 Society of Chemical Industry.
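
A hedged sketch of the general CNN-features-plus-SVM pattern the abstract describes (not the authors' EResNet; the pretrained ResNet18 backbone, input sizes, and placeholder data are assumptions):

import torch
import torchvision.models as models
from sklearn.svm import SVC

backbone = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)
backbone.fc = torch.nn.Identity()      # drop the final FC layer; keep 512-d features
backbone.eval()

# Placeholder batch of leaf images; real use would iterate over a DataLoader.
images = torch.randn(16, 3, 224, 224)
labels = [0, 1] * 8

with torch.no_grad():
    feats = backbone(images).numpy()   # feature matrix of shape (16, 512)

svm = SVC(kernel="rbf", C=1.0)         # SVM replaces the softmax classifier head
svm.fit(feats, labels)
print(svm.predict(feats[:4]))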


Subject(s)
Deep Learning, Neural Networks, Computer, Plant Diseases, Support Vector Machine, Plant Diseases/parasitology, Plant Diseases/prevention & control, Animals, Insects, Pest Control/methods
3.
Empir Softw Eng ; 29(5): 116, 2024.
Article in English | MEDLINE | ID: mdl-39069998

ABSTRACT

Previous studies have shown that Automated Program Repair (APR) techniques suffer from the overfitting problem. Overfitting happens when a patch is run and the test suite does not reveal any error, but the patch actually does not fix the underlying bug or introduces a new defect that is not covered by the test suite. Therefore, the patches generated by APR tools need to be validated by human programmers, which can be very costly and prevents APR tool adoption in practice. Our work aims to minimize the number of plausible patches that programmers have to review, thereby reducing the time required to find a correct patch. We introduce a novel lightweight test-based patch clustering approach called xTestCluster, which clusters patches based on their dynamic behavior. xTestCluster is applied after the patch generation phase in order to analyze the generated patches from one or more repair tools and to provide more information about those patches, facilitating patch assessment. The novelty of xTestCluster lies in using information from the execution of newly generated test cases to cluster patches generated by multiple APR approaches. A cluster is formed of patches that fail on the same generated test cases. The output of xTestCluster gives developers a) a way of reducing the number of patches to analyze, as they can focus on analyzing a sample of patches from each cluster, and b) additional information (new test cases and their results) attached to each patch. After analyzing 902 plausible patches from 21 Java APR tools, our results show that xTestCluster is able to reduce the number of patches to review and analyze by a median of 50%. xTestCluster can save a significant amount of time for developers who have to review the multitude of patches generated by APR tools, and it provides them with new test cases that expose the differences in behavior between generated patches. Moreover, xTestCluster can complement other patch assessment techniques that help detect patch misclassifications.
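
A short sketch of the clustering idea described above: plausible patches are grouped by which generated test cases they fail, so a developer reviews one representative per cluster (patch and test names below are illustrative only):

from collections import defaultdict

# patch id -> set of generated test cases that fail when the patch is applied
failing_tests = {
    "patch_A": frozenset({"test_gen_3", "test_gen_7"}),
    "patch_B": frozenset({"test_gen_3", "test_gen_7"}),
    "patch_C": frozenset(),                  # passes every generated test
    "patch_D": frozenset({"test_gen_1"}),
}

clusters = defaultdict(list)
for patch, failures in failing_tests.items():
    clusters[failures].append(patch)         # same failure signature -> same cluster

for signature, patches in clusters.items():
    print(sorted(signature) or "no failures", "->", patches)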

4.
Metabolomics ; 20(1): 8, 2023 Dec 21.
Article in English | MEDLINE | ID: mdl-38127222

ABSTRACT

INTRODUCTION: In general, two characteristics are ever present in NMR-based metabolomics studies: (1) they are assays aiming to classify the samples into different groups, and (2) the number of samples is smaller than the number of features (chemical shifts). It is also common to observe imbalanced datasets due to the sampling method and/or inclusion criteria. These situations can cause overfitting. However, appropriate feature selection and classification methods can be useful to solve this issue. OBJECTIVES: To investigate the performance of metabolomics models built from combinations of feature selectors (or no feature selection) with classification algorithms, and to use the best-performing model as an NMR-based metabolomic method for prostate cancer diagnosis. METHODS: We evaluated the performance of NMR-based metabolomics models for prostate cancer diagnosis using seven feature selectors and five classification formalisms. We also obtained metabolomics models without feature selection. In this study, thirty-eight volunteers with a positive diagnosis of prostate cancer and twenty-three healthy volunteers were enrolled. RESULTS: The thirty-eight models obtained were evaluated using AUROC, accuracy, sensitivity, specificity, and kappa index values. The best result was obtained when a Genetic Algorithm was used with Linear Discriminant Analysis, yielding 0.92 sensitivity, 0.83 specificity, and 0.88 accuracy. CONCLUSION: The results show that choosing a proper feature selection method, classification model, and resampling method can avoid overfitting in a small metabolomic dataset. Furthermore, this approach would decrease the number of biopsies and optimize patient follow-up. 1H NMR-based metabolomics promises to be a non-invasive tool in prostate cancer diagnosis.
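
A hedged sketch of the key safeguard behind such models: feature selection kept inside the cross-validation loop of a small, wide metabolomics matrix (a univariate selector stands in for the paper's Genetic Algorithm; the simulated data dimensions are assumptions):

import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import cross_val_score, StratifiedKFold

rng = np.random.default_rng(1)
X = rng.normal(size=(61, 300))           # 61 spectra x 300 chemical-shift bins
y = np.array([1] * 38 + [0] * 23)        # 38 cancer, 23 healthy, as in the study

pipe = Pipeline([
    ("select", SelectKBest(f_classif, k=20)),   # selector re-fit inside each fold
    ("lda", LinearDiscriminantAnalysis()),
])
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=1)
scores = cross_val_score(pipe, X, y, cv=cv, scoring="roc_auc")
print("AUROC per fold:", scores.round(2))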


Asunto(s)
Quimiometría , Neoplasias de la Próstata , Masculino , Humanos , Metabolómica , Neoplasias de la Próstata/diagnóstico , Imagen por Resonancia Magnética , Algoritmos
5.
Proc Natl Acad Sci U S A ; 117(48): 30063-30070, 2020 12 01.
Article in English | MEDLINE | ID: mdl-32332161

ABSTRACT

The phenomenon of benign overfitting is one of the key mysteries uncovered by deep learning methodology: deep neural networks seem to predict well, even with a perfect fit to noisy training data. Motivated by this phenomenon, we consider when a perfect fit to training data in linear regression is compatible with accurate prediction. We give a characterization of linear regression problems for which the minimum norm interpolating prediction rule has near-optimal prediction accuracy. The characterization is in terms of two notions of the effective rank of the data covariance. It shows that overparameterization is essential for benign overfitting in this setting: the number of directions in parameter space that are unimportant for prediction must significantly exceed the sample size. By studying examples of data covariance properties that this characterization shows are required for benign overfitting, we find an important role for finite-dimensional data: the accuracy of the minimum norm interpolating prediction rule approaches the best possible accuracy for a much narrower range of properties of the data distribution when the data lie in an infinite-dimensional space vs. when the data lie in a finite-dimensional space with dimension that grows faster than the sample size.
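
A worked numerical sketch of the object studied above, the minimum-norm interpolating least-squares rule in an overparameterized linear regression (dimensions and noise level are arbitrary):

import numpy as np

rng = np.random.default_rng(0)
n, p = 50, 500                                 # far more directions than samples
X = rng.normal(size=(n, p))
w_true = np.zeros(p); w_true[:5] = 1.0         # only a few directions matter
y = X @ w_true + 0.5 * rng.normal(size=n)      # noisy training labels

# Minimum L2-norm solution among all w that fit the training data exactly.
w_hat = np.linalg.pinv(X) @ y

print("training residual:", np.linalg.norm(X @ w_hat - y).round(6))   # ~0: interpolation
X_test = rng.normal(size=(1000, p))
y_test = X_test @ w_true
print("test MSE:", np.mean((X_test @ w_hat - y_test) ** 2).round(3))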

6.
Multivariate Behav Res ; 58(4): 723-742, 2023.
Article in English | MEDLINE | ID: mdl-36223076

ABSTRACT

Nonlinear mixed-effects models (NLMEMs) allow researchers to model curvilinear patterns of growth, but there is ambiguity as to what functional form the data follow. Often, researchers fit multiple nonlinear functions to data and use model selection criteria to decide which functional form fits the data "best." Frequently used model selection criteria only account for the number of parameters in a model but overlook the complexity of intrinsically nonlinear functional forms. This can lead to overfitting and hinder the generalizability and reproducibility of results. The primary goal of this study was to evaluate the performance of eight model selection criteria via a Monte Carlo simulation study and assess under what conditions these criteria are sensitive to model overfitting as it relates to functional form complexity. Results highlighted criteria with the potential to capture overfitting for intrinsically nonlinear functional forms for NLMEMs. Information criteria and the stochastic information complexity criterion recovered the true model more often than the average or conditional concordance correlation. Results also suggest that the amount of residual variance and sample size have an impact on model selection for NLMEMs. Implications for future research and recommendations for application are also provided.
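
For reference, a minimal sketch of how two of the criteria discussed above are computed from a fitted model's maximized log-likelihood; the log-likelihoods and parameter counts below are hypothetical:

import numpy as np

def aic(logL, k):
    return 2 * k - 2 * logL

def bic(logL, k, n):
    return k * np.log(n) - 2 * logL

n = 200  # hypothetical sample size
print("exponential:", aic(-310.4, k=4), bic(-310.4, k=4, n=n))
print("logistic:   ", aic(-309.8, k=6), bic(-309.8, k=6, n=n))
# Lower is better; both penalties count parameters only, so neither reflects the
# intrinsic complexity of a nonlinear functional form, which is the gap the study probes.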

7.
Sensors (Basel) ; 23(23)2023 Nov 27.
Article in English | MEDLINE | ID: mdl-38067822

ABSTRACT

For a fiber optic gyroscope, thermal deformation of the fiber coil can introduce additional thermal-induced phase errors, commonly referred to as thermal errors. Implementing effective thermal error compensation techniques is crucial to addressing this issue. These techniques operate based on the real-time sensing of thermal errors and subsequent correction within the output signal. Given the challenge of directly isolating thermal errors from the gyroscope's output signal, predicting thermal errors based on temperature becomes necessary. To establish a mathematical model correlating the temperature and thermal errors, this study measured synchronized data of phase errors and angular velocity for the fiber coil under various temperature conditions, aiming to model it using data-driven methods. However, due to the difficulty of conducting tests and the limited number of data samples, direct engagement in data-driven modeling poses a risk of severe overfitting. To overcome this challenge, we propose a modeling algorithm that effectively integrates theoretical models with data, referred to as the TD-model in this paper. Initially, a theoretical analysis of the phase errors caused by thermal deformation of the fiber coil is performed. Subsequently, critical parameters, such as the thermal expansion coefficient, are determined, leading to the establishment of a theoretical model. Finally, the theoretical analysis model is incorporated as a regularization term and combined with the test data to jointly participate in the regression of model coefficients. Through experimental comparative analysis, it is shown that, relative to ordinary regression models, the TD-model effectively mitigates overfitting caused by the limited number of samples, resulting in a substantial 58% improvement in predictive accuracy.
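
A hedged sketch of the general idea of using a theoretical model as a regularization term (not the paper's exact TD-model; the closed-form penalty on deviation from theory-derived coefficients is an assumption for illustration):

import numpy as np

def theory_regularized_fit(X, y, w_theory, lam):
    """Minimize ||y - X w||^2 + lam * ||w - w_theory||^2 in closed form."""
    p = X.shape[1]
    A = X.T @ X + lam * np.eye(p)
    b = X.T @ y + lam * w_theory
    return np.linalg.solve(A, b)

rng = np.random.default_rng(0)
X = rng.normal(size=(30, 3))                 # few test samples, as in the study
w_star = np.array([2.0, -1.0, 0.5])
y = X @ w_star + 0.3 * rng.normal(size=30)

w_theory = np.array([1.8, -0.9, 0.4])        # assumed output of the thermal analysis
print("plain least squares:", np.linalg.lstsq(X, y, rcond=None)[0].round(2))
print("theory-regularized: ", theory_regularized_fit(X, y, w_theory, lam=5.0).round(2))

With few samples, shrinking toward the theoretical coefficients rather than toward zero is what curbs overfitting without discarding the mechanistic knowledge.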

8.
Sensors (Basel) ; 23(17)2023 Aug 23.
Article in English | MEDLINE | ID: mdl-37687802

ABSTRACT

Temperature sensors are widely used in industrial production and scientific research, and accurate temperature measurement is crucial for ensuring the quality and safety of production processes. To improve the accuracy and stability of temperature sensors, this paper proposed using an artificial neural network (ANN) model for calibration and explored the feasibility and effectiveness of using ANNs to calibrate temperature sensors. The experiment collected multiple sets of temperature data from standard temperature sensors in different environments and compared the calibration results of the ANN model, linear regression, and polynomial regression. The experimental results show that calibration using the ANN improved the accuracy of the temperature sensors. Compared with traditional linear regression and polynomial regression, the ANN model produced more accurate calibration. However, overfitting may occur due to a small sample size or a large amount of noise. Therefore, the key to improving calibration using the ANN model is to design reasonable training samples and adjust the model parameters. The results of this study are important for practical applications and provide reliable technical support for industrial production and scientific research.
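
A hedged sketch of the comparison described above, calibrating raw sensor readings against a reference with linear, polynomial, and small-ANN models (the synthetic calibration data are an assumption):

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(0)
raw = np.linspace(0, 100, 60)                                     # sensor readings
ref = raw + 0.5 * np.sin(raw / 8.0) + 0.1 * rng.normal(size=60)   # reference values

X = raw.reshape(-1, 1)
lin = LinearRegression().fit(X, ref)
poly = np.polyfit(raw, ref, deg=3)                                # cubic polynomial
ann = MLPRegressor(hidden_layer_sizes=(16, 16), max_iter=5000,
                   random_state=0).fit(X, ref)

probe = np.array([[25.0], [73.0]])
print("linear:    ", lin.predict(probe).round(2))
print("polynomial:", np.polyval(poly, probe.ravel()).round(2))
print("ANN:       ", ann.predict(probe).round(2))
# With few calibration points or noisy references, the ANN can overfit; a held-out
# validation set is the safeguard the abstract points to.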

9.
Knowl Inf Syst ; 65(5): 2017-2042, 2023.
Article in English | MEDLINE | ID: mdl-36683607

ABSTRACT

An obvious defect of the extreme learning machine (ELM) is that its prediction performance is sensitive to the random initialization of input-layer weights and hidden-layer biases. To make ELM insensitive to random initialization, GPRELM adopts the simple and effective strategy of integrating Gaussian process regression into ELM. However, there is a serious overfitting problem in kernel-based GPRELM (kGPRELM). In this paper, we investigate the theoretical reasons for the overfitting of kGPRELM and further propose a correlation-based GPRELM (cGPRELM), which uses a correlation coefficient to measure the similarity between two different hidden-layer output vectors. cGPRELM reduces the likelihood that the covariance matrix becomes an identity matrix when the number of hidden-layer nodes is increased, effectively controlling overfitting. Furthermore, cGPRELM works well for improper initialization intervals where ELM and kGPRELM fail to provide good predictions. The experimental results on real classification and regression data sets demonstrate the feasibility and superiority of cGPRELM, as it not only achieves better generalization performance but also has a lower computational complexity.
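
A hedged sketch of the ingredient the abstract highlights: an ELM-style random hidden layer and a correlation-coefficient similarity between hidden-layer output vectors (the network sizes and activation are assumptions; this is not the full cGPRELM):

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 8))          # 100 samples, 8 input features
L = 200                                # number of hidden-layer nodes
W = rng.normal(size=(8, L))            # random input-layer weights
b = rng.normal(size=L)                 # random hidden-layer biases
H = np.tanh(X @ W + b)                 # hidden-layer output matrix, one row per sample

K_corr = np.corrcoef(H)                # correlation between hidden-layer output vectors
print(K_corr.shape)
print(np.round(K_corr[:3, :3], 2))
# The paper's argument is that this correlation-based similarity is less prone than a
# kernel on H to collapsing toward an identity matrix as L grows, which controls overfitting.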

10.
Entropy (Basel) ; 25(3)2023 Mar 16.
Article in English | MEDLINE | ID: mdl-36981400

ABSTRACT

Takeuchi's Information Criterion (TIC) was introduced as a generalization of Akaike's Information Criterion (AIC) in 1976. Though TIC avoids many of AIC's strict requirements and assumptions, it is only rarely used. One of the reasons for this is that the trace term introduced in TIC is numerically unstable and computationally expensive to compute. An extension of TIC called ICE was published in 2021, which allows this trace term to be used for model fitting (where it was primarily compared to L2 regularization) instead of just model selection. That paper also examined numerically stable and computationally efficient approximations that could be applied to TIC or ICE, but these approximations were only examined on small synthetic models. This paper applies and extends these approximations to larger models on real datasets for both TIC and ICE. This work shows that practical models may use TIC and ICE in a numerically stable way to achieve superior results at a reasonable computational cost.

11.
J Proteome Res ; 21(9): 2071-2074, 2022 09 02.
Article in English | MEDLINE | ID: mdl-36004690

ABSTRACT

This review "teaches" researchers how to make their lackluster proteomics data look really impressive, by applying an inappropriate but pervasive strategy that selects features in a biased manner. The strategy is demonstrated and used to build a classification model with an accuracy of 92% and AUC of 0.98, while relying completely on random numbers for the data set. This "lesson" in data processing is not to be practiced by anyone; on the contrary, it is meant to be a cautionary tale showing that very unreliable results are obtained when a biomarker panel is generated first, using all the available data, and then tested by cross-validation. Data scientists describe the error committed in this scenario as having test data leak into the feature selection step, and it is currently a common mistake in proteomics biomarker studies that rely on machine learning. After the demonstration, advice is provided about how machine learning methods can be applied to proteomics data sets without generating artificially inflated accuracies.
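
A short sketch reproducing the cautionary experiment on random numbers: selecting a "biomarker panel" from all samples first and then cross-validating inflates accuracy, while keeping selection inside each fold does not (data sizes are illustrative):

import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(40, 5000))           # pure-noise "proteomics" matrix
y = np.array([0, 1] * 20)                 # arbitrary group labels

# Biased: panel chosen from ALL data, then "validated" by cross-validation.
leaky_mask = SelectKBest(f_classif, k=20).fit(X, y).get_support()
leaky = cross_val_score(LogisticRegression(max_iter=1000), X[:, leaky_mask], y, cv=5)

# Unbiased: selection re-fit on the training portion of each fold only.
pipe = Pipeline([("select", SelectKBest(f_classif, k=20)),
                 ("clf", LogisticRegression(max_iter=1000))])
honest = cross_val_score(pipe, X, y, cv=5)

print("leaky selection accuracy :", leaky.mean().round(2))    # deceptively high
print("honest selection accuracy:", honest.mean().round(2))   # near chance (~0.5)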


Subject(s)
Machine Learning, Proteomics, Biomarkers, Proteomics/methods
12.
Biotechnol Bioeng ; 119(9): 2423-2436, 2022 09.
Article in English | MEDLINE | ID: mdl-35680641

ABSTRACT

A coculture of Syntrophobacter fumaroxidans and Methanospirillum hungatei was modeled using four biokinetic models, which differed only in the functions used to describe the growth yields (dynamic or constant) and the hydrogen inhibition function (noncompetitive or based on thermodynamics). First, a batch experiment was used to train the model and analyze the fitted parameters. Two fitting procedures were followed by minimizing the error on different indicators. Second, a chemostat experiment was used as a test data set to assess the predictive power of the models. Overall, the four models were able to accurately fit the training data set following both fitting procedures. However, some parameters fitted with the ADM1-like model differed significantly from values reported in the literature and were dependent on the fitting procedure. When applied to the test data set, this model systematically resulted in positive Gibbs free energy change values for propionate oxidation, in contradiction with the second law of thermodynamics. In contrast, the parameters fitted with the model including both a thermodynamic-based inhibition function and a dynamic computation of growth yields were more consistent with values reported in the literature and repeatable whatever the fitting procedure. The results highlight the potential of implementing thermodynamic-based functions in biokinetic models.


Subject(s)
Methanospirillum, Propionates, Anaerobiosis, Coculture Techniques, Methane, Oxidation-Reduction, Thermodynamics
13.
Proc Natl Acad Sci U S A ; 116(24): 11624-11629, 2019 06 11.
Article in English | MEDLINE | ID: mdl-31127041

ABSTRACT

Deep neural networks have achieved state-of-the-art accuracy at classifying molecules with respect to whether they bind to specific protein targets. A key breakthrough would occur if these models could reveal the fragment pharmacophores that are causally involved in binding. Extracting chemical details of binding from the networks could enable scientific discoveries about the mechanisms of drug actions. However, doing so requires shining light into the black box that is the trained neural network model, a task that has proved difficult across many domains. Here we show how the binding mechanism learned by deep neural network models can be interrogated, using a recently described attribution method. We first work with carefully constructed synthetic datasets, in which the molecular features responsible for "binding" are fully known. We find that networks that achieve perfect accuracy on held-out test datasets still learn spurious correlations, and we are able to exploit this nonrobustness to construct adversarial examples that fool the model. This makes these models unreliable for accurately revealing information about the mechanisms of protein-ligand binding. In light of our findings, we prescribe a test that checks whether a hypothesized mechanism can be learned. If the test fails, it indicates that the model must be simplified or regularized and/or that the training dataset requires augmentation.


Subject(s)
Protein Binding/physiology, Proteins/chemistry, Algorithms, Ligands, Machine Learning, Models, Chemical, Neural Networks, Computer
14.
BMC Bioinformatics ; 22(Suppl 5): 84, 2021 Nov 08.
Article in English | MEDLINE | ID: mdl-34749634

ABSTRACT

BACKGROUND: Doctors can detect symptoms of diabetic retinopathy (DR) early by using retinal ophthalmoscopy, and they can improve diagnostic efficiency with the assistance of deep learning to select treatments and support personnel workflow. Conventionally, most deep learning methods for DR diagnosis categorize retinal ophthalmoscopy images into training and validation data sets according to the 80/20 rule, and they use the synthetic minority oversampling technique (SMOTE) in data processing (e.g., rotating, scaling, and translating training images) to increase the number of training samples. Oversampling during training may lead to overfitting of the training model, so untrained or unverified images can yield erroneous predictions. Although the accuracy of prediction results is 90%-99%, this overfitting of training data may distort training module variables. RESULTS: This study uses a 2-stage training method to address the overfitting problem. In the training phase, Learning module 1 was used to discriminate DR from no-DR, and Learning module 2 was trained on SMOTE synthetic datasets to classify mild NPDR, moderate NPDR, severe NPDR, and proliferative DR. Both modules also used early stopping and data dividing methods to reduce the overfitting introduced by oversampling. In the test phase, we used the DIARETDB0, DIARETDB1, eOphtha, MESSIDOR, and DRIVE datasets to evaluate the performance of the trained network. The prediction accuracy reached 85.38%, 84.27%, 85.75%, 86.73%, and 92.5%, respectively. CONCLUSIONS: Based on the experiments, a general deep learning model for detecting DR was developed, and it could be used with all DR databases. We provide a simple method of addressing the imbalance of DR databases, and this method can be used with other medical images.
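
A hedged sketch of one element of the pipeline above: applying SMOTE to the training split only, so the validation data stay untouched and oversampling-driven overfitting is visible (plain feature vectors stand in for retinal images, and the classifier is an assumption):

import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from imblearn.over_sampling import SMOTE

rng = np.random.default_rng(0)
X = rng.normal(size=(600, 50))
y = np.r_[np.zeros(500), np.ones(100)].astype(int)    # imbalanced: few positives

X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.2, stratify=y,
                                             random_state=0)

X_res, y_res = SMOTE(random_state=0).fit_resample(X_tr, y_tr)  # balance training only
clf = RandomForestClassifier(random_state=0).fit(X_res, y_res)
print("validation accuracy:", clf.score(X_val, y_val))
# Oversampling before the split would leak synthetic neighbors of validation points
# into training and overstate performance.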


Subject(s)
Deep Learning, Diabetes Mellitus, Diabetic Retinopathy, Databases, Factual, Diabetic Retinopathy/diagnosis, Humans, Retina
15.
Magn Reson Med ; 86(2): 1110-1124, 2021 08.
Article in English | MEDLINE | ID: mdl-33768579

ABSTRACT

PURPOSE: Diffusional kurtosis metrics show high performance for detecting pathological changes and are therefore expected to be disease biomarkers. Kurtosis maps, however, tend to be noisy. The maps' visual quality is crucial for disease diagnosis, even when kurtosis is being used quantitatively. A Bayesian method was proposed to curtail the large statistical error inherent in kurtosis estimation while maintaining potential application to biomarkers. THEORY: Gaussian priors are determined from first-step estimations implemented using the least-square method (LSM). The likelihood-function variance is determined from the residuals of the estimation. Although the proposed approach is similar to a regularized LSM, regularization parameters do not have to be artificially adjusted. An appropriate balance between denoising and preventing false shrinkages of metric dispersions is automatically achieved. METHODS: Map qualities achieved using the conventional and proposed methods were compared. The receiver-operating characteristic analysis was performed for glioma-grade differentiation using simulated low- and high-grade glioma DWI datasets. Noninferiority of the proposed method was tested for areas under the curves (AUCs). RESULTS: The noisier the conventional maps, the better the proposed Bayesian method improved them. Noninferiority of the proposed method was confirmed by AUC tests for all kurtosis-related metrics. Superiority of the proposed method was also established for several metrics. CONCLUSIONS: The proposed approach improved noisy kurtosis maps while maintaining their performances as biomarkers without increasing data acquisition requirements or arbitrarily choosing LSM regularization parameters. This approach may enable the use of higher-order terms in diffusional kurtosis imaging (DKI) fitting functions by suppressing overfitting, thereby improving the DKI-estimation accuracy.


Subject(s)
Diffusion Magnetic Resonance Imaging, Glioma, Bayes Theorem, Diffusion Tensor Imaging, Glioma/diagnostic imaging, Humans, ROC Curve
16.
Entropy (Basel) ; 23(11)2021 Oct 28.
Article in English | MEDLINE | ID: mdl-34828117

ABSTRACT

Modern computational models in supervised machine learning are often highly parameterized universal approximators. As such, the values of the parameters are unimportant, and only the out-of-sample performance is considered. On the other hand, much of the literature on model estimation assumes that the parameters themselves have intrinsic value, and thus is concerned with the bias and variance of parameter estimates, which may not have any simple relationship to out-of-sample model performance. Therefore, within supervised machine learning, heavy use is made of ridge regression (i.e., L2 regularization), which requires the estimation of hyperparameters and can be rendered ineffective by certain model parameterizations. We introduce an objective function, which we refer to as Information-Corrected Estimation (ICE), that reduces KL divergence based generalization error for supervised machine learning. ICE attempts to directly maximize a corrected likelihood function as an estimator of the KL divergence. Such an approach is proven, theoretically, to be effective for a wide class of models, with only mild regularity restrictions. Under finite sample sizes, this corrected estimation procedure is shown experimentally to lead to significant reductions in generalization error compared to maximum likelihood estimation and L2 regularization.

17.
Entropy (Basel) ; 23(9)2021 Sep 11.
Article in English | MEDLINE | ID: mdl-34573827

ABSTRACT

Model selection criteria are widely used to identify the model that best represents the data among a set of potential candidates. Among the different model selection criteria, the Bayesian information criterion (BIC) and the Akaike information criterion (AIC) are the most popular and best understood. In the derivation of these indicators, it was assumed that the model's dependent variables have already been properly identified and that the entries are not affected by significant uncertainties. These are issues that can become quite serious when investigating complex systems, especially when variables are highly correlated and the measurement uncertainties associated with them are not negligible. More sophisticated versions of these criteria, capable of better detecting spurious relations between variables when non-negligible noise is present, are proposed in this paper. Their derivation starts from a Bayesian statistics framework and adds an a priori Chi-squared probability distribution function of the model, dependent on a specifically defined information theoretic quantity that takes into account the redundancy between the dependent variables. The performance of the proposed versions of these criteria is assessed through a series of systematic simulations, using synthetic data for various classes of functions and noise levels. The results show that the upgraded formulation of the criteria clearly outperforms the traditional ones in most of the cases reported.

18.
J Struct Biol ; 211(2): 107545, 2020 08 01.
Article in English | MEDLINE | ID: mdl-32534144

ABSTRACT

Single particle analysis has become a key structural biology technique. Experimental images are extremely noisy, and during iterative refinement it is possible to stably incorporate noise into the reconstruction. Such "over-fitting" can lead to misinterpretation of the structure and flawed biological results. Several strategies are routinely used to prevent over-fitting, the most common being independent refinement of two sides of a split dataset. In this study, we show that over-fitting remains an issue within regions of low local signal-to-noise, despite independent refinement of half datasets. We propose a modification of the refinement process through the application of a local signal-to-noise filter: SIDESPLITTER. We show that our approach can reduce over-fitting for both idealised and experimental data while maintaining independence between the two sides of a split refinement. SIDESPLITTER refinement leads to improved density, and can also lead to improvement of the final resolution in extreme cases where datasets are prone to severe over-fitting, such as small membrane proteins.


Subject(s)
Imaging, Three-Dimensional, Membrane Proteins/ultrastructure, Models, Molecular, Single Molecule Imaging/methods, Algorithms, Membrane Proteins/chemistry, Signal-To-Noise Ratio, Software
19.
J Biomed Inform ; 105: 103408, 2020 05.
Article in English | MEDLINE | ID: mdl-32173502

ABSTRACT

Limited sample sizes can lead to spurious modeling findings in biomedical research. The objective of this work is to present a new method to generate synthetic populations (SPs) from limited samples using matched case-control data (n = 180 pairs), considered as two separate limited samples. SPs were generated with multivariate kernel density estimations (KDEs) with unconstrained bandwidth matrices. We included four continuous variables and one categorical variable for each individual. Bandwidth matrices were determined with Differential Evolution (DE) optimization by covariance comparisons. Four synthetic samples (n = 180) were derived from their respective SPs. Similarity between the observed and synthetic samples was assessed by testing whether their empirical probability density functions (EPDFs) were similar. EPDFs were compared with the maximum mean discrepancy (MMD) test statistic based on the Kernel Two-Sample Test. To evaluate similarity within a modeling context, EPDFs derived from the Principal Component Analysis (PCA) scores and residuals were summarized with the distance to the model in X-space (DModX) as additional comparisons. Four SPs were generated from each sample. The probability of selecting a replicate when randomly constructing synthetic samples (n = 180) was infinitesimally small. MMD tests indicated that the observed sample EPDFs were similar to the respective synthetic EPDFs. For the samples, PCA scores and residuals did not deviate significantly when compared with their respective synthetic samples. The feasibility of this approach was demonstrated by producing synthetic data at the individual level, statistically similar to the observed samples. The methodology coupled KDE with DE optimization and deployed novel similarity metrics derived from PCA. This approach could be used to generate larger-sized synthetic samples. To develop this approach into a research tool for data exploration purposes, additional evaluation with increased dimensionality is required. Moreover, given a fully specified population, the degree to which individuals can be discarded while synthesizing the respective population accurately will be investigated. When these objectives are addressed, comparisons with other techniques such as bootstrapping will be required for a complete evaluation.
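
A hedged sketch of the core generation step described above, using scipy's default-bandwidth KDE in place of the paper's DE-optimized unconstrained bandwidth matrix (the observed sample is simulated):

import numpy as np
from scipy.stats import gaussian_kde

rng = np.random.default_rng(0)
observed = rng.multivariate_normal(mean=[0, 1, 2, 3],
                                   cov=np.diag([1.0, 2.0, 0.5, 1.5]),
                                   size=180).T        # gaussian_kde expects (d, n)

kde = gaussian_kde(observed)                          # multivariate KDE of the sample
synthetic = kde.resample(180)                         # synthetic sample of the same size

print("observed means :", observed.mean(axis=1).round(2))
print("synthetic means:", synthetic.mean(axis=1).round(2))
print("max covariance gap:", np.abs(np.cov(observed) - np.cov(synthetic)).max().round(2))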


Subject(s)
Research Design, Case-Control Studies, Humans, Principal Component Analysis, Sample Size
20.
BMC Med ; 17(1): 230, 2019 12 16.
Article in English | MEDLINE | ID: mdl-31842878

ABSTRACT

BACKGROUND: The assessment of calibration performance of risk prediction models based on regression or more flexible machine learning algorithms receives little attention. MAIN TEXT: Herein, we argue that this needs to change immediately because poorly calibrated algorithms can be misleading and potentially harmful for clinical decision-making. We summarize how to avoid poor calibration at algorithm development and how to assess calibration at algorithm validation, emphasizing balance between model complexity and the available sample size. At external validation, calibration curves require sufficiently large samples. Algorithm updating should be considered for appropriate support of clinical practice. CONCLUSION: Efforts are required to avoid poor calibration when developing prediction models, to evaluate calibration when validating models, and to update models when indicated. The ultimate aim is to optimize the utility of predictive analytics for shared decision-making and patient counseling.
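
A brief sketch of the calibration assessment argued for above: a reliability curve comparing observed event rates with mean predicted risks on held-out data (the simulated data and logistic model are placeholders):

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.calibration import calibration_curve

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.5, random_state=0)

model = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
risk = model.predict_proba(X_te)[:, 1]

# A well-calibrated model tracks the diagonal: observed rate ~ predicted risk per bin.
# Sufficiently large validation samples are needed to keep the bins stable.
obs_rate, mean_pred = calibration_curve(y_te, risk, n_bins=10)
for p, o in zip(mean_pred, obs_rate):
    print(f"predicted {p:.2f}  observed {o:.2f}")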


Subject(s)
Calibration/standards, Machine Learning/standards, Predictive Value of Tests, Adult, Aged, Algorithms, Humans, Male, Middle Aged