ABSTRACT
MOTIVATION: Automated machine learning (AutoML) solutions can bridge the gap between new computational advances and their real-world applications by enabling experimental scientists to build their own custom models. We examine different steps in the development life-cycle of peptide bioactivity binary predictors and identify key steps where automation can not only result in a more accessible method, but also in more robust and interpretable evaluation, leading to more trustworthy models.

RESULTS: We present a new automated method for drawing negative peptides that achieves a better balance between specificity and generalization than current alternatives. We study the effect of homology-based partitioning when generating the training and testing subsets and demonstrate that model performance is overestimated when no such homology correction is used, indicating that prior studies may have overestimated the performance of their models on new peptide sequences. We also conduct a systematic analysis of different protein language models as peptide representation methods and find that they can serve as better descriptors than a naive alternative, but that there is no significant difference across models with different sizes or algorithms. Finally, we demonstrate that an ensemble of optimized traditional machine learning algorithms can compete with more complex neural network models, while being more computationally efficient. We integrate these findings into AutoPeptideML, an easy-to-use AutoML tool that allows researchers without a computational background to build new predictive models for peptide bioactivity in a matter of minutes.

AVAILABILITY AND IMPLEMENTATION: Source code, documentation, and data are available at https://github.com/IBM/AutoPeptideML, with a dedicated web server at http://peptide.ucd.ie/AutoPeptideML. A static version of the software to ensure reproducibility of the results is available at https://zenodo.org/records/13363975.
Subject(s)
Algorithms; Machine Learning; Peptides; Peptides/chemistry; Software; Computational Biology/methods; Neural Networks, Computer; Databases, Protein

ABSTRACT
To protect vital health program funds from being paid out for services that are wasteful and inconsistent with medical practice, government healthcare insurance programs need to validate the integrity of claims submitted by providers for reimbursement. However, due to the complexity of healthcare billing policies and the lack of coded rules, maintaining "integrity" is a labor-intensive task that is often narrow in scope and expensive. We propose an approach that combines deep learning with an ontology to support the extraction of actionable knowledge about benefit rules from regulatory healthcare policy text. We demonstrate its feasibility even when only a small amount of ground-truth labeled data provided by policy investigators is available. Leveraging deep learning and rich ontological information enables the system to learn from human corrections and capture benefit rules from policy text more accurately than a deterministic approach based on pre-defined textual and semantic patterns.