ABSTRACT
BACKGROUND: Classification of binary data arises naturally in many clinical applications, such as patient risk stratification through ICD codes. One of the key practical challenges in data classification using machine learning is to avoid overfitting. Overfitting in supervised learning primarily occurs when a model learns random variations from noisy labels in training data rather than the underlying patterns. While traditional methods such as regularization and early stopping have demonstrated effectiveness in interpolation tasks, addressing overfitting in the classification of binary data, in which predictions always amount to extrapolation, demands extrapolation-enhanced strategies. One such approach is hybrid mechanistic/data-driven modeling, which integrates prior knowledge on input features into the learning process, enhancing the model's ability to extrapolate. RESULTS: We present NoiseCut, a Python package for noise-tolerant classification of binary data by employing a hybrid modeling approach that leverages solutions of defined max-cut problems. In a comparative analysis conducted on synthetically generated binary datasets, NoiseCut exhibits better overfitting prevention compared to the early stopping technique employed by different supervised machine learning algorithms. The noise tolerance of NoiseCut stems from a dropout strategy that leverages prior knowledge of input features and is further enhanced by the integration of max-cut problems into the learning process. CONCLUSIONS: NoiseCut is a Python package for the implementation of hybrid modeling for the classification of binary data. It facilitates the integration of mechanistic knowledge on the input features into learning from data in a structured manner and proves to be a valuable classification tool when the available training data is noisy and/or limited in size. This advantage is especially prominent in medical and biomedical applications where data scarcity and noise are common challenges. 
The codebase, illustrations, and documentation for NoiseCut are accessible for download at https://pypi.org/project/noisecut/ . The implementation detailed in this paper corresponds to the version 0.2.1 release of the software.
Subject(s)
Algorithms , Software , Humans , Supervised Machine Learning , Machine Learning
ABSTRACT
Itaconic acid is a platform chemical with a range of applications in polymer synthesis and is also discussed for biofuel production. While produced in industry from glucose or sucrose, co-feeding of glucose and acetate was recently discussed to increase itaconic acid production by the smut fungus Ustilago maydis. In this study, we investigate the optimal co-feeding conditions by interlocking experimental and computational methods. Flux balance analysis indicates that acetate improves the itaconic acid yield up to a share of 40% acetate on a carbon molar basis. A design of experiment results in the maximum yield of 0.14 itaconic acid per carbon source from 100 g L-1 glucose and 12 g L-1 acetate. The yield is improved by around 22% when compared to feeding of glucose as sole carbon source. To further improve the yield, gene deletion targets are discussed that were identified using the metabolic optimization tool OptKnock. The study contributes ideas to reduce land use for biotechnology by incorporating acetate as co-substrate, a C2-carbon source that is potentially derived from carbon dioxide.
Subject(s)
Glucose , Models, Biological , Succinates , Glucose/metabolism , Succinates/metabolism , Ustilago/metabolism , Ustilago/genetics , Basidiomycota
ABSTRACT
The biotechnological production of methyl ketones is a sustainable alternative to fossil-derived chemical production. To date, the best host for microbial production of methyl ketones is a genetically engineered Pseudomonas taiwanensis VLB120 ∆6 pProd strain, achieving yields of 101 mg g-1 on glucose in batch cultivations. For competitiveness with the petrochemical production pathway, however, higher yields are necessary. Co-feeding can improve the yield by fitting the carbon-to-energy ratio to the organism and the target product. In this work, we developed co-feeding strategies for P. taiwanensis VLB120 ∆6 pProd by combined metabolic modeling and experimental work. In a first step, we conducted flux balance analysis with an expanded genome-scale metabolic model of iJN1463 and found ethanol as the most promising among five co-substrates. Next, we performed cultivations with ethanol and found the highest reported yield in batch production of methyl ketones with P. taiwanensis VLB120 to date, namely, 154 mg g-1 methyl ketones. However, ethanol is toxic to the cell, which is reflected in a lower substrate consumption and lower product concentrations when compared to production on glucose. Hence, we propose co-feeding ethanol with glucose and find that, indeed, higher concentrations than in ethanol-fed cultivation (0.84 g Laq-1 with glucose and ethanol as opposed to 0.48 g Laq-1 with only ethanol) were achieved, with a yield of 85 mg g-1. In a last step, comparing experimental with computational results suggested the potential for improving the methyl ketone yield by fed-batch cultivation, in which cell growth and methyl ketone production are separated into two phases employing optimal ethanol to glucose ratios. ONE-SENTENCE SUMMARY: By combining computational and experimental work, we demonstrate that feeding ethanol in addition to glucose improves the yield of biotechnologically produced methyl ketones.
Subject(s)
Acetone , Biotechnology , Carbon , Ethanol , Glucose
ABSTRACT
Several mathematical models to predict tumor growth over time have been developed in the last decades. A central aspect of such models is the interaction of tumor cells with immune effector cells. The Kuznetsov model (Kuznetsov et al. in Bull Math Biol 56(2):295-321, 1994) is the most prominent of these models and has been used as a basis for many other related models and theoretical studies. However, none of these models have been validated with large-scale real-world data of human patients treated with cancer immunotherapy. In addition, parameter estimation of these models remains a major bottleneck on the way to model-based and data-driven medical treatment. In this study, we quantitatively fit Kuznetsov's model to a large dataset of 1472 patients, of which 210 patients have more than six data points, by estimating the model parameters of each patient individually. We also conduct a global practical identifiability analysis for the estimated parameters. We thus demonstrate that several combinations of parameter values could lead to accurate data fitting. This opens the potential for global parameter estimation of the model, in which the values of all or some parameters are fixed for all patients. Furthermore, by omitting the last two or three data points, we show that the model can be extrapolated and predict future tumor dynamics. This paves the way for a more clinically relevant application of mathematical tumor modeling, in which the treatment strategy could be adjusted in advance according to the model's future predictions.
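For orientation, the two-equation tumor and effector-cell model referred to above can be sketched as follows. The nondimensional form and the parameter values are the ones commonly quoted from Kuznetsov et al. (1994) and are assumptions here, not the patient-specific estimates of this study; the function names and the fixed-step RK4 integrator are hypothetical choices made only to keep the sketch dependency-free.

```python
from typing import Callable, Dict, List, Tuple

State = Tuple[float, float]  # (effector cells x, tumor cells y), nondimensional

# Assumed values: the nondimensional set commonly quoted from Kuznetsov et al. (1994).
PARAMS: Dict[str, float] = {
    "sigma": 0.1181, "rho": 1.131, "eta": 20.19, "mu": 0.00311,
    "delta": 0.3743, "alpha": 1.636, "beta": 0.002,
}


def kuznetsov_rhs(state: State, p: Dict[str, float] = PARAMS) -> State:
    """Right-hand side of the nondimensional effector/tumor system."""
    x, y = state
    # Effector cells: constant influx, tumor-stimulated recruitment,
    # inactivation by tumor contact, natural decay.
    dx = p["sigma"] + p["rho"] * x * y / (p["eta"] + y) - p["mu"] * x * y - p["delta"] * x
    # Tumor cells: logistic growth minus killing by effector cells.
    dy = p["alpha"] * y * (1.0 - p["beta"] * y) - x * y
    return dx, dy


def rk4_trajectory(rhs: Callable[[State], State], state: State,
                   dt: float, n_steps: int) -> List[State]:
    """Integrate with the classical fixed-step fourth-order Runge-Kutta scheme."""
    def shifted(s: State, k: State, scale: float) -> State:
        return (s[0] + scale * k[0], s[1] + scale * k[1])

    trajectory = [state]
    for _ in range(n_steps):
        k1 = rhs(state)
        k2 = rhs(shifted(state, k1, 0.5 * dt))
        k3 = rhs(shifted(state, k2, 0.5 * dt))
        k4 = rhs(shifted(state, k3, dt))
        state = (
            state[0] + dt / 6.0 * (k1[0] + 2.0 * k2[0] + 2.0 * k3[0] + k4[0]),
            state[1] + dt / 6.0 * (k1[1] + 2.0 * k2[1] + 2.0 * k3[1] + k4[1]),
        )
        trajectory.append(state)
    return trajectory


# Illustrative initial condition, not taken from the patient dataset.
trajectory = rk4_trajectory(kuznetsov_rhs, (1.0, 50.0), dt=0.01, n_steps=1000)
```

Per-patient fitting would then adjust the entries of `PARAMS` so that the simulated tumor variable matches the measured tumor sizes; that estimation step is not shown here.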
Subject(s)
Mathematical Concepts , Neoplasms , Cell Count , Humans , Immunotherapy , Models, Biological , Neoplasms/therapy
ABSTRACT
SUMMARY: The molecular changes induced by perturbations such as drugs and ligands are highly informative of the intracellular wiring. Our capacity to generate large datasets is increasing steadily. A useful way to extract mechanistic insight from the data is by integrating them with a prior knowledge network of signalling to obtain dynamic models. CellNOpt is a collection of Bioconductor R packages for building logic models from perturbation data and prior knowledge of signalling networks. We have recently developed new components and refined the existing ones to keep up with the computational demand of increasingly large datasets, including (i) an efficient integer linear programming formulation, (ii) a probabilistic logic implementation for semi-quantitative datasets, (iii) the integration of a stochastic Boolean simulator, (iv) a tool to identify missing links, (v) systematic post-hoc analyses and (vi) an R-Shiny tool to run CellNOpt interactively. AVAILABILITY AND IMPLEMENTATION: R-package(s): https://github.com/saezlab/cellnopt. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
Subject(s)
Signal Transduction , Software , Logic
ABSTRACT
A biorefinery comprises a variety of process steps to synthesize products from sustainable natural resources. Dynamic plant-wide simulation enhances the process understanding, leads to improved cost efficiency and enables model-based operation and control. It is thereby important for an increased competitiveness to conventional processes. To this end, we developed a Modelica library with replaceable building blocks that allow dynamic modeling of an entire biorefinery. For the microbial conversion step, we built on the dynamic flux balance analysis (DFBA) approach to formulate process models for the simulation of cellular metabolism under changing environmental conditions. The resulting system of differential-algebraic equations with embedded optimization criteria (DAEO) is solved by a tailor-made toolbox. In summary, our modeling framework comprises three major pillars: A Modelica library of dynamic unit operations, an easy-to-use interface to formulate DFBA process models and a DAEO toolbox that allows simulation with standard environments based on the Modelica modeling language. A biorefinery model for dynamic simulation of the OrganoCat pretreatment process and microbial conversion of the resulting feedstock by Corynebacterium glutamicum serves as case study to demonstrate its practical relevance.
Subject(s)
Computer Simulation , Corynebacterium glutamicum/growth & development , Models, Biological
ABSTRACT
Correction for 'Conceptual design and analysis of ITM oxy-combustion power cycles' by N. D. Mancini et al., Phys. Chem. Chem. Phys., 2011, 13, 21351-21361.
ABSTRACT
We propose excess Gibbs free energy graph neural networks (GE-GNNs) for predicting composition-dependent activity coefficients of binary mixtures. The GE-GNN architecture ensures thermodynamic consistency by predicting the molar excess Gibbs free energy and using thermodynamic relations to obtain activity coefficients. As these are differential, automatic differentiation is applied to learn the activity coefficients in an end-to-end manner. Since the architecture is based on fundamental thermodynamics, we do not require additional loss terms to learn thermodynamic consistency. As the output is a fundamental property, we impose neither thermodynamic modeling limitations nor assumptions. We demonstrate high accuracy and thermodynamic consistency of the activity coefficient predictions.
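The differential relation between the molar excess Gibbs free energy and the activity coefficients of a binary mixture can be illustrated with a minimal sketch. It rests on two loudly stated assumptions: a two-parameter Margules expression (with made-up parameters `a12` and `a21`) stands in for the network output, and a central finite difference stands in for automatic differentiation. All function names are hypothetical.

```python
from typing import Callable, Tuple


def excess_gibbs(x1: float, a12: float = 0.5, a21: float = 1.2) -> float:
    """Dimensionless molar excess Gibbs energy g^E/(RT) of a binary mixture.

    Assumption: a two-parameter Margules model replaces the GNN prediction.
    """
    x2 = 1.0 - x1
    return x1 * x2 * (a21 * x1 + a12 * x2)


def ln_gammas(x1: float, g: Callable[[float], float] = excess_gibbs,
              h: float = 1e-6) -> Tuple[float, float]:
    """Log activity coefficients from g^E/(RT) via the binary relations

        ln(gamma_1) = g + x2 * dg/dx1
        ln(gamma_2) = g - x1 * dg/dx1
    """
    # Central finite difference; an autodiff framework would supply this exactly.
    dg_dx1 = (g(x1 + h) - g(x1 - h)) / (2.0 * h)
    x2 = 1.0 - x1
    ge = g(x1)
    return ge + x2 * dg_dx1, ge - x1 * dg_dx1


ln_g1, ln_g2 = ln_gammas(0.3)
```

Because both coefficients are derived from a single scalar function, the identity x1 ln(gamma_1) + x2 ln(gamma_2) = g^E/(RT) holds by construction, which is the sense in which no extra consistency loss term is needed.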
ABSTRACT
The energy transition is a multinational challenge to mitigate climate change, with a joint reduction target for greenhouse gas emissions. Simultaneously, each country is interested in minimizing its own energy supply cost. Still, most energy system models neglect national interests when identifying cost-optimal transition pathways. We design the European energy system transition until 2050, considering competition between countries in a shared electricity and carbon market using bilevel optimization. We find that national objectives substantially impact the transition pathway: Compared to the model solved using the common centralized optimization, the overall installed capacity increases by just 3% when including national interests. However, the distribution of the installed capacity changes dramatically by more than 40% in most countries. Our results underline the risk of miscalculating the need for national capacity expansion when neglecting stakeholder representation in energy system models and demonstrate the need for cooperation for an efficient energy transition.
ABSTRACT
The critical micelle concentration (CMC) of surfactant molecules is an essential property for surfactant applications in the industry. Recently, classical quantitative structure-property relationship (QSPR) and graph neural networks (GNNs), a deep learning technique, have been successfully applied to predict the CMC of surfactants at room temperature. However, these models have not yet considered the temperature dependence of the CMC, which is highly relevant to practical applications. We herein develop a GNN model for the temperature-dependent CMC prediction of surfactants. We collected about 1400 data points from public sources for all surfactant classes, i.e., ionic, nonionic, and zwitterionic, at multiple temperatures. We test the predictive quality of the model for the following scenarios: (i) when CMC data for surfactants are present in the training of the model in at least one different temperature and (ii) CMC data for surfactants are not present in the training, i.e., generalizing to unseen surfactants. In both test scenarios, our model exhibits a high predictive performance of R2 ≥ 0.95 on test data. We also find that the model performance varies with the surfactant class. Finally, we evaluate the model for sugar-based surfactants with complex molecular structures, as these represent a more sustainable alternative to synthetic surfactants and are therefore of great interest for future applications in the personal and home care industries.
ABSTRACT
The use of adsorption for the purification of dicarboxylic acids is rather limited and currently predominantly confined to ion-exchange chromatography. A promising, but less regarded alternative is the use of hydrophobic adsorbents. Regarding hydrophobic adsorbents, the literature focuses on screenings of adsorbents for purification of (di)carboxylic acids with regard to adsorption equilibria. The investigation of dynamic phenomena in the column received only minor attention. In this contribution, this knowledge gap is addressed. First, the adsorption behavior of itaconic acid species on the hydrophobic, highly-crosslinked polymeric adsorbent Chromalite™ PCG1200C is investigated. For this purpose, adsorption isotherms are determined via frontal analysis at pH values of 2, 3, 4.5, 6.5, and 8 to evaluate the dependency of the adsorption capacity on the dissociation state. Capacities above 150 g Lads-1 at liquid phase concentrations of 70 g L-1 are observed at a pH of 2. A strong decrease of capacity with increasing pH value, i.e., with increasing fraction of dissociated negatively charged acid species, is observed. Second, pulse experiments at the aforementioned pH values are performed. Thereby, in-line Raman spectra are recorded at the column outlet, which allows the direct differentiation of the acid species state of dissociation. The spectral information is evaluated for quantitative concentration profiles of itaconic acid species using Indirect Hard Modeling with mixture hard models that are calibrated subject to ideal as well as non-ideal thermodynamics. In-line measurement errors of ≤ 3.5 g L-1 are achieved for the itaconic acid species. Depending on the pH of the feed solution, a separation of the individual acid species within the pulse experiments is observed. It is conjectured that the process is dominated by a superposition of species-dependent adsorption characteristics and dissociation reactions.
Subject(s)
Spectrum Analysis, Raman , Succinates , Adsorption , Hydrogen-Ion Concentration , Hydrophobic and Hydrophilic Interactions , Kinetics , Polymers , Thermodynamics
ABSTRACT
We propose an approach for monitoring the concentration of dissociated carboxylic acid species in dilute aqueous solution. The dissociated acid species are quantified employing inline Raman spectroscopy in combination with indirect hard modeling (IHM) and multivariate curve resolution (MCR). We introduce two different titration-based hard model (HM) calibration procedures for a single mono- or polyprotic acid in water with well-known (method A) or unknown (method B) acid dissociation constants pKa. In both methods, spectra of only one acid species in water are prepared for each acid species. These spectra are used for the construction of HMs. For method A, the HMs are calibrated with calculated ideal dissociation equilibria. For method B, we estimate pKa values by fitting ideal acid dissociation equilibria to acid peak areas that are obtained from a spectral HM. The HM in turn is constructed on the basis of MCR data. Thus, method B on the basis of IHM is independent of a priori known pKa values, but instead provides them as part of the calibration procedure. As a detailed example, we analyze itaconic acid in aqueous solution. For all acid species and water, we obtain low HM errors of < 2.87 × 10^-4 mol mol-1 in the cases of both methods A and B. With only four calibration samples, IHM yields more accurate results than partial least squares regression. Furthermore, we apply our approach to formic, acetic, and citric acid in water, thereby verifying its generalizability as a process analytical technology for quantitative monitoring of processes containing carboxylic acids.
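The ideal dissociation equilibria used for calibration in method A can be sketched as the standard species-fraction formula for a mono- or polyprotic acid. The function name is hypothetical, activity coefficients are taken as unity (the ideal case), and the pKa values in the usage line are approximate, illustrative values for itaconic acid rather than numbers from this study.

```python
from typing import List, Sequence


def species_fractions(ph: float, pkas: Sequence[float]) -> List[float]:
    """Ideal mole fractions of H_nA, H_(n-1)A^-, ..., A^(n-) at a given pH.

    The j-th species (j protons removed) is weighted by
    [H+]^(n-j) * Ka_1 * ... * Ka_j, and the weights are normalised to one.
    """
    h = 10.0 ** (-ph)
    n = len(pkas)
    weights = []
    ka_product = 1.0
    for j in range(n + 1):
        if j > 0:
            ka_product *= 10.0 ** (-pkas[j - 1])
        weights.append(h ** (n - j) * ka_product)
    total = sum(weights)
    return [w / total for w in weights]


# Assumed, approximate pKa values for a diprotic acid such as itaconic acid.
fractions_ph2 = species_fractions(2.0, [3.85, 5.45])
fractions_ph8 = species_fractions(8.0, [3.85, 5.45])
```

Method B would run this relation in reverse: the pKa values become fit parameters that are adjusted until the computed fractions match the measured acid peak areas across the titration.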
ABSTRACT
Metabolic engineering relies on modifying gene expression to regulate protein concentrations and reaction activities. The gene expression is controlled by the promoter sequence, and sequence libraries are used to scan expression activities and to identify correlations between sequence and activity. We introduce a computational workflow called Exp2Ipynb to analyze promoter libraries maximizing information retrieval and promoter design with desired activity. We applied Exp2Ipynb to seven prokaryotic expression libraries to identify optimal experimental design principles. The workflow is open source, available as Jupyter Notebooks and covers the steps to 1) generate a statistical overview to sequence and activity, 2) train machine-learning algorithms, such as random forest, gradient boosting trees and support vector machines, for prediction and extraction of feature importance, 3) evaluate the performance of the estimator, and 4) to design new sequences with a desired activity using numerical optimization. The workflow can perform regression or classification on multiple promoter libraries, across species or reporter proteins. The most accurate predictions in the sample libraries were achieved when the promoters in the library were recognized by a single sigma factor and a unique reporter system. The prediction confidence mostly depends on sample size and sequence diversity, and we present a relationship to estimate their respective effects. The workflow can be adapted to process sequence libraries from other expression-related problems and increase insight to the growing application of high-throughput experiments, providing support for efficient strain engineering.
ABSTRACT
Model-based fuel design can tailor fuels to advanced engine concepts while minimizing environmental impact and production costs. A rationally designed ketone-ester-alcohol-alkane (KEAA) blend for high efficiency spark-ignition engines was assessed in a multi-disciplinary manner, from production cost to ignition characteristics, engine performance, ecotoxicity, microbial storage stability, and carbon footprint. The comparison included RON 95 E10, ethanol, and two previously designed fuels. KEAA showed high indicated efficiencies in a single-cylinder research engine. Ignition delay time measurements confirmed KEAA's high auto-ignition resistance. KEAA exhibits a moderate toxicity and is not prone to microbial infestation. A well-to-wheel analysis showed the potential to lower the carbon footprint by 95 percent compared to RON 95 E10. The findings motivate further investigations on KEAA and demonstrate advancements in model-based fuel design.
ABSTRACT
Understanding the mechanisms of cell function and drug action is a major endeavor in the pharmaceutical industry. Drug effects are governed by the intrinsic properties of the drug (i.e., selectivity and potency) and the specific signaling transduction network of the host (i.e., normal vs. diseased cells). Here, we describe an unbiased, phosphoproteomic-based approach to identify drug effects by monitoring drug-induced topology alterations. With our proposed method, drug effects are investigated under diverse stimulations of the signaling network. Starting with a generic pathway made of logical gates, we build a cell-type specific map by constraining it to fit 13 key phosphoprotein signals under 55 experimental conditions. Fitting is performed via an Integer Linear Program (ILP) formulation and solution by standard ILP solvers; a procedure that drastically outperforms previous fitting schemes. Then, knowing the cell's topology, we monitor the same key phosphoprotein signals under the presence of drug and we re-optimize the specific map to reveal drug-induced topology alterations. To prove our case, we make a topology for the hepatocytic cell-line HepG2 and we evaluate the effects of 4 drugs: 3 selective inhibitors for the Epidermal Growth Factor Receptor (EGFR) and a non-selective drug. We confirm effects easily predictable from the drugs' main target (i.e., EGFR inhibitors block the EGFR pathway) but we also uncover unanticipated effects due to either drug promiscuity or the cell's specific topology. An interesting finding is that the selective EGFR inhibitor Gefitinib inhibits signaling downstream of the Interleukin-1alpha (IL1alpha) pathway; an effect that cannot be extracted from binding affinity-based approaches. Our method represents an unbiased approach to identify drug effects on small to medium size pathways which is scalable to larger topologies with any type of signaling interventions (small molecules, RNAi, etc.).
The method can reveal drug effects on pathways, the cornerstone for identifying mechanisms of drug efficacy.
Subject(s)
Models, Biological , Pharmacology/methods , Phosphoproteins/metabolism , Proteomics/methods , Signal Transduction/drug effects , Algorithms , Antineoplastic Agents/pharmacology , Databases, Protein , Hep G2 Cells , Humans , Reproducibility of Results
ABSTRACT
Particle size distribution and in particular the mean particle size are key properties of microgels, which are determined by synthesis conditions. To describe particle growth and particle size distribution over the progress of synthesis of poly(N-vinylcaprolactam)-based microgels, a pseudo-bulk model for precipitation copolymerization with cross-linking is formulated. The model is fitted and compared to experimental data from reaction calorimetry and dynamic light scattering, showing good agreement with polymerization progress, final particle size, and narrow particle size distribution. Predictions of particle growth and reaction progress for different experimental setups are compared to the corresponding experimental data, demonstrating the predictive capability and limitations of the model. The comparison to reaction calorimetry measurements shows the strength in the prediction of the overall polymerization progress. The results for the prediction of the particle radii reveal significant deviations and highlight the demand for further investigation, including additional data.
ABSTRACT
Poly(N-isopropylacrylamide) microgels have found various uses in fundamental polymer and colloid science as well as in different applications. They are conveniently prepared by precipitation polymerization. In this reaction, radical polymerization and colloidal stabilization interact with each other to produce well-defined thermosensitive particles of narrow size distribution. However, the underlying mechanism of precipitation polymerization has not been fully understood. In particular, the crucial early stages of microgel formation have been poorly investigated so far. In this contribution, we have used small-angle neutron scattering in conjunction with a stopped-flow device to monitor the particle growth during precipitation polymerization in situ. The average particle volume growth is found to follow pseudo-first order kinetics, indicating that the polymerization rate is determined by the availability of the unreacted monomer, as the initiator concentration does not change considerably during the reaction. This is confirmed by calorimetric investigation of the polymerization process. Peroxide initiator-induced self-crosslinking of N-isopropylacrylamide and the use of the bifunctional crosslinker N,N'-methylenebisacrylamide are shown to decrease the particle number density in the batch. The results of the in situ small-angle neutron scattering measurements indicate that the particles form at an early stage in the reaction and their number density remains approximately the same thereafter. The overall reaction rate is found to be sensitive to monomer and initiator concentration in accordance with a radical solution polymerization mechanism, supporting the results from our earlier studies.
ABSTRACT
An event-driven approach based on dynamic optimization and nonlinear model predictive control (NMPC) is investigated together with inline Raman spectroscopy for process monitoring and control. The benefits and challenges in polymerization and morphology monitoring are presented, and an overview of the used mechanistic models and the details of the dynamic optimization and NMPC approach to achieve the relevant process objectives are provided. Finally, the implementation of the approach is discussed, and results from experiments in lab and pilot-plant reactors are presented.
ABSTRACT
This contribution presents in-line monitoring of microgel synthesis by precipitation polymerization based on Raman spectroscopy. The spectra are evaluated via multivariate Indirect Hard Modeling (IHM) regression. To this end, mechanistic models of the pure component spectra for solvent, monomer, and microgel are created by a sum of adaptable parameterized peak functions (Gaussian-Lorentzian). Instead of individual calibrations for each analyte, one comprehensive model is calibrated to predict both the monomer and microgel fraction while ensuring a consistent mass balance. As a novelty, this leads to an in-line microgel quantification based on an interactive spectral model. The results show cross-validation errors (RMSECV) of monomer and microgel fractions as low as 0.028 wt % and 0.084 wt %, respectively. The ability of IHM to account for non-linear spectral changes was found to reduce the microgel RMSECV by a factor of two compared to linear CLS regression. The calibration model allows simultaneous observation of the decrease in monomer content and the formation of microgels. Long as well as short focus immersion optics reveal characteristic vibrations of the turbid microgel suspension, although long focus optics are influenced by scattering particles to a greater extent. Precise examination of the model proves that the prediction is robust against changes in microgel particle size or temperature, which opens up the application of Raman spectroscopy as a comprehensive process analytical technology in microgel synthesis.
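A pure-component hard model built from parameterized Gaussian-Lorentzian peaks can be sketched as below. The abstract does not state the exact peak parametrization, so the pseudo-Voigt form used here (a linear mix of a Gaussian and a Lorentzian sharing one full width at half maximum) is an assumption, and the `Peak` class and function names are hypothetical.

```python
import math
from dataclasses import dataclass
from typing import Iterable


@dataclass
class Peak:
    position: float        # peak centre, e.g. Raman shift in cm^-1
    height: float          # peak maximum
    fwhm: float            # full width at half maximum
    gauss_fraction: float  # 1.0 = pure Gaussian, 0.0 = pure Lorentzian


def peak_value(peak: Peak, x: float) -> float:
    """Pseudo-Voigt profile (assumed form) evaluated at x."""
    u = (x - peak.position) / (0.5 * peak.fwhm)  # distance in half-widths
    gaussian = math.exp(-math.log(2.0) * u * u)
    lorentzian = 1.0 / (1.0 + u * u)
    mix = peak.gauss_fraction
    return peak.height * (mix * gaussian + (1.0 - mix) * lorentzian)


def hard_model(peaks: Iterable[Peak], x: float) -> float:
    """Pure-component spectrum as the sum of its peak functions."""
    return sum(peak_value(p, x) for p in peaks)


# Illustrative two-peak component; the numbers are made up.
component = [Peak(1000.0, 2.0, 10.0, 0.3), Peak(1030.0, 1.0, 8.0, 0.7)]
intensity = hard_model(component, 1005.0)
```

Keeping position, height, width, and shape as free parameters per peak is what lets such a model follow peak shifts and broadening, in contrast to a fixed-spectrum linear regression.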
ABSTRACT
Construction of large and cell-specific signaling pathways is essential to understand information processing under normal and pathological conditions. On this front, gene-based approaches offer the advantage of large pathway exploration whereas phosphoproteomic approaches offer a more reliable view of pathway activities but are applicable to small pathway sizes. In this paper, we demonstrate an experimentally adaptive approach to construct large signaling pathways from phosphoproteomic data within a 3-day time frame. Our approach--taking advantage of the fast turnaround time of the xMAP technology--is carried out in four steps: (i) screen optimal pathway inducers, (ii) select the responsive ones, (iii) combine them in a combinatorial fashion to construct a phosphoproteomic dataset, and (iv) optimize a reduced generic pathway via an Integer Linear Programming formulation. As a case study, we uncover novel players and their corresponding pathways in primary human hepatocytes by interrogating the signal transduction downstream of 81 receptors of interest and constructing a detailed model for the responsive part of the network comprising 177 species (of which 14 are measured) and 365 interactions.