ABSTRACT
Mining of electronic health records (EHR) promises to automate the identification of comprehensive disease phenotypes. However, the realization of this promise is hindered by the unavailability of generalizable ground-truth information, data incompleteness and heterogeneity, and the lack of generalization to multiple cohorts. We present here a data-driven approach to identify clinical states that we implement for 585 critical care patients with suspected pneumonia recruited by the SCRIPT study, which we compare to and integrate with 9,918 pneumonia patients from the MIMIC-IV dataset. We extract and curate from their structured EHRs a primary set of clinical features (53 and 59 features for SCRIPT and MIMIC-IV, respectively), including disease severity scores, vital signs, and so on, at various degrees of completeness. We aggregate irregular time series into daily frequency, resulting in 12,495 and 94,684 patient-day pairs for SCRIPT and MIMIC-IV, respectively. We define a "common-sense" ground truth that we then use in a semi-supervised pipeline to optimize choices for data preprocessing, and reduce the feature space to four principal components. We describe and validate an ensemble-based clustering method that enables us to robustly identify five clinical states, and use a Gaussian mixture model to quantify uncertainty in cluster assignment. Demonstrating the clinical relevance of the identified states, we find that three states are strongly associated with disease outcomes (dying vs. recovering), while the other two reflect disease etiology. The outcome-associated clinical states provide significantly increased discrimination of mortality rates over standard approaches.
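The daily aggregation step mentioned in the abstract can be illustrated with a minimal sketch. It assumes that measurements arrive as (patient, timestamp, feature, value) tuples and that the daily value is the mean of that day's measurements; the abstract does not state the aggregation function, so the mean, the function name `aggregate_daily`, and the toy records are all hypothetical.

```python
from collections import defaultdict
from datetime import datetime
from statistics import mean


def aggregate_daily(records):
    """Collapse irregular measurements into one value per patient-day and feature.

    records: iterable of (patient_id, ISO timestamp string, feature name, value).
    Returns {(patient_id, date): {feature: daily mean}}.
    The daily mean is an assumption; the source does not name the statistic used.
    """
    buckets = defaultdict(lambda: defaultdict(list))
    for patient_id, timestamp, feature, value in records:
        day = datetime.fromisoformat(timestamp).date()
        buckets[(patient_id, day)][feature].append(value)
    return {
        key: {feature: mean(values) for feature, values in features.items()}
        for key, features in buckets.items()
    }


# Hypothetical toy records, not data from SCRIPT or MIMIC-IV.
records = [
    ("p1", "2020-03-01T08:00:00", "heart_rate", 90.0),
    ("p1", "2020-03-01T20:30:00", "heart_rate", 110.0),
    ("p1", "2020-03-02T09:15:00", "heart_rate", 95.0),
    ("p2", "2020-03-01T11:00:00", "temperature", 38.5),
]
patient_days = aggregate_daily(records)
print(len(patient_days))  # number of patient-day pairs in the toy data
```

Each key of the returned dictionary is one patient-day pair, which is the unit the abstract counts (12,495 for SCRIPT and 94,684 for MIMIC-IV).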
Subject(s)
Data Mining , Electronic Health Records , Pneumonia , Humans , Pneumonia/mortality , Pneumonia/epidemiology , Data Mining/methods , Male , Female , Cluster Analysis
ABSTRACT
Present-day publications on human genes primarily feature genes that already appeared in many publications prior to completion of the Human Genome Project in 2003. These patterns persist despite the subsequent adoption of high-throughput technologies, which routinely identify novel genes associated with biological processes and disease. Although several hypotheses for bias in the selection of genes as research targets have been proposed, their explanatory powers have not yet been compared. Our analysis suggests that understudied genes are systematically abandoned in favor of better-studied genes between the completion of -omics experiments and the reporting of results. Understudied genes remain abandoned by studies that cite these -omics experiments. Conversely, we find that publications on understudied genes may even accrue a greater number of citations. Among 45 biological and experimental factors previously proposed to affect which genes are being studied, we find that 33 are significantly associated with the choice of hit genes presented in titles and abstracts of -omics studies. To promote the investigation of understudied genes, we condense our insights into a tool, find my understudied genes (FMUG), that allows scientists to engage with potential bias during the selection of hits. We demonstrate the utility of FMUG through the identification of genes that remain understudied in vertebrate aging. FMUG is developed in Flutter and is available for download at fmug.amaral.northwestern.edu as a MacOS/Windows app.
Modern techniques for studying human genetics have helped to identify 20,000 protein-encoding genes in the human genome. Yet scientists have not studied most of them, including genes linked to human diseases in genome-wide studies. For example, about 44% of the genes associated with Alzheimer's disease have never been mentioned in the title or summary of a scientific article. Why so many health-linked genes have yet to be examined is unclear. Many genetic studies instead focus on genes already studied before the Human Genome Project mapped the entire genome in 2003. There are many reasons why scientists may ignore potentially disease-causing genes. They may feel that well-studied genes are safer bets or more likely to result in high-profile publications. Or they may lack the tools to study less well-characterized genes. Richardson et al. analyzed the scientific literature for clues on why so many genes are being ignored by scientists. The analysis included hundreds of articles that used a wide range of genetic techniques, including genome-wide association studies, RNA sequencing, and gene editing tools to scour the genome for disease-linked genes. It revealed that scientists abandon the study of many genes early in the research process and identified 33 reasons why. Contrary to scientists' fears, Richardson et al. show that reports on understudied genes often garner more attention than studies on well-known genes. Richardson et al. used their results to create a downloadable tool called "Find My Understudied Genes (FMUG)" to help scientists identify understudied genes and counteract bias toward more well-studied genes. The app may help scientists make informed decisions about which understudied genes to research. If the tool helps boost investigation of understudied genes, it may help speed up progress towards understanding human genetics and how various genes may contribute to diseases.
Subject(s)
Aging , Physicians , Humans , Biological Assay
ABSTRACT
Present-day publications on human genes primarily feature genes that already appeared in many publications prior to completion of the Human Genome Project in 2003. These patterns persist despite the subsequent adoption of high-throughput technologies, which routinely identify novel genes associated with biological processes and disease. Although several hypotheses for bias in the selection of genes as research targets have been proposed, their explanatory powers have not yet been compared. Our analysis suggests that understudied genes are systematically abandoned in favor of better-studied genes between the completion of -omics experiments and the reporting of results. Understudied genes remain abandoned by studies that cite these -omics experiments. Conversely, we find that publications on understudied genes may even accrue a greater number of citations. Among 45 biological and experimental factors previously proposed to affect which genes are being studied, we find that 33 are significantly associated with the choice of hit genes presented in titles and abstracts of -omics studies. To promote the investigation of understudied genes, we condense our insights into a tool, find my understudied genes (FMUG), that allows scientists to engage with potential bias during the selection of hits. We demonstrate the utility of FMUG through the identification of genes that remain understudied in vertebrate aging. FMUG is developed in Flutter and is available for download at fmug.amaral.northwestern.edu as a MacOS/Windows app.
ABSTRACT
Under-recognition of acute respiratory distress syndrome (ARDS) by clinicians is an important barrier to adoption of evidence-based practices such as low tidal volume ventilation. The burden created by the COVID-19 pandemic makes it even more critical to develop scalable data-driven tools to improve ARDS recognition. The objective of this study was to validate a tool for accurately estimating clinician ARDS recognition rates using discrete clinical characteristics easily available in electronic health records. We conducted a secondary analysis of 2,705 ARDS and 1,261 non-ARDS hypoxemic patients in the international LUNG SAFE cohort. The primary outcome was validation of a tool that estimates clinician ARDS recognition rates from health record data. Secondary outcomes included the relative impact of clinical characteristics on tidal volume delivery and clinician documentation of ARDS. In both ARDS and non-ARDS patients, greater height was associated with lower standardized tidal volume (mL/kg PBW) (ARDS: adjusted β = -4.1, 95% CI -4.5 to -3.6; non-ARDS: β = -7.7, 95% CI -8.8 to -6.7, P<0.00009 [where α = 0.01/111 with the Bonferroni correction]). Standardized tidal volume has already been normalized for patient height, and furthermore, height was not associated with clinician documentation of ARDS. Worsening hypoxemia was associated with both increased clinician documentation of ARDS (β = -0.074, 95% CI -0.093 to -0.056, P<0.00009) and lower standardized tidal volume (β = 1.3, 95% CI 0.94 to 1.6, P<0.00009) in ARDS patients. Increasing chest imaging opacities, plateau pressure, and clinician documentation of ARDS also were associated with lower tidal volume in ARDS patients. Our EHR-based data-driven approach using height, gender, ARDS documentation, and lowest standardized tidal volume yielded estimates of clinician ARDS recognition rates of 54% for mild, 63% for moderate, and 73% for severe ARDS.
Our tool replicated clinician-reported ARDS recognition in the LUNG SAFE study, enabling the identification of ARDS patients at high risk of being unrecognized. Our approach can be generalized to other conditions for which there is a need to increase adoption of evidence-based care.
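The significance threshold quoted in this abstract follows from the Bonferroni correction. The worked arithmetic below uses only the values stated there (a family-wise level of 0.01 and 111 comparisons); the symbols $\alpha_{\text{adj}}$ and $m$ are introduced here for illustration.

```latex
\alpha_{\text{adj}} = \frac{\alpha}{m} = \frac{0.01}{111} \approx 9.0 \times 10^{-5}
```

An association is therefore reported as significant only when its P value falls below roughly 0.00009, which matches the threshold given with each estimate.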
ABSTRACT
Transportation networks play a critical role in human mobility and the exchange of goods, but they are also the primary vehicles for the worldwide spread of infections, and account for a significant fraction of CO2 emissions. We investigate the edge removal dynamics of two mature but fast-changing transportation networks: the Brazilian domestic bus transportation network and the U.S. domestic air transportation network. We use machine learning approaches to predict edge removal on a monthly time scale and find that models trained on data for a given month predict edge removals for the same month with high accuracy. For the air transportation network, we also find that models trained for a given month are still accurate for other months even in the presence of external shocks. We take advantage of this approach to forecast the impact of a hypothetical dramatic reduction in the scale of the U.S. air transportation network as a result of policies to reduce CO2 emissions. Our forecasting approach could be helpful in building scenarios for planning future infrastructure.
Subject(s)
Carbon Dioxide , Transportation , Brazil , Carbon Dioxide/analysis , Forecasting , Humans , Machine Learning
ABSTRACT
Allostery governing two conformational states is one of the proposed mechanisms for catch-bond behavior in adhesive proteins. In FimH, a catch-bond protein expressed by pathogenic bacteria, separation of two domains disrupts inhibition by the pilin domain. Thus, tensile force can induce a conformational change in the lectin domain, from an inactive state to an active state with high affinity. To better understand allosteric inhibition in two-domain FimH (H2 inactive), we use molecular dynamics simulations to study the lectin domain alone, which has high affinity (HL active), and also the lectin domain stabilized in the low-affinity conformation by an Arg-60-Pro mutation (HL mutant). Because ligand-binding induces an allostery-like conformational change in HL mutant, this more experimentally tractable version has been proposed as a "minimal model" for FimH. We find that HL mutant has larger backbone fluctuations than both H2 inactive and HL active, at the binding pocket and allosteric interdomain region. We use an internal coordinate system of dihedral angles to identify protein regions with differences in backbone and side chain dynamics beyond the putative allosteric pathway sites. By characterizing HL mutant dynamics for the first time, we provide additional insight into the transmission of allosteric information across the lectin domain and build upon structural and thermodynamic data in the literature to further support the use of HL mutant as a "minimal model." Understanding how to alter protein dynamics to prevent the allosteric conformational change may guide drug development to prevent infection by blocking FimH adhesion.
Subject(s)
Adhesins, Escherichia coli , Fimbriae Proteins , Adhesins, Escherichia coli/chemistry , Adhesins, Escherichia coli/genetics , Adhesins, Escherichia coli/metabolism , Allosteric Site , Fimbriae Proteins/chemistry , Fimbriae Proteins/genetics , Fimbriae Proteins/metabolism , Molecular Dynamics Simulation , Mutation/genetics , Protein Conformation , Protein Domains , Protein Stability , Thermodynamics
ABSTRACT
Female representation has been slowly but steadily increasing in many sectors of society. One sector where one would expect to see gender parity is the movie industry, yet the representation of females in most functions within the U.S. movie industry remains surprisingly low. Here, we study the historical patterns of female representation among actors, directors, and producers in an attempt to gain insights into the possible causes of the lack of gender parity in the industry. Our analyses reveal a remarkable temporal coincidence between the collapse in female representation across all functions and the advent of the Studio System, a period when the major Hollywood studios controlled all aspects of the industry. Female representation among actors, directors, producers, and writers dropped to extraordinarily low values during the emergence and consolidation of the Studio System, and in some cases it has not yet recovered to pre-Studio System levels. In order to explore some possible mechanisms behind these patterns, we investigate the association between the gender balance of actors, writers, directors, and producers and a number of economic indicators, movie industry indicators, and movie characteristics. We find robust, strong, and significant associations which are consistent with an important role for the gender of decision makers on the gender balance of other industry functions. While in no way demonstrating causality, our findings add new perspectives to the discussions of the reasons for female under-representation in fields such as computer science and medicine, which have also experienced dramatic changes in female representation.
Subject(s)
Gender Identity , Industry/statistics & numerical data , Motion Pictures/statistics & numerical data , Decision Making , Female , Humans , Male , United States
ABSTRACT
In this Formal Comment, the authors of the recent publication "Large-scale investigation of the reasons why potentially important genes are ignored" maintain that it can be read as an opportunity to explore the unknown.
Subject(s)
Publications , Publishing
ABSTRACT
Collaboration plays an increasingly important role in promoting research productivity and impact. What remains unclear is whether female and male researchers in science, technology, engineering, and mathematics (STEM) disciplines differ in their collaboration propensity. Here, we report on an empirical analysis of the complete publication records of 3,980 faculty members in six STEM disciplines at select U.S. research universities. We find that female faculty have significantly fewer distinct co-authors over their careers than males, but that this difference can be fully accounted for by females' lower publication rate and shorter career lengths. Next, we find that female scientists have a lower probability of repeating previous co-authors than males, an intriguing result because prior research shows that teams involving new collaborations produce work with higher impact. Finally, we find evidence for gender segregation in some sub-disciplines in molecular biology, in particular in genomics, where we find female faculty to be clearly under-represented.
Subject(s)
Cooperative Behavior , Occupations , Publishing , Sex Factors , Faculty , Female , Humans , Male , Research
ABSTRACT
How to quantify the impact of a researcher's or an institution's body of work is a matter of increasing importance to scientists, funding agencies, and hiring committees. The use of bibliometric indicators, such as the h-index or the Journal Impact Factor, has become widespread despite their known limitations. We argue that most existing bibliometric indicators are inconsistent, biased, and, worst of all, susceptible to manipulation. Here, we pursue a principled approach to the development of an indicator to quantify the scientific impact of both individual researchers and research institutions grounded on the functional form of the distribution of the asymptotic number of citations. We validate our approach using the publication records of 1,283 researchers from seven scientific and engineering disciplines and the chemistry departments at the 106 U.S. research institutions classified as "very high research activity". Our approach has three distinct advantages. First, it accurately captures the overall scientific impact of researchers at all career stages, as measured by asymptotic citation counts. Second, unlike other measures, our indicator is resistant to manipulation and rewards publication quality over quantity. Third, our approach captures the time-evolution of the scientific impact of research institutions.
Subject(s)
Models, Theoretical , Publications/statistics & numerical data , Research Personnel/statistics & numerical data , Academies and Institutes , Databases, Factual
ABSTRACT
High-throughput experimental techniques and bioinformatics tools make it possible to obtain reconstructions of the metabolism of microbial species. Combined with mathematical frameworks such as flux balance analysis, which assumes that nutrients are used so as to maximize growth, these reconstructions enable us to predict microbial growth. Although such predictions are generally accurate, these approaches do not give insights on how different nutrients are used to produce growth, and thus are difficult to generalize to new media or to different organisms. Here, we propose a systems-level phenomenological model of metabolism inspired by the virial expansion. Our model predicts biomass production given the nutrient uptakes and a reduced set of parameters, which can be easily determined experimentally. To validate our model, we test it against in silico simulations and experimental measurements of growth, and find good agreement. From a biological point of view, our model uncovers the impact that individual nutrients and the synergistic interaction between nutrient pairs have on growth, and suggests that we can understand the growth maximization principle as the optimization of nutrient synergies.
Subject(s)
Growth , Microbiological Phenomena , Models, Biological , Biomass , Computer Simulation
ABSTRACT
In a world overflowing with creative works, it is useful to be able to filter out the unimportant works so that the significant ones can be identified and thereby absorbed. An automated method could provide an objective approach for evaluating the significance of works on a universal scale. However, there have been few attempts at creating such a measure, and there are few "ground truths" for validating the effectiveness of potential metrics for significance. For movies, the US Library of Congress's National Film Registry (NFR) contains American films that are "culturally, historically, or aesthetically significant" as chosen through a careful evaluation and deliberation process. By analyzing a network of citations between 15,425 United States-produced films procured from the Internet Movie Database (IMDb), we obtain several automated metrics for significance. The best of these metrics is able to indicate a film's presence in the NFR at least as well as, or better than, metrics based on aggregated expert opinions or large population surveys. Importantly, automated metrics can easily be applied to older films for which no other rating may be available. Our results may have implications for the evaluation of other creative works such as scientific research.
ABSTRACT
Although the mapping of codon to amino acid is conserved across nearly all species, the frequency at which synonymous codons are used varies both between organisms and between genes from the same organism. This variation affects diverse cellular processes including protein expression, regulation, and folding. Here, we mathematically model an additional layer of complexity and show that individual codon usage biases follow a position-dependent exponential decay model with unique parameter fits for each codon. We use this methodology to perform an in-depth analysis on codon usage bias in the model organism Escherichia coli. Our methodology shows that lowly and highly expressed genes are more similar in their codon usage patterns in the 5'-gene regions, but that these preferences diverge at distal sites, resulting in greater positional dependency (pD, which we mathematically define later) for highly expressed genes. We show that position-dependent codon usage bias is partially explained by the structural requirements of mRNAs that result in increased usage of A/T rich codons shortly after the gene start. However, we also show that the pD of 4- and 6-fold degenerate codons is partially related to the gene copy number of cognate tRNAs, supporting existing hypotheses that posit benefits to a region of slow translation in the beginning of coding sequences. Lastly, we demonstrate that viewing codon usage bias through a position-dependent framework has practical utility by improving accuracy of gene expression prediction when incorporating positional dependencies into the Codon Adaptation Index model.
Subject(s)
Codon , Escherichia coli Proteins/genetics , Escherichia coli/genetics , DNA, Bacterial , Evolution, Molecular , Genetic Variation , Likelihood Functions , Models, Genetic , Phylogeny
ABSTRACT
Genetic algorithms (GAs) have been used to find efficient solutions to numerous fundamental and applied problems. While GAs are a robust and flexible approach to solve complex problems, there are some situations under which they perform poorly. Here, we introduce a genetic algorithm approach that is able to solve complex tasks plagued by so-called "golf-course"-like fitness landscapes. Our approach, which we denote variable environment genetic algorithms (VEGAs), is able to find highly efficient solutions by inducing environmental changes that require more complex solutions and thus creating an evolutionary drive. Using the density classification task, a paradigmatic computer science problem, as a case study, we show that more complex rules that preserve information about the solution to simpler tasks can adapt to more challenging environments. Interestingly, we find that conservative strategies, which have a bias toward the current state, evolve naturally as a highly efficient solution to the density classification task under noisy conditions.
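To make the density classification task concrete, here is a minimal sketch of the standard setup: a one-dimensional binary cellular automaton on a ring should settle to all ones if ones are the initial majority and to all zeros otherwise. The local-majority rule used below is only a simple baseline chosen for illustration, not a rule evolved by VEGAs; the function names, radius, lattice size, and trial count are all assumptions.

```python
import random


def step(cells, radius=3):
    """One synchronous update of a local-majority rule on a ring of binary cells."""
    n = len(cells)
    window = 2 * radius + 1
    updated = []
    for i in range(n):
        ones = sum(cells[(i + d) % n] for d in range(-radius, radius + 1))
        updated.append(1 if 2 * ones > window else 0)
    return updated


def classify(cells, max_steps=200, radius=3):
    """Iterate until a fixed point or max_steps.

    Returns 1 or 0 if the lattice ended uniform, otherwise None (no decision).
    """
    for _ in range(max_steps):
        nxt = step(cells, radius)
        if nxt == cells:
            break
        cells = nxt
    if all(c == 1 for c in cells):
        return 1
    if not any(cells):
        return 0
    return None


def performance(n_cells=149, trials=100, seed=0):
    """Fraction of random initial conditions classified correctly (odd n avoids ties)."""
    rng = random.Random(seed)
    correct = 0
    for _ in range(trials):
        cells = [rng.randint(0, 1) for _ in range(n_cells)]
        target = 1 if 2 * sum(cells) > n_cells else 0
        if classify(cells) == target:
            correct += 1
    return correct / trials


# A block of ones next to a block of zeros is a fixed point of the
# radius-1 majority rule, so the lattice never becomes uniform.
print(classify([1, 1, 1, 1, 1, 0, 0], radius=1))  # prints None
```

The stuck configuration in the last line shows why purely local voting is a weak strategy for this task, which is what makes it a useful benchmark for evolved rules.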
Subject(s)
Algorithms , Genome , Models, Genetic , Biological Evolution , Computer Simulation , Monte Carlo Method
ABSTRACT
Social groups of interacting agents display an ability to coordinate in the absence of a central authority, a phenomenon that has been recently amplified by the widespread availability of social networking technologies. Models of opinion formation in a population of agents have proven a very useful tool to investigate these phenomena that arise independently of the heterogeneities across individuals and can be used to identify the factors that determine whether widespread consensus on an initial small majority is reached. Recently, we introduced a model in which individual agents can have conservative and partisan biases. Numerical simulations for finite populations showed that while the inclusion of conservative agents in a population enhances the population's efficiency in reaching consensus on the initial majority opinion, even a small fraction of partisans leads the population to converge on the opinion initially held by a minority. To further understand the mechanisms leading to our previous numerical results, we investigate analytically the noise-driven transition from a regime in which the population reaches a majority consensus (efficient) to a regime in which the population settles in deadlock (non-efficient). We show that the mean-field solution captures what we observe in model simulations. Populations of agents with no opinion bias show a continuous transition to a deadlock regime, while populations with an opinion bias show a discontinuous transition between efficient and partisan regimes. Furthermore, the analytical solution reveals that populations with an increasing fraction of conservative agents are more robust against noise than a population of naive agents because in the efficient regime there are relatively more conservative than naive agents holding the majority opinion.
In contrast, populations with partisan agents are less robust to noise with an increasing fraction of partisans, because in the efficient regime there are relatively more naive agents than partisan agents holding the majority opinion.
Subject(s)
Consensus , Prejudice , Computer Simulation , Humans , Models, Theoretical , Public Opinion
ABSTRACT
The ability of microbial species to consume compounds found in the environment to generate commercially valuable products has long been exploited by humanity. The untapped, staggering diversity of microbial organisms offers a wealth of potential resources for tackling medical, environmental, and energy challenges. Understanding microbial metabolism will be crucial to many of these potential applications. Thermodynamically feasible metabolic reconstructions can be used, under some conditions, to predict the growth rate of certain microbes using constraint-based methods. While these reconstructions are powerful, they are still cumbersome to build and, because of the complexity of metabolic networks, it is hard for researchers to gain from these reconstructions an understanding of why a certain nutrient yields a given growth rate for a given microbe. Here, we present a simple model of biomass production that accurately reproduces the predictions of thermodynamically feasible metabolic reconstructions. Our model makes use of only: i) a nutrient's structure and function, ii) the presence of a small number of enzymes in the organism, and iii) the carbon flow in pathways that catabolize nutrients. When applied to test organisms, our model allows us to predict whether a nutrient can be a carbon source with an accuracy of about 90% with respect to in silico experiments. In addition, our model provides excellent predictions of whether a medium will produce more or less growth than another (p<10^-6) and good predictions of the actual value of the in silico biomass production.
Subject(s)
Bacteria/metabolism , Models, Biological , Saccharomyces cerevisiae/metabolism , Systems Biology/methods , Biomass , Carbon/metabolism , Carbon Cycle , Computer Simulation , Metabolism , Reproducibility of Results
ABSTRACT
Studying the interaction between a system's components and the temporal evolution of the system are two common ways to uncover and characterize its internal workings. Recently, several maps from a time series to a network have been proposed with the intent of using network metrics to characterize time series. Although these maps demonstrate that different time series result in networks with distinct topological properties, it remains unclear how these topological properties relate to the original time series. Here, we propose a map from a time series to a network with an approximate inverse operation, making it possible to use network statistics to characterize time series and time series statistics to characterize networks. As a proof of concept, we generate an ensemble of time series ranging from periodic to random and confirm that application of the proposed map retains much of the information encoded in the original time series (or networks) after application of the map (or its inverse). Our results suggest that network analysis can be used to distinguish different dynamic regimes in time series and, perhaps more importantly, time series analysis can provide a powerful set of tools that augment the traditional network analysis toolkit to quantify networks in new and useful ways.
Subject(s)
Algorithms , Models, Biological , Signal Transduction/physiology , Arabidopsis/physiology , Computer Simulation , Heart Rate/physiology , Humans , Kinetics , Time Factors
ABSTRACT
1. The idea that species occupy distinct niches is a fundamental concept in ecology. Classically, the niche was described as an n-dimensional hypervolume where each dimension represents a biotic or abiotic characteristic. More recently, it has been hypothesised that a single dimension may be sufficient to explain the system-level organization of trophic interactions observed between species in a community. 2. Here, we test the hypothesis that species body mass is that single dimension. Specifically, we determine how the intervality of food webs ordered by body size compares to that of randomly ordered food webs. We also extend this analysis beyond the community level to the effect of body mass in explaining the diets of individual species. 3. We conclude that body mass significantly explains the ordering of species and the contiguity of diets in empirical communities. 4. At the species-specific level, we find that the degree to which body mass is a significant explanatory variable depends strongly on the phylogenetic history, suggesting that other evolutionarily conserved traits partly account for species' roles in the food web. 5. Our investigation of the role of body mass in food webs thus helps us to better understand the important features of community food-web structure and the evolutionary forces that have led us to the communities we observe.