ABSTRACT
MOTIVATION: Knowledge graphs (KGs) are being adopted in industry, commerce and academia. Biomedical KGs present a challenge due to the complexity, size and heterogeneity of the underlying information. RESULTS: In this work, we present the Scalable Precision Medicine Open Knowledge Engine (SPOKE), a biomedical KG connecting millions of concepts via semantically meaningful relationships. SPOKE contains 27 million nodes of 21 different types and 53 million edges of 55 types downloaded from 41 databases. The graph is built on the framework of 11 ontologies that maintain its structure, enable mappings and facilitate navigation. SPOKE is built weekly by Python scripts which download each resource, check for integrity and completeness, and then create a 'parent table' of nodes and edges. Graph queries are translated by a REST API and users can submit searches directly via an API or a graphical user interface. CONCLUSIONS/SIGNIFICANCE: SPOKE enables the integration of seemingly disparate information to support precision medicine efforts. AVAILABILITY AND IMPLEMENTATION: The SPOKE neighborhood explorer is available at https://spoke.rbvi.ucsf.edu. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
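The build-and-query flow described above (a table of typed nodes and edges, an integrity check, then neighborhood lookups) can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the identifiers, the node and edge types, and the function names are hypothetical and are not SPOKE's actual schema, scripts or API.

```python
from collections import defaultdict

# Hypothetical node and edge tables; identifiers and types are illustrative only.
NODES = [
    ("gene:G1", "Gene"),
    ("disease:D1", "Disease"),
    ("compound:C1", "Compound"),
]
EDGES = [
    ("gene:G1", "ASSOCIATES", "disease:D1"),
    ("compound:C1", "TREATS", "disease:D1"),
]


def check_integrity(nodes, edges):
    """Return the edges whose endpoints are missing from the node table."""
    known = {node_id for node_id, _ in nodes}
    return [e for e in edges if e[0] not in known or e[2] not in known]


def build_graph(edges):
    """Build an undirected adjacency map: node -> list of (edge type, neighbor)."""
    adjacency = defaultdict(list)
    for source, edge_type, target in edges:
        adjacency[source].append((edge_type, target))
        adjacency[target].append((edge_type, source))
    return dict(adjacency)


def neighborhood(graph, node_id, edge_types=None):
    """Return the sorted neighbors of node_id, optionally filtered by edge type."""
    return sorted(
        neighbor
        for edge_type, neighbor in graph.get(node_id, [])
        if edge_types is None or edge_type in edge_types
    )


GRAPH = build_graph(EDGES)
print(check_integrity(NODES, EDGES))
print(neighborhood(GRAPH, "disease:D1"))
```

Filtering by edge type, for example `neighborhood(GRAPH, "disease:D1", {"TREATS"})`, restricts the neighborhood to one kind of relationship.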
Subjects
Pattern Recognition, Automated , Precision Medicine , Databases, Factual
ABSTRACT
In vivo chromosomal behavior is dictated by the organization of genomic DNA at length scales ranging from nanometers to microns. At these disparate scales, the DNA conformation is influenced by a range of proteins that package, twist and disentangle the DNA double helix, leading to a complex hierarchical structure that remains undetermined. Thus, there is a critical need for methods of structural characterization of DNA that can accommodate complex environmental conditions over biologically relevant length scales. Based on multiscale molecular simulations, we report on the possibility of measuring supercoiling in complex environments using angular correlations of scattered X-rays resulting from X-ray free electron laser (xFEL) experiments. We recently demonstrated the observation of structural detail for solutions of randomly oriented metallic nanoparticles [D. Mendez et al., Philos. Trans. R. Soc. B360 (2014) 20130315]. Here, we argue, based on simulations, that correlated X-ray scattering (CXS) has the potential for measuring the distribution of DNA folds in complex environments, on the scale of a few persistence lengths.
ABSTRACT
Objectives: Clinical notes are a veritable treasure trove of information on a patient's disease progression, medical history, and treatment plans, yet are locked in secured databases accessible for research only after extensive ethics review. Removing personally identifying and protected health information (PII/PHI) from the records can reduce the need for additional Institutional Review Board (IRB) reviews. In this project, our goals were to: (1) develop a robust and scalable clinical text de-identification pipeline that is compliant with the Health Insurance Portability and Accountability Act (HIPAA) Privacy Rule for de-identification standards and (2) share routinely updated de-identified clinical notes with researchers. Materials and Methods: Building on our open-source de-identification software called Philter, we added features to: (1) make the algorithm and the de-identified data HIPAA compliant, which also implies type 2 error-free redaction, as certified via external audit; (2) reduce over-redaction errors; and (3) normalize and shift date PHI. We also established a streamlined de-identification pipeline using MongoDB to automatically extract clinical notes and provide truly de-identified notes to researchers with periodic monthly refreshes at our institution. Results: To the best of our knowledge, the Philter V1.0 pipeline is currently the first and only certified, de-identified redaction pipeline that makes clinical notes available to researchers for non-human subjects research, without further IRB approval needed. To date, we have made over 130 million certified de-identified clinical notes available to over 600 UCSF researchers. These notes were collected over the past 40 years, and represent data from 2,757,016 UCSF patients.
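The "normalize and shift date PHI" step can be illustrated with a minimal sketch. This is not Philter's implementation: the MM/DD/YYYY pattern, the ISO output format, the fixed offset and the function name `shift_dates` are all assumptions chosen for the example.

```python
import re
from datetime import datetime, timedelta

# Assumed input format: MM/DD/YYYY with one- or two-digit month and day.
DATE_PATTERN = re.compile(r"\b(\d{1,2})/(\d{1,2})/(\d{4})\b")


def shift_dates(text, offset_days):
    """Normalize MM/DD/YYYY dates to ISO format and shift them by offset_days."""

    def replace(match):
        month, day, year = (int(group) for group in match.groups())
        try:
            original = datetime(year, month, day)
        except ValueError:
            # Not a valid calendar date: redact rather than leak the raw string.
            return "*" * len(match.group(0))
        return (original + timedelta(days=offset_days)).strftime("%Y-%m-%d")

    return DATE_PATTERN.sub(replace, text)


note = "Seen on 03/15/2019; follow-up on 4/2/2019."
print(shift_dates(note, -30))
```

Applying one constant offset to every date in a patient's notes hides the true dates while preserving the intervals between events, which is usually what longitudinal analyses need.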
ABSTRACT
Protein structures often show similarities to one another that would not be seen at the sequence level. Given the coordinates of a protein chain, the SALAMI server at www.zbh.uni-hamburg.de/salami will search the protein data bank and return a set of similar structures without using sequence information. The results page lists the related proteins, details of the sequence and structure similarity and implied sequence alignments. Via a simple structure viewer, one can view superpositions of query and library structures and finally download superimposed coordinates. The alignment method is very tolerant of large gaps and insertions, and tends to produce slightly longer alignments than other similar programs.
Subjects
Software , Structural Homology, Protein , Amino Acid Sequence , Databases, Protein , Sequence Alignment , User-Computer Interface
ABSTRACT
OBJECTIVES: To evaluate whether different approaches in note text preparation (known as preprocessing) can impact machine learning model performance in the case of ICU mortality prediction. DESIGN: Clinical note text was used to build machine learning models for adults admitted to the ICU. Preprocessing strategies studied were none (raw text), cleaning text, stemming, term frequency-inverse document frequency vectorization, and creation of n-grams. Model performance was assessed by the area under the receiver operating characteristic curve. Models were trained and internally validated on University of California San Francisco data using 10-fold cross validation. These models were then externally validated on Beth Israel Deaconess Medical Center data. SETTING: ICUs at University of California San Francisco and Beth Israel Deaconess Medical Center. SUBJECTS: Ten thousand patients in the University of California San Francisco training and internal testing dataset and 27,058 patients in the external validation dataset, Beth Israel Deaconess Medical Center. INTERVENTIONS: None. MEASUREMENTS AND MAIN RESULTS: Mortality rate at Beth Israel Deaconess Medical Center and University of California San Francisco was 10.9% and 7.4%, respectively. Data are presented as area under the receiver operating characteristic curve (95% CI) for models validated at University of California San Francisco and area under the receiver operating characteristic curve for models validated at Beth Israel Deaconess Medical Center. Models built and trained on University of California San Francisco data for the prediction of in-hospital mortality improved from the raw note text model (AUROC, 0.84; CI, 0.80-0.89) to the term frequency-inverse document frequency model (AUROC, 0.89; CI, 0.85-0.94). When applying the models developed at University of California San Francisco to Beth Israel Deaconess Medical Center data, there was a similar increase in model performance from raw note text (area under the receiver operating characteristic curve at Beth Israel Deaconess Medical Center: 0.72) to the term frequency-inverse document frequency model (area under the receiver operating characteristic curve at Beth Israel Deaconess Medical Center: 0.83). CONCLUSIONS: Differences in preprocessing strategies for note text impacted model discrimination. Completing a preprocessing pathway including cleaning, stemming, and term frequency-inverse document frequency vectorization resulted in the preprocessing strategy with the greatest improvement in model performance. Further study is needed, with particular emphasis on how to manage author implicit bias present in note text, before natural language processing algorithms are implemented in the clinical setting.
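The preprocessing pathway named above (cleaning, stemming, n-gram creation and term frequency-inverse document frequency weighting) can be sketched in a few lines. This is an illustration under stated assumptions, not the study's code: the suffix-stripping `stem` function is a crude stand-in for a real stemmer, the weighting is the plain tf * log(N / df) variant, the two example notes are invented, and model fitting and AUROC evaluation are omitted.

```python
import math
import re
from collections import Counter


def clean(text):
    """Lowercase the text and keep alphabetic tokens only."""
    return re.findall(r"[a-z]+", text.lower())


def stem(token):
    """Crude suffix stripping; a stand-in for a real stemmer such as Porter's."""
    for suffix in ("ing", "ed", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def ngrams(tokens, n_max=2):
    """Return all n-grams of length 1..n_max as space-joined strings."""
    return [
        " ".join(tokens[i : i + n])
        for n in range(1, n_max + 1)
        for i in range(len(tokens) - n + 1)
    ]


def tfidf(documents, n_max=2):
    """Return one {term: weight} dict per document using tf * log(N / df)."""
    term_lists = [ngrams([stem(t) for t in clean(doc)], n_max) for doc in documents]
    doc_freq = Counter(term for terms in term_lists for term in set(terms))
    n_docs = len(documents)
    vectors = []
    for terms in term_lists:
        counts = Counter(terms)
        total = len(terms) or 1
        vectors.append(
            {t: (c / total) * math.log(n_docs / doc_freq[t]) for t, c in counts.items()}
        )
    return vectors


# Invented example notes, not taken from any dataset.
notes = [
    "Patient intubated, worsening hypoxia.",
    "Patient extubated and breathing comfortably.",
]
vectors = tfidf(notes)
print(sorted(vectors[0], key=vectors[0].get, reverse=True)[:3])
```

A term that appears in every note, such as "patient" here, receives zero weight, which is how the weighting de-emphasizes uninformative vocabulary relative to raw counts.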
ABSTRACT
There is a great and growing need to ascertain what exactly is the state of a patient, in terms of disease progression, actual care practices, pathology, adverse events, and much more, beyond the paucity of data available in structured medical record data. Ascertaining these harder-to-reach data elements is now critical for the accurate phenotyping of complex traits, detection of adverse outcomes, efficacy of off-label drug use, and longitudinal patient surveillance. Clinical notes often contain the most detailed and relevant digital information about individual patients, the nuances of their diseases, the treatment strategies selected by physicians, and the resulting outcomes. However, notes remain largely unused for research because they contain Protected Health Information (PHI), which is synonymous with individually identifying data. Previous clinical note de-identification approaches have been rigid and still too inaccurate to see any substantial real-world use, primarily because they have been trained on medical text corpora that were too small. To build a new de-identification tool, we created the largest manually annotated clinical note corpus for PHI and developed a customizable open-source de-identification software called Philter ("Protected Health Information filter"). Here we describe the design and evaluation of Philter, and show how it offers substantial real-world improvements over prior methods.
ABSTRACT
Epigenetic landscapes can shape physiologic and disease phenotypes. We used integrative, high resolution multi-omics methods to delineate the methylome landscape and characterize the oncogenic drivers of esophageal squamous cell carcinoma (ESCC). We found 98% of CpGs are hypomethylated across the ESCC genome. Hypomethylated regions are enriched in areas with heterochromatin binding markers (H3K9me3, H3K27me3), while hypermethylated regions are enriched in polycomb repressive complex (EZH2/SUZ12) recognizing regions. Altered methylation in promoters, enhancers, and gene bodies, as well as in polycomb repressive complex occupancy and CTCF binding sites, is associated with cancer-specific gene dysregulation. Epigenetic-mediated activation of non-canonical WNT/β-catenin/MMP signaling and a YY1/lncRNA ESCCAL-1/ribosomal protein network are uncovered and validated as potential novel ESCC driver alterations. This study advances our understanding of how epigenetic landscapes shape cancer pathogenesis and provides a resource for biomarker and target discovery.
Subjects
Biomarkers, Tumor/genetics , Epigenesis, Genetic , Esophageal Neoplasms/genetics , Esophageal Squamous Cell Carcinoma/genetics , Gene Expression Regulation, Neoplastic , Aged , Cell Line, Tumor , Chromatin Immunoprecipitation Sequencing , Cohort Studies , CpG Islands , DNA Methylation , Datasets as Topic , Esophageal Neoplasms/pathology , Esophageal Neoplasms/surgery , Esophageal Squamous Cell Carcinoma/pathology , Esophageal Squamous Cell Carcinoma/surgery , Esophagectomy , Esophagus/pathology , Esophagus/surgery , Female , Genomics , Heterochromatin/metabolism , Histones/genetics , Histones/metabolism , Humans , Male , Middle Aged , Promoter Regions, Genetic/genetics , Proteomics , RNA-Seq , Whole Genome Sequencing
ABSTRACT
The iron-regulated protein FrpD from Neisseria meningitidis is an outer membrane lipoprotein that interacts with very high affinity (Kd ~ 0.2 nM) with the N-terminal domain of FrpC, a Type I-secreted protein from the Repeat in ToXin (RTX) protein family. In the presence of Ca2+, FrpC undergoes Ca2+-dependent protein trans-splicing that includes an autocatalytic cleavage of the Asp414-Pro415 peptide bond and formation of an Asp414-Lys isopeptide bond. Here, we report the high-resolution structure of FrpD and describe the structure-function relationships underlying the interaction between FrpD and FrpC1-414. We identified FrpD residues involved in FrpC1-414 binding, which enabled localization of FrpD within the low-resolution SAXS model of the FrpD-FrpC1-414 complex. Moreover, the trans-splicing activity of FrpC resulted in covalent linkage of the FrpC1-414 fragment to plasma membrane proteins of epithelial cells in vitro, suggesting that formation of the FrpD-FrpC1-414 complex may be involved in the interaction of meningococci with the host cell surface.
Subjects
Bacterial Proteins/chemistry , Iron-Binding Proteins/chemistry , Membrane Proteins/chemistry , Neisseria meningitidis/chemistry , Amino Acid Sequence/genetics , Bacterial Outer Membrane Proteins/metabolism , Bacterial Proteins/genetics , Cell Adhesion/genetics , Humans , Iron/chemistry , Iron/metabolism , Iron-Binding Proteins/metabolism , Lipoproteins/chemistry , Lipoproteins/metabolism , Membrane Proteins/genetics , Neisseria meningitidis/genetics , Periplasmic Binding Proteins/chemistry , Periplasmic Binding Proteins/metabolism , X-Ray Diffraction
ABSTRACT
During X-ray exposure of a molecular solution, photons scattered from the same molecule are correlated. If molecular motion is insignificant during exposure, then differences in momentum transfer between correlated photons are direct measurements of the molecular structure. In conventional small- and wide-angle solution scattering, photon correlations are ignored. This report presents advances in a new biomolecular structural analysis technique, correlated X-ray scattering (CXS), which uses angular intensity correlations to recover hidden structural details from molecules in solution. Due to its intense rapid pulses, an X-ray free electron laser (XFEL) is an excellent tool for CXS experiments. A protocol is outlined for analysis of a CXS data set comprising a total of half a million X-ray exposures of solutions of small gold nanoparticles recorded at the SPring-8 Ångström Compact XFEL facility (SACLA). From the scattered intensities and their correlations, two populations of nanoparticle domains within the solution are distinguished: small twinned, and large probably non-twinned domains. It is shown analytically how, in a solution measurement, twinning information is only accessible via intensity correlations, demonstrating how CXS reveals atomic-level information from a disordered solution of like molecules.
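The idea of an angular intensity correlation can be illustrated with a small sketch. It is a deliberate simplification and not the protocol from the paper: it autocorrelates intensities around a single scattering ring, averages over exposures, and uses a synthetic ring with four-fold symmetry. The function names and the test signal are assumptions made for the example.

```python
import math


def angular_autocorrelation(ring):
    """Autocorrelate mean-subtracted intensities around one scattering ring.

    ring[i] is the intensity at azimuthal angle 2*pi*i/len(ring); the value at
    index d is the average over i of dI[i] * dI[(i + d) % n].
    """
    n = len(ring)
    mean = sum(ring) / n
    delta = [value - mean for value in ring]
    return [
        sum(delta[i] * delta[(i + d) % n] for i in range(n)) / n
        for d in range(n)
    ]


def averaged_correlation(exposures):
    """Average the per-exposure autocorrelations over many exposures."""
    n = len(exposures[0])
    totals = [0.0] * n
    for ring in exposures:
        for d, value in enumerate(angular_autocorrelation(ring)):
            totals[d] += value
    return [total / len(exposures) for total in totals]


# Synthetic ring with four-fold symmetry: I(phi) = 1 + cos(4 * phi).
N_BINS = 16
ring = [1.0 + math.cos(4 * 2 * math.pi * i / N_BINS) for i in range(N_BINS)]
correlation = angular_autocorrelation(ring)
print([round(value, 3) for value in correlation[:5]])
```

The azimuthally averaged intensity of this ring is flat, so a conventional solution-scattering profile would not show the four-fold modulation, whereas the correlation peaks again at a quarter turn (index 4 of 16 bins).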
ABSTRACT
Netrin-1 is a guidance cue that can trigger either attraction or repulsion effects on migrating axons of neurons, depending on the repertoire of receptors available on the growth cone. How a single chemotropic molecule can act in such contradictory ways has long been a puzzle at the molecular level. Here we present the crystal structure of netrin-1 in complex with the Deleted in Colorectal Cancer (DCC) receptor. We show that one netrin-1 molecule can simultaneously bind to two DCC molecules through a DCC-specific site and through a unique generic receptor binding site, where sulfate ions staple together positively charged patches on both DCC and netrin-1. Furthermore, we demonstrate that UNC5A can replace DCC on the generic receptor binding site to switch the response from attraction to repulsion. We propose that the modularity of binding allows for the association of other netrin receptors at the generic binding site, eliciting alternative turning responses.
Subjects
Axons/physiology , Chemotaxis , Nerve Growth Factors/chemistry , Nerve Growth Factors/metabolism , Receptors, Cell Surface/metabolism , Tumor Suppressor Proteins/chemistry , Tumor Suppressor Proteins/metabolism , Binding Sites , Cells, Cultured , Crystallography, X-Ray , Cues , DCC Receptor , Evolution, Molecular , Models, Molecular , Netrin Receptors , Netrin-1 , Protein Binding , Receptors, Cell Surface/chemistry
ABSTRACT
BACKGROUND: Protein structure alignments are usually based on techniques very different from those used for sequence alignments. We propose a method which treats sequence, structure and even combined sequence + structure in a single framework. Using a probabilistic approach, we calculate a similarity measure which can be applied to fragments containing only protein sequence, structure or both simultaneously. RESULTS: Proof-of-concept results are given for the different problems. For sequence alignments, the methodology is no better than conventional methods. For structure alignments, the techniques are very fast, reliable and tolerant of a range of alignment parameters. Combined sequence and structure alignments may provide a more reliable alignment for pairs of proteins where pure structural alignments can be misled by repetitive elements or apparent symmetries. CONCLUSION: The probabilistic framework has an elegance in principle, merging sequence and structure descriptors into a single framework. It has a practical use in fast structural alignments and a potential use in finding those examples where sequence and structural similarities apparently disagree.
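One way a single fragment-level similarity measure could drive an alignment is sketched below. The abstract does not specify the measure, so this is an assumption-laden illustration rather than the authors' method: each fragment is represented by a hypothetical vector of class-membership probabilities, similarity is taken as the dot product of two such vectors, and a standard Smith-Waterman local alignment is run over those scores. The `gap` and `offset` values are arbitrary.

```python
def similarity(p, q):
    """Dot product of two class-probability vectors (assumed representation)."""
    return sum(a * b for a, b in zip(p, q))


def local_align(seq_a, seq_b, gap=0.5, offset=0.5):
    """Smith-Waterman local alignment over fragment probability vectors.

    Returns (best score, list of aligned (index in seq_a, index in seq_b) pairs).
    The offset is subtracted from each similarity so that poor matches score
    below zero.
    """
    rows, cols = len(seq_a) + 1, len(seq_b) + 1
    score = [[0.0] * cols for _ in range(rows)]
    best, best_cell = 0.0, (0, 0)
    for i in range(1, rows):
        for j in range(1, cols):
            match = score[i - 1][j - 1] + similarity(seq_a[i - 1], seq_b[j - 1]) - offset
            score[i][j] = max(0.0, match, score[i - 1][j] - gap, score[i][j - 1] - gap)
            if score[i][j] > best:
                best, best_cell = score[i][j], (i, j)
    # Trace back from the best cell until the score drops to zero.
    pairs = []
    i, j = best_cell
    while i > 0 and j > 0 and score[i][j] > 0:
        match = score[i - 1][j - 1] + similarity(seq_a[i - 1], seq_b[j - 1]) - offset
        if score[i][j] == match:
            pairs.append((i - 1, j - 1))
            i, j = i - 1, j - 1
        elif score[i][j] == score[i - 1][j] - gap:
            i -= 1
        else:
            j -= 1
    return best, pairs[::-1]


# Two hypothetical fragment classes; each fragment belongs wholly to one class.
H = (1.0, 0.0)
E = (0.0, 1.0)
best_score, aligned = local_align([H, H, E], [E, H, H, E])
print(best_score, aligned)
```

Because the alignment only ever sees probability vectors, the same routine would apply whether those vectors were derived from sequence, structure or both, which is the appeal of a single framework.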