Results 1 - 20 of 75
2.
Clin Transl Sci ; 15(8): 1848-1855, 2022 08.
Article in English | MEDLINE | ID: mdl-36125173

ABSTRACT

Within clinical, biomedical, and translational science, an increasing number of projects are adopting graphs for knowledge representation. Graph-based data models elucidate the interconnectedness among core biomedical concepts, enable data structures to be easily updated, and support intuitive queries, visualizations, and inference algorithms. However, knowledge discovery across these "knowledge graphs" (KGs) has remained difficult. Data set heterogeneity and complexity; the proliferation of ad hoc data formats; poor compliance with guidelines on findability, accessibility, interoperability, and reusability; and, in particular, the lack of a universally accepted, open-access model for standardization across biomedical KGs have left the task of reconciling data sources to downstream consumers. Biolink Model is an open-source data model that can be used to formalize the relationships between data structures in translational science. It incorporates object-oriented classification and graph-oriented features. The core of the model is a set of hierarchical, interconnected classes (or categories) and relationships between them (or predicates) representing biomedical entities such as gene, disease, chemical, anatomic structure, and phenotype. The model provides class and edge attributes and associations that guide how entities should relate to one another. Here, we highlight the need for a standardized data model for KGs, describe Biolink Model, and compare it with other models. We demonstrate the utility of Biolink Model in various initiatives, including the Biomedical Data Translator Consortium and the Monarch Initiative, and show how it has supported easier integration and interoperability of biomedical KGs, bringing together knowledge from multiple sources and helping to realize the goals of translational science.


Subject(s)
Pattern Recognition, Automated , Translational Science, Biomedical , Knowledge
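
The idea in the abstract above, hierarchical categories plus predicates that constrain how entities may relate, can be sketched in a few lines of Python. This is a minimal illustration only: the category names, the `PARENT` hierarchy, and the `PREDICATE_RULES` table are hypothetical and greatly simplified, and they are not taken from the actual Biolink Model definition.

```python
from dataclasses import dataclass

# Hypothetical, simplified category hierarchy (child -> parent).
# Names imitate the style of a biomedical data model but are illustrative only.
PARENT = {
    "Gene": "BiologicalEntity",
    "Disease": "BiologicalEntity",
    "BiologicalEntity": "NamedThing",
    "ChemicalEntity": "NamedThing",
    "NamedThing": None,
}

# Hypothetical predicate constraints:
# predicate -> (required subject category, required object category).
PREDICATE_RULES = {
    "treats": ("ChemicalEntity", "Disease"),
    "related_to": ("NamedThing", "NamedThing"),
}


def is_a(category, ancestor):
    """Return True if `category` equals `ancestor` or descends from it."""
    while category is not None:
        if category == ancestor:
            return True
        category = PARENT.get(category)
    return False


@dataclass(frozen=True)
class Edge:
    subject_category: str
    predicate: str
    object_category: str


def is_valid(edge):
    """Check an edge against the (hypothetical) predicate constraints."""
    rule = PREDICATE_RULES.get(edge.predicate)
    if rule is None:
        return False  # unknown predicate
    subject_ok = is_a(edge.subject_category, rule[0])
    object_ok = is_a(edge.object_category, rule[1])
    return subject_ok and object_ok


if __name__ == "__main__":
    print(is_valid(Edge("ChemicalEntity", "treats", "Disease")))
    print(is_valid(Edge("Gene", "treats", "Disease")))
```

Because constraints are checked through the hierarchy, a general predicate such as the hypothetical `related_to` accepts any pair of categories, while a specific one such as `treats` accepts only the categories it names or their descendants.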
3.
HGG Adv ; 3(3): 100123, 2022 Jul 14.
Article in English | MEDLINE | ID: mdl-35789587

ABSTRACT

The 1000 Genomes Project (TGP) is a foundational resource that serves the biomedical community as a standard reference cohort for human genetic variation. There are now seven public versions of these genomes. The TGP Consortium produced the first by mapping its final data release against human reference sequence GRCh37, then "lifted over" these genomes to the improved reference sequence (GRCh38) when it was released, and remapped the original data to GRCh38 with two similar pipelines. As best-practice quality validation, the pipelines that generated these versions were benchmarked against the Genome In A Bottle Consortium's "platinum quality" genome (NA12878). The New York Genome Center recently released the results of independently resequencing the cohort at greater depth (30×), a phased version informed by the inclusion of related individuals, and independently remapped the original variant calls to GRCh38. We performed a cross-comparison evaluation of all seven versions using genome fingerprinting, which supports ultrafast genome comparison even across reference versions. We noted multiple issues, including discrepancies in cohort membership, disagreement on the overall level of variation, evidence of substandard pipeline performance on specific genomes and in specific regions of the genome, cryptic relationships between individuals, inconsistent phasing, and annotation distortions caused by the history of the reference genome itself. We therefore recommend global quality assessment by rapid genome comparisons, alongside benchmarking as part of best-practice quality assessment of large genome datasets. Our observations also help inform the decision of which version to use, to support analyses by individual researchers.

4.
Clin Transl Sci ; 2022 May 25.
Article in English | MEDLINE | ID: mdl-35611543

ABSTRACT

Clinical, biomedical, and translational science has reached an inflection point in the breadth and diversity of available data and the potential impact of such data to improve human health and well-being. However, the data are often siloed, disorganized, and not broadly accessible due to discipline-specific differences in terminology and representation. To address these challenges, the Biomedical Data Translator Consortium has developed and tested a pilot knowledge graph-based "Translator" system capable of integrating existing biomedical data sets and "translating" those data into insights intended to augment human reasoning and accelerate translational science. Having demonstrated feasibility of the Translator system, the Translator program has since moved into development, and the Translator Consortium has made significant progress in the research, design, and implementation of an operational system. Herein, we describe the current system's architecture, performance, and quality of results. We apply Translator to several real-world use cases developed in collaboration with subject-matter experts. Finally, we discuss the scientific and technical features of Translator and compare those features to other state-of-the-art, biomedical graph-based question-answering systems.

7.
Nat Metab ; 3(2): 274-286, 2021 02.
Article in English | MEDLINE | ID: mdl-33619379

ABSTRACT

The gut microbiome has important effects on human health, yet its importance in human ageing remains unclear. In the present study, we demonstrate that, starting in mid-to-late adulthood, gut microbiomes become increasingly unique to individuals with age. We leverage three independent cohorts comprising over 9,000 individuals and find that compositional uniqueness is strongly associated with microbially produced amino acid derivatives circulating in the bloodstream. In older age (over ~80 years), healthy individuals show continued microbial drift towards a unique compositional state, whereas this drift is absent in less healthy individuals. The identified microbiome pattern of healthy ageing is characterized by a depletion of core genera found across most humans, primarily Bacteroides. Retaining a high Bacteroides dominance into older age, or having a low gut microbiome uniqueness measure, predicts decreased survival in a 4-year follow-up. Our analysis identifies increasing compositional uniqueness of the gut microbiome as a component of healthy ageing, which is characterized by distinct microbial metabolic outputs in the blood.


Subject(s)
Gastrointestinal Microbiome/physiology , Healthy Aging/physiology , Adolescent , Adult , Aged , Aged, 80 and over , Amino Acids/blood , Bacteroides/metabolism , Cohort Studies , Female , Humans , Life Style , Male , Metabolomics , Middle Aged , Predictive Value of Tests , Survival Analysis , Young Adult
8.
Cell Rep ; 32(7): 108029, 2020 08 18.
Article in English | MEDLINE | ID: mdl-32814038

ABSTRACT

Characterizing the tissue-specific binding sites of transcription factors (TFs) is essential to reconstruct gene regulatory networks and predict functions for non-coding genetic variation. DNase-seq footprinting enables the prediction of genome-wide binding sites for hundreds of TFs simultaneously. Despite the public availability of high-quality DNase-seq data from hundreds of samples, a comprehensive, up-to-date resource for the locations of genomic footprints is lacking. Here, we develop a scalable footprinting workflow using two state-of-the-art algorithms: Wellington and HINT. We apply our workflow to detect footprints in 192 ENCODE DNase-seq experiments and predict the genomic occupancy of 1,515 human TFs in 27 human tissues. We validate that these footprints overlap true-positive TF binding sites from ChIP-seq. We demonstrate that the locations, depth, and tissue specificity of footprints predict effects of genetic variants on gene expression and capture a substantial proportion of genetic risk for complex traits.


Subject(s)
Binding Sites/genetics , Deoxyribonucleases/metabolism , Genomics/methods , Transcription Factors/metabolism , Humans
9.
J Proteome Res ; 19(1): 346-359, 2020 01 03.
Article in English | MEDLINE | ID: mdl-31618575

ABSTRACT

Lyme disease results from infection of humans with the spirochete Borrelia burgdorferi. The first and most common clinical manifestation is the circular, inflamed skin lesion referred to as erythema migrans; later manifestations result from infections of other body sites. Laboratory diagnosis of Lyme disease can be challenging in patients with erythema migrans because of the time delay in the development of specific diagnostic antibodies against Borrelia. Reliable blood biomarkers for the early diagnosis of Lyme disease in patients with erythema migrans are needed. Here, we performed selected reaction monitoring, a targeted mass spectrometry-based approach, to measure selected proteins that (1) are known to be predominantly expressed in one organ (i.e., organ-specific blood proteins) and whose blood concentrations may change as a result of Lyme disease, or (2) are involved in acute immune responses. In a longitudinal cohort of 40 Lyme disease patients and 20 healthy controls, we identified 10 proteins with significantly altered serum levels in patients at the time of diagnosis, and we also developed a 10-protein panel identified through multivariate analysis. In an independent cohort of patients with erythema migrans, six of these proteins, APOA4, C9, CRP, CST6, PGLYRP2, and S100A9, were confirmed to show significantly altered serum levels in patients at time of presentation. Nine of the 10 proteins from the multivariate panel were also verified in the second cohort. These proteins, primarily innate immune response proteins or proteins specific to liver, skin, or white blood cells, may serve as candidate blood biomarkers requiring further validation to aid in the laboratory diagnosis of early Lyme disease.


Subject(s)
Acute-Phase Proteins/analysis , Lyme Disease/blood , Adult , Aged , Biomarkers/blood , Blotting, Western , Case-Control Studies , Erythema Chronicum Migrans/blood , Erythema Chronicum Migrans/etiology , Female , Humans , Immunity, Innate , Lyme Disease/drug therapy , Lyme Disease/etiology , Lyme Disease/immunology , Male , Middle Aged , Multivariate Analysis , Organ Specificity
10.
J Biomed Inform ; 100: 103325, 2019 12.
Article in English | MEDLINE | ID: mdl-31676459

ABSTRACT

This special communication describes activities, products, and lessons learned from a recent hackathon that was funded by the National Center for Advancing Translational Sciences via the Biomedical Data Translator program ('Translator'). Specifically, Translator team members self-organized and worked together to conceptualize and execute, over a five-day period, a multi-institutional clinical research study that aimed to examine, using open clinical data sources, relationships between sex, obesity, diabetes, and exposure to airborne fine particulate matter among patients with severe asthma. The goal was to develop a proof of concept that this new model of collaboration and data sharing could effectively produce meaningful scientific results and generate new scientific hypotheses. Three Translator Clinical Knowledge Sources, each of which provides open access (via Application Programming Interfaces) to data derived from the electronic health record systems of major academic institutions, served as the source of study data. Jupyter Python notebooks, shared in GitHub repositories, were used to call the knowledge sources and analyze and integrate the results. The results replicated established or suspected relationships between sex, obesity, diabetes, exposure to airborne fine particulate matter, and severe asthma. In addition, the results demonstrated specific differences across the three Translator Clinical Knowledge Sources, suggesting cohort- and/or environment-specific factors related to the services themselves or the catchment area from which each service derives patient data. Collectively, this special communication demonstrates the power and utility of intense, team-oriented hackathons and offers general technical, organizational, and scientific lessons learned.


Subject(s)
Asthma/physiopathology , Diabetes Mellitus/physiopathology , Environmental Exposure , Information Storage and Retrieval , Obesity/physiopathology , Particulate Matter/toxicity , Sex Factors , Asthma/complications , Female , Humans , Male , Obesity/complications , Severity of Illness Index
11.
Genes (Basel) ; 10(9), 2019 08 25.
Article in English | MEDLINE | ID: mdl-31450660

ABSTRACT

The 2019 "Personal Genomes: Accessing, Sharing and Interpretation" conference (Hinxton, UK, 11-12 April 2019) brought together geneticists, bioinformaticians, clinicians and ethicists to promote openness and ethical sharing of personal genome data while protecting the privacy of individuals. The talks at the conference focused on two main topic areas: (1) Technologies and Applications, with emphasis on personal genomics in the context of healthcare. The issues discussed ranged from new technologies impacting and enabling the field, to the interpretation of personal genomes and their integration with other data types. There was particular emphasis and wide discussion on the use of polygenic risk scores to inform precision medicine. (2) Ethical, Legal, and Social Implications, with emphasis on genetic privacy: How to maintain it, how much privacy is possible, and how much privacy do people want? Talks covered the full range of genomic data visibility, from open access to tight control, and diverse aspects of balancing benefits and risks, data ownership, working with individuals and with populations, and promoting citizen science. Both topic areas were illustrated and informed by reports from a wide variety of ongoing projects, which highlighted the need to diversify global databases by increasing representation of understudied populations.


Subject(s)
Genetic Privacy/standards , Genome, Human , Genetic Privacy/ethics , Genetic Privacy/legislation & jurisprudence , Humans , Information Dissemination
12.
Front Genet ; 10: 400, 2019.
Article in English | MEDLINE | ID: mdl-31114611

ABSTRACT

Data normalization is a crucial step in gene expression analysis, as it ensures the validity of downstream analyses. Although many metrics have been designed to evaluate the existing normalization methods, different metrics, or different datasets evaluated by the same metric, yield inconsistent results, particularly for single-cell RNA sequencing (scRNA-seq) data. In the worst case, one method evaluated as the best by one metric is evaluated as the poorest by another metric, or one method evaluated as the best using one dataset is evaluated as the poorest using another dataset. This leaves an open problem: principles need to be established to guide the evaluation of normalization methods. In this study, we propose a principle that one normalization method evaluated as the best by one metric should also be evaluated as the best by another metric (the consistency of metrics) and one method evaluated as the best using scRNA-seq data should also be evaluated as the best using bulk RNA-seq data or microarray data (the consistency of datasets). Then, we designed a new metric named Area Under normalized CV threshold Curve (AUCVC) and applied it with another metric, mSCC, to evaluate 14 commonly used normalization methods using both scRNA-seq data and bulk RNA-seq data, satisfying the consistency of metrics and the consistency of datasets. Our findings paved the way to guide future studies in the normalization of gene expression data and its evaluation. The raw gene expression data, normalization methods, and evaluation metrics used in this study have been included in an R package named NormExpression. NormExpression provides a framework and a fast and simple way for researchers to select the best method for the normalization of their gene expression data based on the evaluation of different methods (particularly some data-driven methods or their own methods) under the principle of the consistency of metrics and the consistency of datasets.
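
The "consistency of metrics" principle stated in the abstract above can be expressed as a simple check: every metric should select the same normalization method as the best one. The sketch below is an illustration under assumptions, not the NormExpression implementation: the method names and scores are hypothetical, and it assumes that a higher score is better for every metric.

```python
def best_method(scores, higher_is_better=True):
    """Return the method name with the best score in a {method: score} mapping."""
    pick = max if higher_is_better else min
    return pick(scores, key=scores.get)


def metrics_consistent(scores_by_metric):
    """True if every metric selects the same method as the best one.

    `scores_by_metric` maps a metric name to a {method: score} mapping.
    Assumption: higher scores are better for all metrics.
    """
    winners = {best_method(scores) for scores in scores_by_metric.values()}
    return len(winners) == 1


if __name__ == "__main__":
    # Hypothetical scores for three hypothetical normalization methods.
    scores = {
        "metric_1": {"method_A": 0.91, "method_B": 0.74, "method_C": 0.60},
        "metric_2": {"method_A": 0.83, "method_B": 0.79, "method_C": 0.55},
    }
    print(metrics_consistent(scores))
```

The same function covers the "consistency of datasets" case if the outer keys are dataset names instead of metric names.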

14.
PLoS One ; 14(4): e0213013, 2019.
Article in English | MEDLINE | ID: mdl-30973881

ABSTRACT

Big biomedical data create exciting opportunities for discovery, but make it difficult to capture analyses and outputs in forms that are findable, accessible, interoperable, and reusable (FAIR). In response, we describe tools that make it easy to capture, and assign identifiers to, data and code throughout the data lifecycle. We illustrate the use of these tools via a case study involving a multi-step analysis that creates an atlas of putative transcription factor binding sites from terabytes of ENCODE DNase I hypersensitive sites sequencing data. We show how the tools automate routine but complex tasks, capture analysis algorithms in understandable and reusable forms, and harness fast networks and powerful cloud computers to process data rapidly, all without sacrificing usability or reproducibility, thus ensuring that big data are not hard-to-(re)use data. We evaluate our approach via a user study, and show that 91% of participants were able to replicate a complex analysis involving considerable data volumes.


Subject(s)
Big Data , Data Science/statistics & numerical data , Databases, Factual/statistics & numerical data , Algorithms , Humans , Information Dissemination , Longitudinal Studies , Software
15.
Proc Natl Acad Sci U S A ; 116(12): 5819-5827, 2019 03 19.
Article in English | MEDLINE | ID: mdl-30833390

ABSTRACT

Preterm birth (PTB) complications are the leading cause of long-term morbidity and mortality in children. By using whole blood samples, we integrated whole-genome sequencing (WGS), RNA sequencing (RNA-seq), and DNA methylation data for 270 PTB and 521 control families. We analyzed this combined dataset to identify genomic variants associated with PTB and secondary analyses to identify variants associated with very early PTB (VEPTB) as well as other subcategories of disease that may contribute to PTB. We identified differentially expressed genes (DEGs) and methylated genomic loci and performed expression and methylation quantitative trait loci analyses to link genomic variants to these expression and methylation changes. We performed enrichment tests to identify overlaps between new and known PTB candidate gene systems. We identified 160 significant genomic variants associated with PTB-related phenotypes. The most significant variants, DEGs, and differentially methylated loci were associated with VEPTB. Integration of all data types identified a set of 72 candidate biomarker genes for VEPTB, encompassing genes and those previously associated with PTB. Notably, PTB-associated genes RAB31 and RBPJ were identified by all three data types (WGS, RNA-seq, and methylation). Pathways associated with VEPTB include EGFR and prolactin signaling pathways, inflammation- and immunity-related pathways, chemokine signaling, IFN-γ signaling, and Notch1 signaling. Progress in identifying molecular components of a complex disease is aided by integrated analyses of multiple molecular data types and clinical data. With these data, and by stratifying PTB by subphenotype, we have identified associations between VEPTB and the underlying biology.


Subject(s)
Genetic Predisposition to Disease/genetics , Premature Birth/genetics , DNA Methylation/genetics , Female , Genomics/methods , Humans , Infant, Newborn , Male , Phenotype , Polymorphism, Single Nucleotide/genetics , Signal Transduction/genetics , Whole Genome Sequencing/methods
16.
Genes (Basel) ; 9(10), 2018 Oct 04.
Article in English | MEDLINE | ID: mdl-30287784

ABSTRACT

Genetic testing has expanded out of the research laboratory into medical practice and the direct-to-consumer market. Rapid analysis of the resulting genotype data now has a significant impact. We present a method for summarizing personal genotypes as 'genotype fingerprints' that meets these needs. Genotype fingerprints can be derived from any single nucleotide polymorphism-based assay, and remain comparable as chip designs evolve to higher marker densities. We demonstrate that these fingerprints support distinguishing types of relationships among closely related individuals and closely related individuals from individuals from the same background population, as well as high-throughput identification of identical genotypes, individuals in known background populations, and de novo separation of subpopulations within a large cohort through extremely rapid comparisons. Although fingerprints do not preserve anonymity, they provide a useful degree of privacy by summarizing a genotype while preventing reconstruction of individual marker states. Genotype fingerprints are therefore well-suited as a format for public aggregation of genetic information to support ancestry and relatedness determination without revealing personal health risk status.

17.
Nat Genet ; 50(11): 1615, 2018 11.
Article in English | MEDLINE | ID: mdl-30291356

ABSTRACT

In the version of this article published, the P values for the enrichment of single mutation categories were inadvertently not corrected for multiple testing. After multiple-testing correction, only two of the six mutation categories mentioned are still statistically significant. To reflect this, the text "More specifically, paternally derived DNMs are enriched in transitions in A[.]G contexts, especially ACG>ATG and ATG>ACG (Bonferroni-corrected P = 1.3 × 10⁻² and P = 1 × 10⁻³, respectively). Additionally, we observed overrepresentation of ATA>ACA mutations (Bonferroni-corrected P = 4.28 × 10⁻²) for DNMs of paternal origin. Among maternally derived DNMs, CCA>CTA, GCA>GTA and TCT>TGT mutations were significantly overrepresented (Bonferroni-corrected P = 4 × 10⁻⁴, P = 5 × 10⁻⁴, P = 1 × 10⁻³, respectively)" should read "More specifically, CCA>CTA and GCA>GTA mutations were significantly overenriched on the maternal allele (Bonferroni-corrected P = 0.0192 and P = 0.048, respectively)." Additionally, the last sentence to the legend for Fig. 3b should read "Green boxes highlight the mutation categories that differ significantly" instead of "Green boxes highlight the mutation categories that differ more than 1% of mutation load with a bootstrapping P value <0.05." Corrected versions of Fig. 3b and Supplementary Table 25 appear with the Author Correction.
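
The correction above turns on Bonferroni adjustment, in which each raw P value is multiplied by the number of tests performed (capped at 1) before it is compared with the significance threshold. The following sketch illustrates that arithmetic only. The raw P values in the usage example are hypothetical and are not the values from the corrected article, and the threshold of 0.05 is an assumption.

```python
def bonferroni(p_values, alpha=0.05):
    """Bonferroni-adjust raw P values.

    Each raw P value is multiplied by the number of tests and capped at 1.0.
    Returns a list of (adjusted_p, is_significant) pairs in input order.
    """
    n_tests = len(p_values)
    adjusted = [min(1.0, p * n_tests) for p in p_values]
    return [(p_adj, p_adj < alpha) for p_adj in adjusted]


if __name__ == "__main__":
    # Hypothetical raw P values for three mutation categories.
    raw = [0.001, 0.02, 0.04]
    for p_raw, (p_adj, significant) in zip(raw, bonferroni(raw)):
        print(f"raw={p_raw:.3f} adjusted={p_adj:.3f} significant={significant}")
```

With more categories tested, the multiplier grows, which is why results that look significant before adjustment can stop being significant afterwards.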

19.
BMC Genomics ; 19(1): 528, 2018 Jul 11.
Article in English | MEDLINE | ID: mdl-29996771

ABSTRACT

BACKGROUND: Bacterial genomes have characteristic compositional skews, which are differences in nucleotide frequency between the leading and lagging DNA strands across a segment of a genome. It is thought that these strand asymmetries arise as a result of mutational biases and selective constraints, particularly for energy efficiency. Analysis of compositional skews in a diverse set of bacteria provides a comparative context in which mutational and selective environmental constraints can be studied. These analyses typically require finished and well-annotated genomic sequences. RESULTS: We present three novel metrics for examining genome composition skews; all three metrics can be computed for unfinished or partially-annotated genomes. The first two metrics (dot-skew and cross-skew) depend on sequence and gene annotation of a single genome, while the third metric (residual skew) highlights unusual genomes by subtracting a GC content-based model of a library of genome sequences. We applied these metrics to 7738 available bacterial genomes, including partial drafts, and identified outlier species. A phylogenetically diverse set of these outliers (i.e., Borrelia, Ehrlichia, Kinetoplastibacterium, and Phytoplasma) display similar skew patterns but share lifestyle characteristics, such as intracellularity and biosynthetic dependence on their hosts. CONCLUSIONS: Our novel metrics appear to reflect the effects of biosynthetic constraints and adaptations to life within one or more hosts on genome composition. We provide results for each analyzed genome, software and interactive visualizations at http://db.systemsbiology.net/gestalt/skew_metrics.


Subject(s)
Bacteria/genetics , Computational Biology/methods , Genome, Bacterial , Internet Access , Models, Genetic , User-Computer Interface
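
For readers unfamiliar with compositional skew, the sketch below computes the classic GC skew, (G − C) / (G + C), over fixed-size windows of a sequence. This is the textbook quantity only. It is not the dot-skew, cross-skew, or residual skew metric introduced in the abstract above, whose definitions are not given there. The function names and the example sequence are hypothetical.

```python
def gc_skew(seq):
    """Classic GC skew of a sequence: (G - C) / (G + C), or 0.0 if no G or C."""
    seq = seq.upper()
    g = seq.count("G")
    c = seq.count("C")
    total = g + c
    return 0.0 if total == 0 else (g - c) / total


def windowed_gc_skew(seq, window):
    """GC skew for consecutive non-overlapping windows of length `window`.

    A trailing fragment shorter than `window` is ignored.
    """
    return [
        gc_skew(seq[start:start + window])
        for start in range(0, len(seq) - window + 1, window)
    ]


if __name__ == "__main__":
    # Hypothetical toy sequence: G-rich first half, C-rich second half.
    print(windowed_gc_skew("GGGGCCCC", 4))
```

A sign change in the windowed values along a bacterial chromosome is the kind of strand asymmetry that skew analyses look for.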
20.
PLoS One ; 13(6): e0198135, 2018.
Article in English | MEDLINE | ID: mdl-29889842

ABSTRACT

Lyme disease is caused by spirochaetes of the Borrelia burgdorferi sensu lato genospecies. Complete genome assemblies are available for fewer than ten strains of Borrelia burgdorferi sensu stricto, the primary cause of Lyme disease in North America. MM1 is a sensu stricto strain originally isolated in the midwestern United States. Aside from a small number of genes, the complete genome sequence of this strain has not been reported. Here we present the complete genome sequence of MM1 in relation to other sensu stricto strains and in terms of its Multi Locus Sequence Typing. Our results indicate that MM1 is a new sequence type which contains a conserved main chromosome and 15 plasmids. Our results include the first contiguous 28.5 kb assembly of lp28-8, a linear plasmid carrying the vls antigenic variation system, from a Borrelia burgdorferi sensu stricto strain.


Subject(s)
Borrelia burgdorferi/genetics , DNA, Bacterial/analysis , High-Throughput Nucleotide Sequencing/methods , Animals , Bacterial Typing Techniques , Borrelia burgdorferi/classification , Borrelia burgdorferi Group/genetics , Chromosome Mapping , Comparative Genomic Hybridization , Genetic Variation , Genome, Bacterial , Humans , Lyme Disease/microbiology , Multilocus Sequence Typing