1.
Pac Symp Biocomput ; 29: 433-445, 2024.
Article in English | MEDLINE | ID: mdl-38160297

ABSTRACT

The incompleteness of race and ethnicity information in real-world data (RWD) hampers its utility in promoting healthcare equity. This study introduces two methods, one heuristic and the other machine learning-based, to impute race and ethnicity from genetic ancestry using tumor profiling data. Analyzing de-identified data from over 100,000 cancer patients sequenced with the Tempus xT panel, we demonstrate that both methods outperform existing geolocation and surname-based methods, with the machine learning approach achieving high recall (range: 0.859-0.993) and precision (range: 0.932-0.981) across four mutually exclusive race and ethnicity categories. This work presents a novel pathway to enhance RWD utility in studying racial disparities in healthcare.
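The per-category recall and precision ranges reported above can be illustrated with a minimal sketch. It assumes a plain list of reference labels and a list of imputed labels. The function name and the placeholder labels "A" to "C" are hypothetical, and the toy data are invented for illustration only; they are not the study's categories or results.

```python
from collections import Counter


def per_class_precision_recall(true_labels, predicted_labels):
    """Return {label: (precision, recall)} for every label seen in either list."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, pred in zip(true_labels, predicted_labels):
        if truth == pred:
            tp[truth] += 1
        else:
            fp[pred] += 1   # predicted this label, but it was wrong
            fn[truth] += 1  # missed the true label
    metrics = {}
    for label in set(true_labels) | set(predicted_labels):
        predicted = tp[label] + fp[label]
        actual = tp[label] + fn[label]
        precision = tp[label] / predicted if predicted else float("nan")
        recall = tp[label] / actual if actual else float("nan")
        metrics[label] = (precision, recall)
    return metrics


# Toy labels, invented for illustration only (not data from the study).
truth = ["A", "A", "A", "B", "B", "C"]
pred = ["A", "A", "B", "B", "B", "C"]
metrics = per_class_precision_recall(truth, pred)
for label in sorted(metrics):
    precision, recall = metrics[label]
    print(f"{label}: precision={precision:.3f} recall={recall:.3f}")
```

Reporting the minimum and maximum of each metric over the categories gives ranges of the kind quoted in the abstract.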


Subject(s)
Ethnicity; Names; Humans; Ethnicity/genetics; Racial Groups/genetics; Computational Biology; Genetic Testing
2.
Bioinform Adv ; 3(1): vbad062, 2023.
Article in English | MEDLINE | ID: mdl-37416509

ABSTRACT

Summary: RNA sequencing (RNA-seq) can be applied to diverse tasks including quantifying gene expression, discovering quantitative trait loci and identifying gene fusion events. Although RNA-seq can detect germline variants, the complexities of variable transcript abundance, target capture and amplification introduce challenging sources of error. Here, we extend DeepVariant, a deep-learning-based variant caller, to learn and account for the unique challenges presented by RNA-seq data. Our DeepVariant RNA-seq model produces highly accurate variant calls from RNA-sequencing data, and outperforms existing approaches such as Platypus and GATK. We examine factors that influence accuracy, how our model addresses RNA editing events and how additional thresholding can be used to facilitate our model's use in a production pipeline. Supplementary information: Supplementary data are available at Bioinformatics Advances online.

3.
NPJ Genom Med ; 3: 2, 2018.
Article in English | MEDLINE | ID: mdl-29354287

ABSTRACT

Next-generation deep sequencing of gene panels is being adopted as a diagnostic test to identify actionable mutations in cancer patient samples. However, clinical samples, such as formalin-fixed, paraffin-embedded (FFPE) specimens, frequently provide low quantities of degraded, poor quality DNA. To overcome these issues, many sequencing assays rely on extensive PCR amplification leading to an accumulation of bias and artifacts. Thus, there is a need for a targeted sequencing assay that performs well with DNA of low quality and quantity without relying on extensive PCR amplification. We evaluate the performance of a targeted sequencing assay based on Oligonucleotide Selective Sequencing, which permits the enrichment of genes and regions of interest and the identification of sequence variants from low amounts of damaged DNA. This assay utilizes a repair process adapted to clinical FFPE samples, followed by adaptor ligation to single-stranded DNA and a primer-based capture technique. Our approach generates sequence libraries of high fidelity with reduced reliance on extensive PCR amplification; this facilitates the accurate assessment of copy number alterations in addition to delivering accurate single nucleotide variant and insertion/deletion detection. We apply this method to capture and sequence the exons of a panel of 130 cancer-related genes, from which we obtain high read coverage uniformity across the targeted regions at starting input DNA amounts as low as 10 ng per sample. We demonstrate the performance using a series of reference DNA samples, and by identifying sequence variants in DNA from matched clinical samples originating from different tissue types.

4.
Immunity ; 43(6): 1199-211, 2015 Dec 15.
Article in English | MEDLINE | ID: mdl-26682989

ABSTRACT

Respiratory viral infections are a significant burden to healthcare worldwide. Many whole genome expression profiles have identified different respiratory viral infection signatures, but these have not translated to clinical practice. Here, we performed two integrated, multi-cohort analyses of publicly available transcriptional data of viral infections. First, we identified a common host signature across different respiratory viral infections that could distinguish (1) individuals with viral infections from healthy controls and from those with bacterial infections, and (2) symptomatic from asymptomatic subjects prior to symptom onset in challenge studies. Second, we identified an influenza-specific host response signature that (1) could distinguish influenza-infected samples from those with bacterial and other respiratory viral infections, (2) was a diagnostic and prognostic marker in influenza-pneumonia patients and influenza challenge studies, and (3) was predictive of response to influenza vaccine. Our results have applications in the diagnosis, prognosis, and identification of drug targets in viral infections.


Subject(s)
Respiratory Tract Infections/diagnosis; Respiratory Tract Infections/genetics; Transcriptome; Virus Diseases/diagnosis; Virus Diseases/genetics; Cohort Studies; Datasets as Topic; Humans
5.
Arthritis Res Ther ; 17: 262, 2015 Sep 21.
Article in English | MEDLINE | ID: mdl-26387933

ABSTRACT

INTRODUCTION: In the present study, we sought to identify markers in patients with anti-neutrophil cytoplasmic antibody (ANCA)-associated vasculitis (AAV) that distinguish those achieving remission at 6 months following rituximab or cyclophosphamide treatment from those for whom treatment failed in the Rituximab in ANCA-Associated Vasculitis (RAVE) trial. METHODS: Clinical and flow cytometry data from the RAVE trial were downloaded from the Immunology Database and Analysis Portal and Immune Tolerance Network TrialShare public repositories. Flow cytometry data were analyzed using validated automated gating and joined with clinical data. Lymphocyte and granulocyte populations were measured in patients who achieved or failed to achieve remission. RESULTS: There was no difference in lymphocyte subsets and treatment outcome with either treatment. We defined a Granularity Index (GI) that measures the difference between the percentage of hypergranular and hypogranular granulocytes. We found that rituximab-treated patients who achieved remission had a significantly higher GI at baseline than those who did not (p = 0.0085) and that this pattern was reversed in cyclophosphamide-treated patients (p = 0.037). We defined optimal cutoff values of the GI using the Youden index. Cyclophosphamide was superior to rituximab in inducing remission in patients with GI below -9.25% (67% vs. 30%, respectively; p = 0.033), whereas rituximab was superior to cyclophosphamide for patients with GI greater than 47.6% (83% vs. 33%, respectively; p = 0.0002). CONCLUSIONS: We identified distinct subsets of granulocytes found at baseline in patients with AAV that predicted whether they were more likely to achieve remission with cyclophosphamide or rituximab. Profiling patients on the basis of the GI may lead to more successful trials and therapeutic courses in AAV. TRIAL REGISTRATION: ClinicalTrials.gov identifier (for original study from which data were obtained): NCT00104299. Date of registration: 24 February 2005.
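The two quantities named in this abstract, the Granularity Index and a Youden-index cutoff, can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' procedure: the function names are hypothetical, a value at or above the cutoff is assumed to predict remission, and the toy percentages and outcomes are invented for illustration.

```python
def granularity_index(pct_hypergranular, pct_hypogranular):
    """GI as described in the abstract: % hypergranular minus % hypogranular granulocytes."""
    return pct_hypergranular - pct_hypogranular


def youden_cutoff(values, outcomes):
    """Return (cutoff, J) maximizing Youden's J = sensitivity + specificity - 1.

    Assumption: a value >= cutoff is classified as a positive outcome.
    """
    positives = sum(1 for o in outcomes if o)
    negatives = len(outcomes) - positives
    if positives == 0 or negatives == 0:
        raise ValueError("need at least one positive and one negative outcome")
    best_cutoff, best_j = None, float("-inf")
    for cutoff in sorted(set(values)):
        tp = sum(1 for v, o in zip(values, outcomes) if v >= cutoff and o)
        tn = sum(1 for v, o in zip(values, outcomes) if v < cutoff and not o)
        j = tp / positives + tn / negatives - 1
        if j > best_j:
            best_cutoff, best_j = cutoff, j
    return best_cutoff, best_j


# Toy data, invented for illustration only (not RAVE trial data).
percentages = [(70, 20), (65, 25), (40, 45), (30, 50)]  # (hyper %, hypo %)
gi_values = [granularity_index(h, l) for h, l in percentages]
remission = [True, True, False, False]
cutoff, j = youden_cutoff(gi_values, remission)
print(gi_values, cutoff, j)
```

Because the abstract reports the opposite pattern in cyclophosphamide-treated patients, a cutoff for that arm would need the classification direction reversed (values below the cutoff treated as positive).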


Subject(s)
Anti-Neutrophil Cytoplasmic Antibody-Associated Vasculitis/drug therapy; Anti-Neutrophil Cytoplasmic Antibody-Associated Vasculitis/immunology; Biomarkers/blood; Granulocytes/immunology; Immunologic Factors/therapeutic use; Rituximab/therapeutic use; Female; Flow Cytometry; Humans; Male; Middle Aged; Remission Induction; Treatment Outcome
6.
J Am Med Inform Assoc ; 22(6): 1148-52, 2015 Nov.
Article in English | MEDLINE | ID: mdl-26112029

ABSTRACT

The Center for Expanded Data Annotation and Retrieval is studying the creation of comprehensive and expressive metadata for biomedical datasets to facilitate data discovery, data interpretation, and data reuse. We take advantage of emerging community-based standard templates for describing different kinds of biomedical datasets, and we investigate the use of computational techniques to help investigators to assemble templates and to fill in their values. We are creating a repository of metadata from which we plan to identify metadata patterns that will drive predictive data entry when filling in metadata templates. The metadata repository not only will capture annotations specified when experimental datasets are initially created, but also will incorporate links to the published literature, including secondary analyses and possible refinements or retractions of experimental interpretations. By working initially with the Human Immunology Project Consortium and the developers of the ImmPort data repository, we are developing and evaluating an end-to-end solution to the problems of metadata authoring and management that will generalize to other data-management environments.


Subject(s)
Biomedical Research; Data Mining; Datasets as Topic; Biological Ontologies; Humans; Information Storage and Retrieval; United States
7.
Genome Med ; 2(8): 51, 2010 Aug 06.
Article in English | MEDLINE | ID: mdl-20691073

ABSTRACT

With the continued exponential expansion of publicly available genomic data and access to low-cost, high-throughput molecular technologies for profiling patient populations, computational technologies and informatics are becoming vital considerations in genomic medicine. Although cloud computing technology is being heralded as a key enabling technology for the future of genomic research, available case studies are limited to applications in the domain of high-throughput sequence data analysis. The goal of this study was to evaluate the computational and economic characteristics of cloud computing in performing a large-scale data integration and analysis representative of research problems in genomic medicine. We find that the cloud-based analysis compares favorably with a local computational cluster in both performance and cost, suggesting that cloud computing technologies might be a viable resource for facilitating large-scale translational research in genomic medicine.

9.
BMC Bioinformatics ; 8: 244, 2007 Jul 10.
Article in English | MEDLINE | ID: mdl-17623104

ABSTRACT

BACKGROUND: Using computational database searches, we have demonstrated previously that no gene sequences could be found for at least 36% of enzyme activities that have been assigned an Enzyme Commission number. Here we present a follow-up literature-based survey involving a statistically significant sample of such "orphan" activities. The survey was intended to determine whether sequences for these enzyme activities are truly unknown, or whether these sequences are absent from the public sequence databases but can be found in the literature. RESULTS: We demonstrate that for ~80% of sampled orphans, the absence of sequence data is bona fide. Our analyses further substantiate the notion that many of these enzyme activities play biologically important roles. CONCLUSION: This survey points toward a significant scientific cost of having such a large fraction of characterized enzyme activities disconnected from sequence data. It also suggests that a larger effort, beginning with a comprehensive survey of all putative orphan activities, would resolve nearly 300 artifactual orphans and reconnect a wealth of enzyme research with modern genomics. For these reasons, we propose that a systematic effort to identify the cognate genes of orphan enzymes be undertaken.


Subject(s)
Computational Biology/methods; Data Collection; Enzymes/classification; Enzymes/genetics; Databases, Factual; Databases, Genetic; Databases, Protein; Enzymes/metabolism; Genomics; Proteomics; Reproducibility of Results; Species Specificity
10.
BMC Bioinformatics ; 7: 170, 2006 Mar 23.
Article in English | MEDLINE | ID: mdl-16556315

ABSTRACT

BACKGROUND: This article addresses the problem of interoperation of heterogeneous bioinformatics databases. RESULTS: We introduce BioWarehouse, an open source toolkit for constructing bioinformatics database warehouses using the MySQL and Oracle relational database managers. BioWarehouse integrates its component databases into a common representational framework within a single database management system, thus enabling multi-database queries using the Structured Query Language (SQL) and also facilitating a variety of database integration tasks such as comparative analysis and data mining. BioWarehouse currently supports the integration of a pathway-centric set of databases including ENZYME, KEGG, and BioCyc, and in addition the UniProt, GenBank, NCBI Taxonomy, and CMR databases, and the Gene Ontology. Loader tools, written in the C and Java languages, parse and load these databases into a relational database schema. The loaders also apply a degree of semantic normalization to their respective source data, decreasing semantic heterogeneity. The schema supports the following bioinformatics datatypes: chemical compounds, biochemical reactions, metabolic pathways, proteins, genes, nucleic acid sequences, features on protein and nucleic-acid sequences, organisms, organism taxonomies, and controlled vocabularies. As an application example, we applied BioWarehouse to determine the fraction of biochemically characterized enzyme activities for which no sequences exist in the public sequence databases. The answer is that no sequence exists for 36% of enzyme activities for which EC numbers have been assigned. These gaps in sequence data significantly limit the accuracy of genome annotation and metabolic pathway prediction, and are a barrier for metabolic engineering. Complex queries of this type provide examples of the value of the data warehousing approach to bioinformatics research. CONCLUSION: BioWarehouse embodies significant progress on the database integration problem for bioinformatics.
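The kind of multi-database query described in this abstract (the fraction of enzyme activities with no associated sequence) can be sketched as a single SQL statement. This is an illustration only: the table and column names below are hypothetical and are not the BioWarehouse schema, Python's built-in sqlite3 module stands in for the MySQL and Oracle back ends named in the abstract, and the rows are invented placeholders.

```python
import sqlite3

# In-memory stand-in database with a hypothetical two-table schema.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE enzyme_activity (ec_number TEXT PRIMARY KEY);
    CREATE TABLE protein_sequence (
        protein_id INTEGER PRIMARY KEY,
        ec_number  TEXT
    );
    -- Placeholder identifiers, not real EC numbers.
    INSERT INTO enzyme_activity VALUES ('EC-A'), ('EC-B'), ('EC-C'), ('EC-D');
    INSERT INTO protein_sequence (ec_number)
        VALUES ('EC-A'), ('EC-A'), ('EC-B'), ('EC-D');
    """
)

# Left-join every activity to the distinct set of EC numbers that have at
# least one sequence; activities with no match are the "orphans".
query = """
    SELECT CAST(SUM(CASE WHEN s.ec_number IS NULL THEN 1 ELSE 0 END) AS REAL)
           / COUNT(*)
    FROM enzyme_activity AS a
    LEFT JOIN (SELECT DISTINCT ec_number FROM protein_sequence) AS s
           ON s.ec_number = a.ec_number
"""
orphan_fraction = conn.execute(query).fetchone()[0]
print(f"fraction of activities without a sequence: {orphan_fraction:.2f}")
conn.close()
```

Joining against the distinct EC numbers keeps one row per activity, so activities with several sequences are not counted more than once.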


Subject(s)
Computational Biology/methods; Database Management Systems; Databases, Factual; Databases, Genetic; Databases, Protein; Protein Engineering/methods; Proteins/chemistry; Proteins/genetics; Semantics; Signal Transduction/genetics; Software