1.
Nucleic Acids Res ; 42(Database issue): D975-9, 2014 Jan.
Article in English | MEDLINE | ID: mdl-24297256

ABSTRACT

The Database of Genotypes and Phenotypes (dbGaP, http://www.ncbi.nlm.nih.gov/gap) is a National Institutes of Health-sponsored repository charged with archiving, curating and distributing information produced by studies investigating the interaction of genotype and phenotype. Information in dbGaP is organized as a hierarchical structure and includes the accessioned objects, phenotypes (as variables and datasets), various molecular assay data (SNP and Expression Array data, Sequence and Epigenomic marks), analyses and documents. Publicly accessible metadata about submitted studies, summary-level data, and documents related to studies can be accessed freely on the dbGaP website. Individual-level data are accessible via Controlled Access application to scientists across the globe.


Subject(s)
Databases, Genetic , Genotype , Phenotype , Humans , Internet , National Library of Medicine (U.S.) , United States
2.
Viruses ; 16(3), 2024 03 11.
Article in English | MEDLINE | ID: mdl-38543795

ABSTRACT

Genomic sequencing of clinical samples to identify emerging variants of SARS-CoV-2 has been a key public health tool for curbing the spread of the virus. As a result, an unprecedented number of SARS-CoV-2 genomes were sequenced during the COVID-19 pandemic, which allowed for rapid identification of genetic variants, enabling the timely design and testing of therapies and deployment of new vaccine formulations to combat the new variants. However, despite the technological advances of deep sequencing, the analysis of the raw sequence data generated globally is neither standardized nor consistent, leading to vastly disparate sequences that may impact identification of variants. Here, we show that for both Illumina and Oxford Nanopore sequencing platforms, downstream bioinformatic protocols used by industry, government, and academic groups resulted in different virus sequences from the same sample. These bioinformatic workflows produced consensus genomes that differed in single nucleotide polymorphisms and in the inclusion or exclusion of insertions and/or deletions, despite using the same raw sequence data as input. We compared and characterized such discrepancies and propose a specific suite of parameters and protocols that should be adopted across the field. Consistent results from bioinformatic workflows are fundamental to SARS-CoV-2 and future pathogen surveillance efforts, including pandemic preparation, to allow for a data-driven and timely public health response.
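
The kind of discrepancy described above can be illustrated with a minimal sketch. It is not the comparison procedure used in the study. It assumes two consensus genomes from the same sample that have already been aligned to equal length, with `-` marking gaps and `N` marking uncalled bases; the function name `compare_consensus` is hypothetical.

```python
from typing import Dict, List


def compare_consensus(seq_a: str, seq_b: str) -> Dict[str, List[int]]:
    """List 1-based alignment positions where two aligned consensus genomes differ.

    Assumption (not from the abstract): both sequences are pre-aligned,
    '-' marks a gap and 'N' marks an uncalled base.
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")

    diffs: Dict[str, List[int]] = {"substitution": [], "indel": []}
    for pos, (a, b) in enumerate(zip(seq_a.upper(), seq_b.upper()), start=1):
        # Identical columns and columns with an uncalled base are not counted.
        if a == b or "N" in (a, b):
            continue
        # A gap in exactly one sequence is an insertion/deletion difference.
        kind = "indel" if "-" in (a, b) else "substitution"
        diffs[kind].append(pos)
    return diffs


if __name__ == "__main__":
    # Toy sequences, not real SARS-CoV-2 data.
    print(compare_consensus("ACGT-ANT", "ACCTGAAT"))
```

Skipping `N` columns is a design choice in this sketch: an uncalled base reflects missing coverage rather than a disagreement between pipelines.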


Subject(s)
COVID-19 , SARS-CoV-2 , Humans , SARS-CoV-2/genetics , COVID-19/epidemiology , Pandemics , Workflow , Computational Biology
3.
bioRxiv ; 2022 Nov 03.
Article in English | MEDLINE | ID: mdl-36380755

ABSTRACT

During the COVID-19 pandemic, SARS-CoV-2 surveillance efforts integrated genome sequencing of clinical samples to identify emergent viral variants and to support rapid experimental examination of genome-informed vaccine and therapeutic designs. Given the broad range of methods applied to generate new viral genomes, it is critical that consensus and variant calling tools yield consistent results across disparate pipelines. Here we examine the impact of sequencing technologies (Illumina and Oxford Nanopore) and 7 different downstream bioinformatic protocols on SARS-CoV-2 variant calling as part of the NIH Accelerating COVID-19 Therapeutic Interventions and Vaccines (ACTIV) Tracking Resistance and Coronavirus Evolution (TRACE) initiative, a public-private partnership established to address the COVID-19 outbreak. Our results indicate that bioinformatic workflows can yield consensus genomes with different single nucleotide polymorphisms, insertions, and/or deletions even when using the same raw sequence input datasets. We introduce the use of a specific suite of parameters and protocols that greatly improves the agreement among pipelines developed by diverse organizations. Such consistency among bioinformatic pipelines is fundamental to SARS-CoV-2 and future pathogen surveillance efforts. The application of analysis standards is necessary to more accurately document phylogenomic trends and support data-driven public health responses.

4.
G3 (Bethesda) ; 9(8): 2447-2461, 2019 08 08.
Article in English | MEDLINE | ID: mdl-31151998

ABSTRACT

Inferring subject ancestry using genetic data is an important step in genetic association studies, required for dealing with population stratification. It has become more challenging to infer subject ancestry quickly and accurately since large amounts of genotype data, collected from millions of subjects by thousands of studies using different methods, are accessible to researchers from repositories such as the database of Genotypes and Phenotypes (dbGaP) at the National Center for Biotechnology Information (NCBI). Study-reported populations submitted to dbGaP are often not harmonized across studies or may be missing. Widely-used methods for ancestry prediction assume that most markers are genotyped in all subjects, but this assumption is unrealistic if one wants to combine studies that used different genotyping platforms. To provide ancestry inference and visualization across studies, we developed a new method, GRAF-pop, of ancestry prediction that is robust to missing genotypes and allows researchers to visualize predicted population structure in color and in three dimensions. When genotypes are dense, GRAF-pop is comparable in quality and running time to existing ancestry inference methods EIGENSTRAT, FastPCA, and FlashPCA2, all of which rely on principal components analysis (PCA). When genotypes are not dense, GRAF-pop gives much better ancestry predictions than the PCA-based methods. GRAF-pop employs basic geometric and probabilistic methods; the visualized ancestry predictions have a natural geometric interpretation, which is lacking in PCA-based methods. Since February 2018, GRAF-pop has been successfully incorporated into the dbGaP quality control process to identify inconsistencies between study-reported and computationally predicted populations and to provide harmonized population values in all new dbGaP submissions amenable to population prediction, based on marker genotypes. 
Plots of summary population predictions produced by GRAF-pop are available on dbGaP study pages, and the software is available at https://www.ncbi.nlm.nih.gov/projects/gap/cgi-bin/Software.cgi.


Subject(s)
Databases, Genetic , Genetic Association Studies/methods , Software , Algorithms , Cluster Analysis , Genetics, Population , Genome-Wide Association Study , Humans , Principal Component Analysis , Reproducibility of Results
5.
PLoS One ; 12(6): e0179106, 2017.
Article in English | MEDLINE | ID: mdl-28609482

ABSTRACT

Genome-wide association studies (GWAS) usually rely on the assumption that different samples are not from closely related individuals. Detection of duplicates and close relatives becomes more difficult both statistically and computationally when one wants to combine datasets that may have been genotyped on different platforms. The dbGaP repository at the National Center for Biotechnology Information (NCBI) contains datasets from hundreds of studies with over one million samples. There are many duplicates and closely related individuals both within and across studies from different submitters. Relationships between studies cannot always be identified by the submitters of individual datasets. To aid in curation of dbGaP, we developed a rapid statistical method called Genetic Relationship and Fingerprinting (GRAF) to detect duplicates and closely related samples, even when the sets of genotyped markers differ and the DNA strand orientations are unknown. GRAF extracts genotypes of 10,000 informative and independent SNPs from genotype datasets obtained using different methods, and implements quick algorithms that enable it to find all of the duplicate pairs from more than 880,000 samples within and across dbGaP studies in less than two hours. In addition, GRAF uses two statistical metrics called All Genotype Mismatch Rate (AGMR) and Homozygous Genotype Mismatch Rate (HGMR) to determine subject relationships directly from the observed genotypes, without estimating probabilities of identity by descent (IBD) or kinship coefficients, and compares the predicted relationships with those reported in the pedigree files. We implemented GRAF in a freely available C++ program of the same name. In this paper, we describe the methods in GRAF and validate the usage of GRAF on samples from the dbGaP repository. Other scientists can use GRAF on their own samples and in combination with samples downloaded from dbGaP.
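
The abstract names AGMR and HGMR without defining them, so the following is only a sketch under stated assumptions, not the GRAF implementation (which is a C++ program). It assumes AGMR is the fraction of mismatching genotypes among SNPs genotyped in both samples, and HGMR is the fraction of mismatches among SNPs where both samples are homozygous. Genotypes are assumed to be coded as allele counts (0, 1, 2) on a consistent strand, with `None` for missing; the function name `mismatch_rates` is hypothetical.

```python
from typing import Optional, Sequence, Tuple

# Assumed coding: 0 or 2 = homozygous, 1 = heterozygous, None = missing.
Genotype = Optional[int]


def mismatch_rates(a: Sequence[Genotype], b: Sequence[Genotype]) -> Tuple[float, float]:
    """Return (AGMR, HGMR) for two samples under the assumed definitions."""
    if len(a) != len(b):
        raise ValueError("both samples must cover the same ordered SNP list")

    # SNPs with a non-missing genotype in both samples.
    shared = [(x, y) for x, y in zip(a, b) if x is not None and y is not None]
    # Subset of those SNPs where both samples are homozygous.
    homozygous = [(x, y) for x, y in shared if x != 1 and y != 1]
    if not shared or not homozygous:
        raise ValueError("not enough overlapping genotypes to compute rates")

    agmr = sum(x != y for x, y in shared) / len(shared)
    hgmr = sum(x != y for x, y in homozygous) / len(homozygous)
    return agmr, hgmr


if __name__ == "__main__":
    # Toy genotypes for six SNPs, not real data.
    sample_1 = [0, 1, 2, 2, None, 0]
    sample_2 = [0, 1, 2, 0, 1, 1]
    print(mismatch_rates(sample_1, sample_2))
```

Under these definitions, duplicate samples would be expected to give both rates near zero, while a low HGMR with a higher AGMR would point toward close relatives; the actual thresholds GRAF uses are not stated in the abstract.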


Subject(s)
Algorithms , Computational Biology/methods , Data Mining/methods , Databases, Nucleic Acid/statistics & numerical data , Genome-Wide Association Study/statistics & numerical data , Polymorphism, Single Nucleotide , Genome-Wide Association Study/methods , Genotype , Humans , Reproducibility of Results
6.
Comp Biochem Physiol B Biochem Mol Biol ; 144(3): 290-300, 2006 Jul.
Article in English | MEDLINE | ID: mdl-16725360

ABSTRACT

A quantification method was developed to determine the concentrations of the major antifreeze glycoprotein (AFGP) isoforms in the blood of Antarctic notothenioid fishes. Serum samples were precipitated with 2.5% TCA, the supernatant containing AFGPs was chromatographed on an HPLC size exclusion column, and the concentrations of the major AFGP size classes were determined from the areas of the corresponding peaks in the elution profile. Eight species of Antarctic notothenioid fishes were examined and their blood AFGP concentrations varied from 5 to 35 mg/mL. All of these fishes synthesized both the large and small AFGPs, but maintained higher levels of small AFGPs than large ones in their blood. The species inhabiting more severe water environments (lower temperature and presence of ice) had higher serum AFGP levels than those in milder environments. The cryopelagic Pagothenia borchgrevinki decreased its blood AFGP concentrations in response to warm acclimation, but to a much lesser extent than the antifreeze-bearing fishes in the Northern Hemisphere. After warm acclimation at +4 °C for 16 weeks, the serum concentrations of the small and large AFGPs had decreased by about 60% and 20%, respectively.


Subject(s)
Acclimatization , Antifreeze Proteins/metabolism , Antifreeze Proteins/physiology , Perciformes/physiology , Temperature , Animals , Antarctic Regions , Chromatography, Gel , Chromatography, High Pressure Liquid/methods , Environment , Fishes/physiology , Ions/blood , Osmolar Concentration , Perciformes/blood , Reproducibility of Results
7.
Proteins ; 61 Suppl 7: 167-175, 2005.
Article in English | MEDLINE | ID: mdl-16187359

ABSTRACT

Natively disordered proteins or protein segments are those without stable secondary or tertiary structure in the absence of binding partners. Such disordered regions often are important functional sites in many biological processes, especially those involved in transcription, translation, and cell signaling. The prediction of such regions is therefore of great importance in focusing experimental efforts on regions of proteins that may be critical for function. In CASP6, held in 2004, twenty research groups participated in the prediction of disordered regions. Both binary predictions (ordered or disordered) and assigned scores for disorder were assessed. Several groups performed quite well in predicting regions of disorder in the X-ray and NMR structures available to the assessors. The best of these groups performed better than the best groups in CASP5, held in 2002.


Subject(s)
Computational Biology/methods , Proteomics/methods , Algorithms , Computer Simulation , Computers , Databases, Protein , Humans , Models, Molecular , Protein Conformation , Protein Folding , Protein Structure, Secondary , Protein Structure, Tertiary , Reproducibility of Results , Sequence Alignment , Software
8.
Proteins ; 61 Suppl 7: 46-66, 2005.
Article in English | MEDLINE | ID: mdl-16187346

ABSTRACT

The Sixth Community Wide Experiment on the Critical Assessment of Techniques for Protein Structure Prediction (CASP6) held in December 2004 focused on the prediction of the structures of 90 protein domains from 64 targets. Thirty-eight of these were classified as "fold recognition," defined as being similar in fold to proteins of known structure at the time of submission of the predictions. Only the "first" predictions and those longer than 20 amino acids for each domain were assessed, resulting in 4527 predictions from 165 groups. The assessment was accomplished by the use of six structure alignment programs and three scoring measures based on these alignments. The use of a variety of measures resulted in scoring insensitive to the peculiarities of any one alignment method. The top-ranked methods in the prediction of structures that were clearly homologous to proteins in the Protein Data Bank primarily used servers and other programs based on achieving a consensus of many remote homology detection and fold recognition methods. The top-ranked methods in prediction of structures less clearly related or unrelated to proteins of known structures used fragment building methods in addition to the fold recognition meta methods.


Subject(s)
Computational Biology/methods , Proteomics/methods , Algorithms , Data Interpretation, Statistical , Databases, Protein , Models, Molecular , Models, Statistical , Protein Conformation , Protein Folding , Protein Structure, Tertiary , Reproducibility of Results , Software
9.
Nat Genet ; 39(10): 1181-6, 2007 Oct.
Article in English | MEDLINE | ID: mdl-17898773

ABSTRACT

The National Center for Biotechnology Information has created the dbGaP public repository for individual-level phenotype, exposure, genotype and sequence data and the associations between them. dbGaP assigns stable, unique identifiers to studies and subsets of information from those studies, including documents, individual phenotypic variables, tables of trait data, sets of genotype data, computed phenotype-genotype associations, and groups of study subjects who have given similar consents for use of their data.


Subject(s)
Databases, Genetic , Genotype , Phenotype , Computational Biology , Databases, Factual , National Library of Medicine (U.S.)/organization & administration , United States
10.
Mol Biol Evol ; 20(11): 1897-908, 2003 Nov.
Article in English | MEDLINE | ID: mdl-12885956

ABSTRACT

The fish fauna of the Antarctic Ocean is dominated by five endemic families of the Perciform suborder Notothenioidei, thought to have arisen in situ within the Antarctic through adaptive radiation of an ancestral stock that evolved antifreeze glycoproteins (AFGPs) enabling survival as the ocean chilled to subzero temperatures. The endemism results from geographic confinement imposed by a massive oceanographic barrier, the Antarctic Circumpolar Current, which also thermally isolated Antarctica over geologic time, leading to its current frigid condition. Despite this voluminous barrier to fish dispersal, a number of species from the Antarctic family Nototheniidae now inhabit the nonfreezing cool temperate coasts of the southern continents. The origin of these temperate-water nototheniids is not completely understood. Since the AFGP gene apparently evolved only once, before the Antarctic notothenioid radiation, the presence of AFGP genes in extant temperate-water nototheniids can be used to infer an Antarctic evolutionary origin. Genomic Southern analysis, PCR amplification of AFGP genes, and sequencing showed that Notothenia angustata and Notothenia microlepidota, endemic to southern New Zealand, have two to three AFGP genes, structurally the same as those of the Antarctic nototheniids. At least one of these genes is still functional, as AFGP cDNAs were obtained and low levels of mature AFGPs were detected in the blood. A phylogenetic tree based on complete ND2 coding sequences showed monophyly of these two New Zealand nototheniids and their inclusion in the monophyletic Nototheniidae, which consists mostly of AFGP-bearing taxa. These analyses support an Antarctic ancestry for the New Zealand nototheniids.
A divergence time of approximately 11 Myr was estimated for the two New Zealand nototheniids, approximating the upper Miocene northern advance of the Antarctic Convergence over New Zealand, which might have served as the vicariant event that led to the northward dispersal of their most recent common ancestor. Similar secondary northward dispersal likely applies to the South American nototheniid Paranotothenia magellanica, which has four AFGP genes in its DNA, but not to the sympatric nototheniid Patagonotothen tessellata, which does not appear to have any AFGP sequences in its genome.


Subject(s)
Antifreeze Proteins/genetics , Amino Acid Sequence , Animals , Antarctic Regions , Base Sequence , Blotting, Southern , Cloning, Molecular , DNA, Complementary/metabolism , Enzyme-Linked Immunosorbent Assay , Evolution, Molecular , Fishes , Molecular Sequence Data , New Zealand , Phylogeny , Polymerase Chain Reaction , Reverse Transcriptase Polymerase Chain Reaction , Temperature , Water