Results 1 - 20 of 54
1.
Nat Methods ; 21(6): 994-1002, 2024 Jun.
Article in English | MEDLINE | ID: mdl-38755321

ABSTRACT

Searching vast and rapidly growing nucleotide content in resources, such as runs in the Sequence Read Archive and assemblies for whole-genome shotgun sequencing projects in GenBank, is currently impractical for most researchers. Here we present Pebblescout, a tool that navigates such content by providing indexing and search capabilities. Indexing uses dense sampling of the sequences in the resource. Search finds subjects (runs or assemblies) that have short sequence matches to a user query, with well-defined guarantees, and ranks them using informativeness of the matches. We illustrate the functionality of Pebblescout by creating eight databases that index over 3.7 petabases. The web service of Pebblescout can be reached at https://pebblescout.ncbi.nlm.nih.gov. We show that for a wide range of query lengths, Pebblescout provides a data-driven way for finding relevant subsets of large nucleotide resources, reducing the effort for downstream analysis substantially. We also show that Pebblescout results compare favorably to MetaGraph and Sourmash.
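
The index-then-rank idea in this abstract can be illustrated with a toy k-mer index. This is only a sketch under stated assumptions, not Pebblescout's actual method: the abstract does not specify the sampling scheme or the ranking formula, so the code below simply indexes every k-mer and uses a hypothetical rarity weight as a stand-in for "informativeness". All function names are invented for illustration.

```python
from collections import defaultdict
import math


def kmers(seq, k):
    """Yield every k-mer of seq (assumption: the densest possible sampling)."""
    for i in range(len(seq) - k + 1):
        yield seq[i:i + k]


def build_index(subjects, k):
    """Map each k-mer to the set of subject ids (runs or assemblies) containing it."""
    index = defaultdict(set)
    for subject_id, seq in subjects.items():
        for kmer in kmers(seq, k):
            index[kmer].add(subject_id)
    return index


def search(index, n_subjects, query, k):
    """Rank subjects by the summed rarity weight of the k-mers shared with the query.

    The weight log(1 + N / hits) is a hypothetical informativeness measure:
    k-mers found in few subjects count more than k-mers found in many.
    """
    scores = defaultdict(float)
    for kmer in set(kmers(query, k)):
        hits = index.get(kmer, ())
        if not hits:
            continue
        weight = math.log(1 + n_subjects / len(hits))
        for subject_id in hits:
            scores[subject_id] += weight
    # Highest score first; ties broken by subject id for a stable order.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))
```

Calling `search(build_index(subjects, 4), len(subjects), query, 4)` returns `(subject_id, score)` pairs, and subjects sharing no k-mer with the query are omitted entirely.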


Subjects
Software; Nucleotides/genetics; Humans; Databases, Genetic; Computational Biology/methods; Databases, Nucleic Acid; Algorithms
2.
BMC Bioinformatics ; 22(1): 375, 2021 Jul 21.
Article in English | MEDLINE | ID: mdl-34289805

ABSTRACT

BACKGROUND: Illumina is the dominant sequencing technology at this time. Short length, short insert size, some systematic biases, and low-level carryover contamination in Illumina reads continue to make assembly of repeated regions a challenging problem. Some applications also require finding multiple well supported variants for assembled regions. RESULTS: To facilitate assembly of repeat regions and to report multiple well supported variants when a user can provide target sequences to assist the assembly, we propose SAUTE and SAUTE_PROT assemblers. Both assemblers use de Bruijn graph on reads. Targets can be transcripts or proteins for RNA-seq reads and transcripts, proteins, or genomic regions for genomic reads. Target sequences are nucleotide and protein sequences for SAUTE and SAUTE_PROT, respectively. CONCLUSIONS: For RNA-seq, comparisons with TRINITY, RNASPADES, SPALIGNER, and SPADES assembly of reads aligned to target proteins by DIAMOND show that SAUTE_PROT finds more coding sequences that translate to benchmark proteins. Using AMRFINDERPLUS calls, we find SAUTE has higher sensitivity and precision than SPADES, PLASMIDSPADES, SPALIGNER, and SPADES assembly of reads aligned to target regions by HISAT2. It also has better sensitivity than SKESA but worse precision.
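
As a rough illustration of what "de Bruijn graph on reads" means, the sketch below builds a graph from reads and extends a seed while the path is unambiguous. It is a generic textbook-style example, not SAUTE's algorithm: the seed stands in loosely for a user-supplied target, and the function names and the `min_support` threshold are assumptions made for illustration.

```python
from collections import defaultdict


def de_bruijn_graph(reads, k):
    """Build edges between (k-1)-mers; each edge stores how many k-mers support it."""
    edges = defaultdict(lambda: defaultdict(int))
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            edges[kmer[:-1]][kmer[1:]] += 1
    return edges


def extend_from(edges, seed, min_support=2):
    """Extend a (k-1)-mer seed to the right while exactly one well-supported edge exists.

    Extension stops at a dead end, at a fork (several supported successors,
    i.e. possible variants or repeats), or when a node would be revisited.
    """
    contig = seed
    node = seed
    visited = {seed}
    while True:
        successors = [nxt for nxt, count in edges.get(node, {}).items()
                      if count >= min_support]
        if len(successors) != 1 or successors[0] in visited:
            break
        node = successors[0]
        visited.add(node)
        contig += node[-1]
    return contig
```

A fork in this graph is where an assembler that reports multiple well-supported variants would branch instead of stopping; the sketch stops there to stay short.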


Subjects
Genomics; High-Throughput Nucleotide Sequencing; Algorithms; Genome; RNA-Seq; Sequence Analysis, DNA
3.
Nucleic Acids Res ; 47(D1): D23-D28, 2019 Jan 08.
Article in English | MEDLINE | ID: mdl-30395293

ABSTRACT

The National Center for Biotechnology Information (NCBI) provides a large suite of online resources for biological information and data, including the GenBank® nucleic acid sequence database and the PubMed database of citations and abstracts published in life science journals. The Entrez system provides search and retrieval operations for most of these data from 38 distinct databases. The E-utilities serve as the programming interface for the Entrez system. Augmenting many of the web applications are custom implementations of the BLAST program optimized to search specialized data sets. New resources released in the past year include PubMed Labs and a new sequence database search. Resources that were updated in the past year include PubMed, PMC, Bookshelf, genome data viewer, Assembly, prokaryotic genomes, Genome, BioProject, dbSNP, dbVar, BLAST databases, igBLAST, iCn3D and PubChem. All of these resources can be accessed through the NCBI home page at www.ncbi.nlm.nih.gov.
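
Since the abstract names the E-utilities as the programming interface to Entrez, a minimal sketch of building an ESearch request may help. It only constructs the URL and makes no network call; the helper name `esearch_url` and the default `retmax` value are assumptions for illustration, and NCBI's usage guidelines (rate limits, API keys) are not covered here.

```python
from urllib.parse import urlencode

# Base address of the E-utilities service.
EUTILS_BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"


def esearch_url(db, term, retmax=20):
    """Build an ESearch request URL for an Entrez database; nothing is fetched."""
    params = {"db": db, "term": term, "retmax": retmax, "retmode": "json"}
    return f"{EUTILS_BASE}/esearch.fcgi?{urlencode(params)}"
```

For example, `esearch_url("pubmed", "Pebblescout")` yields a URL that can be passed to any HTTP client to retrieve matching record identifiers.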


Subjects
Biotechnology/organization & administration; Databases, Genetic; Animals; Biotechnology/methods; Databases, Chemical; Humans; Software; United States/epidemiology; Web Browser
4.
Genome Res ; 24(12): 2066-76, 2014 Dec.
Article in English | MEDLINE | ID: mdl-25373144

ABSTRACT

A complete reference assembly is essential for accurately interpreting individual genomes and associating variation with phenotypes. While the current human reference genome sequence is of very high quality, gaps and misassemblies remain due to biological and technical complexities. Large repetitive sequences and complex allelic diversity are the two main drivers of assembly error. Although increasing the length of sequence reads and library fragments can improve assembly, even the longest available reads do not resolve all regions. In order to overcome the issue of allelic diversity, we used genomic DNA from an essentially haploid hydatidiform mole, CHM1. We utilized several resources from this DNA including a set of end-sequenced and indexed BAC clones and 100× Illumina whole-genome shotgun (WGS) sequence coverage. We used the WGS sequence and the GRCh37 reference assembly to create an assembly of the CHM1 genome. We subsequently incorporated 382 finished BAC clone sequences to generate a draft assembly, CHM1_1.1 (NCBI AssemblyDB GCA_000306695.2). Analysis of gene, repetitive element, and segmental duplication content shows this assembly to be of excellent quality and contiguity. However, comparison to assembly-independent resources, such as BAC clone end sequences and PacBio long reads, indicates misassembled regions. Most of these regions are enriched for structural variation and segmental duplication, and can be resolved in the future. This publicly available assembly will be integrated into the Genome Reference Consortium curation framework for further improvement, with the ultimate goal being a completely finished gap-free assembly.


Subjects
Genome, Human; Haplotypes; Hydatidiform Mole/genetics; Alleles; Chromosome Mapping; Chromosomes, Artificial, Bacterial; Computational Biology/methods; Female; Genomics/methods; Heterozygote; High-Throughput Nucleotide Sequencing; Humans; Polymorphism, Single Nucleotide; Pregnancy; Repetitive Sequences, Nucleic Acid; Segmental Duplications, Genomic; Sequence Analysis, DNA
5.
BMC Genomics ; 17: 37, 2016 Jan 07.
Article in English | MEDLINE | ID: mdl-26742787

ABSTRACT

BACKGROUND: Xiphophorus fishes are represented by 26 live-bearing species of tropical fish that express many attributes (e.g., viviparity, genetic and phenotypic variation, ecological adaptation, varied sexual developmental mechanisms, ability to produce fertile interspecies hybrids) that have made them attractive research models for over 85 years. Use of various interspecies hybrids to investigate the genetics underlying spontaneous and induced tumorigenesis has resulted in the development and maintenance of pedigreed Xiphophorus lines specifically bred for research. The recent availability of the X. maculatus reference genome assembly now provides unprecedented opportunities for novel and exciting comparative research studies among Xiphophorus species. RESULTS: We present sequencing, assembly and annotation of two new genomes representing Xiphophorus couchianus and Xiphophorus hellerii. The final X. couchianus and X. hellerii assemblies have total sizes of 708 Mb and 734 Mb and correspond to 98% and 102% of the X. maculatus Jp 163 A genome size, respectively. The rates of single nucleotide change range from 1 per 52 bp to 1 per 69 bp among the three genomes, and the impact of putatively damaging variants is presented. In addition, a survey of transposable elements allowed us to deduce an ancestral TE landscape, uncover potentially active TEs and document a recent burst of TEs during evolution of this genus. CONCLUSIONS: Two new Xiphophorus genomes and their corresponding transcriptomes were efficiently assembled, the former using a novel guided assembly approach. Three assembled genome sequences within this single vertebrate order of new world live-bearing fishes will accelerate our understanding of the relationship between environmental adaptation and genome evolution. In addition, these genome resources provide the capability to determine allele-specific gene regulation among interspecies hybrids produced by crossing any of the three species that are known to produce progeny predisposed to tumor development.


Subjects
Cyprinodontiformes/genetics; Genetic Variation; Genome; Transcriptome/genetics; Animals; Gene Expression Regulation; Genomics; Species Specificity
6.
Gastroenterology ; 149(1): 67-78, 2015 Jul.
Article in English | MEDLINE | ID: mdl-25865046

ABSTRACT

BACKGROUND & AIMS: Small intestinal carcinoids are rare and difficult to diagnose and patients often present with advanced incurable disease. Although the disease occurs sporadically, there have been reports of family clusters. Hereditary small intestinal carcinoid has not been recognized and genetic factors have not been identified. We performed a genetic analysis of families with small intestinal carcinoids to establish a hereditary basis and find genes that might cause this cancer. METHODS: We performed a prospective study of 33 families with at least 2 cases of small intestinal carcinoids. Affected members were characterized clinically and asymptomatic relatives were screened and underwent exploratory laparotomy for suspected tumors. Disease-associated mutations were sought using linkage analysis, whole-exome sequencing, and copy number analyses of germline and tumor DNA collected from members of a single large family. We assessed expression of mutant protein, protein activity, and regulation of apoptosis and senescence in lymphoblasts derived from the cases. RESULTS: Familial and sporadic carcinoids are clinically indistinguishable except for the multiple synchronous primary tumors observed in most familial cases. Nearly 34% of asymptomatic relatives older than age 50 were found to have occult tumors; the tumors were cleared surgically from 87% of these individuals (20 of 23). Linkage analysis and whole-exome sequencing identified a germline 4-bp deletion in the gene inositol polyphosphate multikinase (IPMK), which truncates the protein. This mutation was detected in all 11 individuals with small intestinal carcinoids and in 17 of 35 family members whose carcinoid status was unknown. Mutant IPMK had reduced kinase activity and nuclear localization, compared with the full-length protein. This reduced activation of p53 and increased cell survival. CONCLUSIONS: We found that small intestinal carcinoids can occur as an inherited autosomal-dominant disease. The familial form is characterized by multiple synchronous primary tumors, which might account for 22%-35% of cases previously considered sporadic. Relatives of patients with familial carcinoids should be screened to detect curable early stage disease. IPMK haploinsufficiency promotes carcinoid tumorigenesis.


Subjects
Carcinoid Tumor/genetics; Germ-Line Mutation; Intestinal Neoplasms/genetics; Phosphotransferases (Alcohol Group Acceptor)/genetics; Adolescent; Adult; Aged; Aged, 80 and over; Carcinoid Tumor/diagnosis; Carcinoid Tumor/pathology; Family; Female; Humans; Intestinal Neoplasms/diagnosis; Intestinal Neoplasms/pathology; Laparotomy; Male; Middle Aged; Pedigree; Prospective Studies; Young Adult
7.
medRxiv ; 2024 May 16.
Article in English | MEDLINE | ID: mdl-38903069

ABSTRACT

Whole-genome sequencing of bacterial pathogens is used by public health agencies to link cases of food poisoning caused by the same source of contamination. The vast majority of these appear to be sporadic cases associated with small contamination episodes and do not trigger investigations. We analyzed clusters of sequenced clinical isolates of Salmonella, Escherichia coli, Campylobacter, and Listeria that differ by only a small number of mutations to provide a new understanding of the underlying contamination episodes. These analyses provide new evidence that the youngest age groups have greater susceptibility to infection from Salmonella, Escherichia coli, and Campylobacter than older age groups. This age bias is weaker for the common Salmonella serovar Enteritidis than Salmonella in general. Analysis of these clusters reveals significant regional variations in relative frequencies of Salmonella serovars across the United States. A large fraction of the contamination episodes causing sickness appear to have long duration. For example, 50% of the Salmonella cases are in clusters that persist for almost three years. For all four pathogen species, the majority of the cases were part of genetic clusters with illnesses in multiple states and likely to be caused by contaminated commercially distributed foods. The vast majority of Salmonella cases among infants < 6 months of age appear to be caused by cross-contamination from foods consumed by older age groups or by environmental bacteria rather than infant formula contaminated at production sites.

8.
PLoS One ; 19(1): e0291406, 2024.
Article in English | MEDLINE | ID: mdl-38241320

ABSTRACT

Candida auris is a newly emerged multidrug-resistant fungus capable of causing invasive infections with high mortality. Despite intense efforts to understand how this pathogen rapidly emerged and spread worldwide, its environmental reservoirs are poorly understood. Here, we present a collaborative effort between the U.S. Centers for Disease Control and Prevention, the National Center for Biotechnology Information, and GridRepublic (a volunteer computing platform) to identify C. auris sequences in publicly available metagenomic datasets. We developed the MetaNISH pipeline that uses SRPRISM to align sequences to a set of reference genomes and computes a score for each reference genome. We used MetaNISH to scan ~300,000 SRA metagenomic runs from 2010 onwards and identified five datasets containing C. auris reads. Finally, GridRepublic has implemented a prospective C. auris molecular monitoring system using MetaNISH and volunteer computing.


Subjects
Candida; Candidiasis; Humans; Candida/genetics; Candidiasis/microbiology; Candida auris; Prospective Studies; Metagenomics; Antifungal Agents/therapeutic use
9.
Immunogenetics ; 65(10): 749-62, 2013 Oct.
Article in English | MEDLINE | ID: mdl-23925440

ABSTRACT

We report on the analyses of genes encoding immunoglobulin heavy and light chains in the rabbit 6.51× whole genome assembly. This OryCun2.0 assembly confirms previous mapping of the duplicated IGK1 and IGK2 loci to chromosome 2 and the IGL lambda light chain locus to chromosome 21. The most frequently rearranged and expressed IGHV1 that is closest to IG DH and IGHJ genes encodes rabbit VHa allotypes. The partially inbred Thorbecke strain rabbit used for whole-genome sequencing was homozygous at the IGK but heterozygous with the IGHV1a1 allele in one of 79 IGHV-containing unplaced scaffolds and IGHV1a2, IGHM, IGHG, and IGHE sequences in another. Some IGKV, IGLV, and IGHA genes are also in other unplaced scaffolds. By fluorescence in situ hybridization, we assigned the previously unmapped IGH locus to the q-telomeric region of rabbit chromosome 20. An approximately 3-Mb segment of human chromosome 14 including IGH genes predicted to map to this telomeric region based on synteny analysis could not be located on assembled chromosome 20. Unplaced scaffold chrUn0053 contains some of the genes that comparative mapping predicts to be missing. We identified discrepancies between previous targeted studies and the OryCun2.0 assembly and some new BAC clones with IGH sequences that can guide other studies to further sequence and improve the OryCun2.0 assembly. Complete knowledge of gene sequences encoding variable regions of rabbit heavy, kappa, and lambda chains will lead to better understanding of how and why rabbits produce antibodies of high specificity and affinity through gene conversion and somatic hypermutation.


Subjects
Chromosomes, Mammalian/genetics; Computational Biology/methods; Genome; Immunoglobulin Heavy Chains/genetics; Immunoglobulins/genetics; Animals; Chromosome Mapping; Chromosomes, Artificial, Bacterial/genetics; Female; Humans; Immunoglobulin Allotypes/blood; Immunoglobulin Allotypes/genetics; Immunoglobulin Variable Region/genetics; Immunoglobulin kappa-Chains/genetics; Immunoglobulin lambda-Chains/genetics; In Situ Hybridization, Fluorescence; Male; Rabbits; Reproducibility of Results
10.
Nat Genet ; 32(1): 175-9, 2002 Sep.
Article in English | MEDLINE | ID: mdl-12185364

ABSTRACT

The disorder Amish microcephaly (MCPHA) is characterized by severe congenital microcephaly, elevated levels of alpha-ketoglutarate in the urine and premature death. The disorder is inherited in an autosomal recessive pattern and has been observed only in Old Order Amish families whose ancestors lived in Lancaster County, Pennsylvania. Here we show, by using a genealogy database and automated pedigree software, that 23 nuclear families affected with MCPHA are connected to a single ancestral couple. Through a whole-genome scan, fine mapping and haplotype analysis, we localized the gene affected in MCPHA to a region of 3 cM, or 2 Mb, on chromosome 17q25. We constructed a map of contiguous genomic clones spanning this region. One of the genes in this region, SLC25A19, which encodes a nuclear mitochondrial deoxynucleotide carrier (DNC), contains a substitution that segregates with the disease in affected individuals and alters an amino acid that is highly conserved in similar proteins. Functional analysis shows that the mutant DNC protein lacks the normal transport activity, implying that failed deoxynucleotide transport across the inner mitochondrial membrane causes MCPHA. Our data indicate that mitochondrial deoxynucleotide transport may be essential for prenatal brain growth.


Subjects
Carrier Proteins/genetics; Deoxyribonucleotides/metabolism; Membrane Transport Proteins; Microcephaly/genetics; Carrier Proteins/metabolism; Christianity; Chromosomes, Human, Pair 17; Cloning, Molecular; Escherichia coli; Ethnicity; Female; Genetic Markers; Haplotypes; Humans; Lod Score; Male; Mitochondrial Membrane Transport Proteins; Mutation; Pedigree; Physical Chromosome Mapping; Recombinant Proteins/genetics; Recombinant Proteins/metabolism
11.
PLoS Biol ; 7(5): e1000112, 2009 May 05.
Article in English | MEDLINE | ID: mdl-19468303

ABSTRACT

The mouse (Mus musculus) is the premier animal model for understanding human disease and development. Here we show that a comprehensive understanding of mouse biology is only possible with the availability of a finished, high-quality genome assembly. The finished clone-based assembly of the mouse strain C57BL/6J reported here has over 175,000 fewer gaps and over 139 Mb more of novel sequence, compared with the earlier MGSCv3 draft genome assembly. In a comprehensive analysis of this revised genome sequence, we are now able to define 20,210 protein-coding genes, over a thousand more than predicted in the human genome (19,042 genes). In addition, we identified 439 long, non-protein-coding RNAs with evidence for transcribed orthologs in human. We analyzed the complex and repetitive landscape of 267 Mb of sequence that was missing or misassembled in the previously published assembly, and we provide insights into the reasons for its resistance to sequencing and assembly by whole-genome shotgun approaches. Duplicated regions within newly assembled sequence tend to be of more recent ancestry than duplicates in the published draft, correcting our initial understanding of recent evolution on the mouse lineage. These duplicates appear to be largely composed of sequence regions containing transposable elements and duplicated protein-coding genes; of these, some may be fixed in the mouse population, but at least 40% of segmentally duplicated sequences are copy number variable even among laboratory mouse strains. Mouse lineage-specific regions contain 3,767 genes drawn mainly from rapidly-changing gene families associated with reproductive functions. The finished mouse genome assembly, therefore, greatly improves our understanding of rodent-specific biology and allows the delineation of ancestral biological functions that are shared with human from derived functions that are not.


Subjects
Computational Biology/methods; Genome/genetics; Animals; Databases, Genetic; Gene Duplication; Genome/physiology; Humans; Mice
12.
J Food Prot ; 85(5): 755-772, 2022 May 01.
Article in English | MEDLINE | ID: mdl-35259246

ABSTRACT

ABSTRACT: This multiagency report developed by the Interagency Collaboration for Genomics for Food and Feed Safety provides an overview of the use of and transition to whole genome sequencing (WGS) technology for detection and characterization of pathogens transmitted commonly by food and for identification of their sources. We describe foodborne pathogen analysis, investigation, and harmonization efforts among the following federal agencies: National Institutes of Health; Department of Health and Human Services, Centers for Disease Control and Prevention (CDC) and U.S. Food and Drug Administration (FDA); and the U.S. Department of Agriculture, Food Safety and Inspection Service, Agricultural Research Service, and Animal and Plant Health Inspection Service. We describe single nucleotide polymorphism, core-genome, and whole genome multilocus sequence typing data analysis methods as used in the PulseNet (CDC) and GenomeTrakr (FDA) networks, underscoring the complementary nature of the results for linking genetically related foodborne pathogens during outbreak investigations while allowing flexibility to meet the specific needs of Interagency Collaboration partners. We highlight how we apply WGS to pathogen characterization (virulence and antimicrobial resistance profiles) and source attribution efforts and increase transparency by making the sequences and other data publicly available through the National Center for Biotechnology Information. We also highlight the impact of current trends in the use of culture-independent diagnostic tests for human diagnostic testing on analytical approaches related to food safety and what is next for the use of WGS in the area of food safety.


Subjects
Foodborne Diseases; Animals; Disease Outbreaks/prevention & control; Food Safety; Foodborne Diseases/epidemiology; Foodborne Diseases/prevention & control; Genomics; United States; Whole Genome Sequencing
13.
Neurogenetics ; 12(3): 223-32, 2011 Aug.
Article in English | MEDLINE | ID: mdl-21643798

ABSTRACT

We recently reported autosomal recessive fetal-onset neuroaxonal dystrophy (FNAD) in a large family of dogs that is not caused by mutation in the PLA2G6 locus (Fyfe et al., J Comp Neurol 518:3771-3784, 2010). Here, we report a genome-wide linkage analysis using 333 microsatellite markers to map canine FNAD to the telomeric end of chromosome 2. The interval of zero recombination was refined by single-nucleotide polymorphism (SNP) haplotype analysis to ~200 kb, and the included genes were sequenced. We found a homozygous 3-nucleotide deletion in exon 14 of mitofusin 2 (MFN2), predicting loss of a glutamate residue at position 539 in the protein of affected dogs. RT-PCR demonstrated near normal expression of the mutant mRNA, but MFN2 expression was undetectable to very low on western blots of affected dog brainstem, cerebrum, kidney, and cultured fibroblasts and by immunohistochemistry on brainstem sections. MFN2 is a multifunctional, membrane-bound GTPase of mitochondria and endoplasmic reticulum most commonly associated with human Charcot-Marie-Tooth disease type 2A2. The canine disorder extends the range of MFN2-associated phenotypes and suggests MFN2 as a candidate gene for rare cases of human FNAD.


Subjects
Dog Diseases/genetics; Fetal Diseases/genetics; GTP Phosphohydrolases/genetics; Membrane Proteins/genetics; Mitochondrial Proteins/genetics; Mutation; Neuroaxonal Dystrophies/genetics; Age of Onset; Amino Acid Sequence; Animals; Dog Diseases/epidemiology; Dogs; Family; Fetal Diseases/epidemiology; Fetal Diseases/veterinary; GTP Phosphohydrolases/physiology; Membrane Proteins/physiology; Mitochondrial Proteins/physiology; Molecular Sequence Data; Mutation/physiology; Neuroaxonal Dystrophies/epidemiology; Neuroaxonal Dystrophies/veterinary; Pedigree; Polymorphism, Single Nucleotide/physiology; Sequence Homology, Amino Acid
14.
Nucleic Acids Res ; 37(3): 815-24, 2009 Feb.
Article in English | MEDLINE | ID: mdl-19088134

ABSTRACT

Position specific score matrices (PSSMs) are derived from multiple sequence alignments to aid in the recognition of distant protein sequence relationships. The PSI-BLAST protein database search program derives the column scores of its PSSMs with the aid of pseudocounts, added to the observed amino acid counts in a multiple alignment column. In the absence of theory, the number of pseudocounts used has been a completely empirical parameter. This article argues that the minimum description length principle can motivate the choice of this parameter. Specifically, for realistic alignments, the principle supports the practice of using a number of pseudocounts essentially independent of alignment size. However, it also implies that more highly conserved columns should use fewer pseudocounts, increasing the inter-column contrast of the implied PSSMs. A new method for calculating pseudocounts that significantly improves PSI-BLAST's retrieval accuracy is now employed by default.
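
To make the role of pseudocounts concrete, here is a deliberately simplified sketch of how they enter a column's log-odds scores. It is not PSI-BLAST's procedure and not the minimum-description-length method of the article: it assumes pseudocounts are spread according to background frequencies, and it uses a toy four-letter alphabet instead of the 20 amino acids. The function name and parameters are illustrative.

```python
import math


def column_scores(counts, background, pseudocounts):
    """Log-odds scores (in bits) for one alignment column.

    counts:       observed residue counts in the column
    background:   background frequency of each residue (sums to 1)
    pseudocounts: total number of pseudocounts added to the column
                  (assumption: distributed in proportion to background)
    """
    total = sum(counts.values())
    scores = {}
    for residue, p in background.items():
        # Estimated frequency mixes observed counts with pseudocounts.
        q = (counts.get(residue, 0) + pseudocounts * p) / (total + pseudocounts)
        scores[residue] = math.log2(q / p)
    return scores
```

With a fully conserved column, lowering `pseudocounts` raises the score of the conserved residue and lowers the others, which is the increased inter-column contrast the abstract refers to.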


Subjects
Sequence Alignment/methods; Sequence Analysis, Protein/methods; Databases, Protein
15.
Nucleic Acids Res ; 37(Database issue): D216-23, 2009 Jan.
Article in English | MEDLINE | ID: mdl-18940865

ABSTRACT

Rapid increases in DNA sequencing capabilities have led to a vast increase in the data generated from prokaryotic genomic studies, which has been a boon to scientists studying micro-organism evolution and to those who wish to understand the biological underpinnings of microbial systems. The NCBI Protein Clusters Database (ProtClustDB) has been created to efficiently maintain and keep the deluge of data up to date. ProtClustDB contains both curated and uncurated clusters of proteins grouped by sequence similarity. The May 2008 release contains a total of 285 386 clusters derived from over 1.7 million proteins encoded by 3806 nt sequences from the RefSeq collection of complete chromosomes and plasmids from four major groups: prokaryotes, bacteriophages and the mitochondrial and chloroplast organelles. There are 7180 clusters containing 376 513 proteins with curated gene and protein functional annotation. PubMed identifiers and external cross references are collected for all clusters and provide additional information resources. A suite of web tools is available to explore more detailed information, such as multiple alignments, phylogenetic trees and genomic neighborhoods. ProtClustDB provides an efficient method to aggregate gene and protein annotation for researchers and is available at http://www.ncbi.nlm.nih.gov/sites/entrez?db=proteinclusters.


Subjects
Databases, Protein; Proteins/classification; Cluster Analysis; Genomics; Proteins/chemistry; Proteins/genetics; Sequence Homology, Amino Acid
16.
BMC Med Genet ; 11: 68, 2010 May 02.
Article in English | MEDLINE | ID: mdl-20433770

ABSTRACT

BACKGROUND: Because they are a closed founder population, the Old Order Amish (OOA) of Lancaster County have been the subject of many medical genetics studies. We constructed four versions of Anabaptist Genealogy Database (AGDB) using three sources of genealogies and multiple updates. In addition, we developed PedHunter, a suite of query software that can solve pedigree-related problems automatically and systematically. METHODS: We report on how we have used new features in PedHunter to quantify the number and expected genetic contribution of founders to the OOA. The queries and utility of PedHunter programs are illustrated by examples using AGDB in this paper. For example, we calculated the number of founders expected to be contributing genetic material to the present-day living OOA and estimated the mean relative founder representation for each founder. New features in PedHunter also include pedigree trimming and pedigree renumbering, which should prove useful for studying large pedigrees. RESULTS: With PedHunter version 2.0 querying AGDB version 4.0, we identified 34,160 presumed living OOA individuals and connected them into a 14-generation pedigree descending from 554 founders (332 females and 222 males) after trimming. From the analysis of cumulative mean relative founder representation, 128 founders (78 females and 50 males) accounted for over 95% of the mean relative founder contribution among living OOA descendants. DISCUSSION/CONCLUSIONS: The OOA are a closed founder population in which a modest number of founders account for the genetic variation present in the current OOA population. Improvements to the PedHunter software will be useful in future studies of both the OOA and other populations with large and computerized genealogies.
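
The idea of a founder's expected genetic contribution can be sketched with a simple recursion over a pedigree. This is an illustrative assumption, not PedHunter's implementation or its exact definition of mean relative founder representation: each founder is taken to contribute fully to itself, each child receives half of each parent's contributions, and the result is averaged over a chosen set of individuals. The data layout and function name are hypothetical.

```python
def founder_contributions(pedigree, individuals):
    """Average expected founder contribution across the given individuals.

    pedigree maps person -> (father, mother); founders map to (None, None).
    Assumption: every non-founder has both parents recorded in the pedigree.
    """
    cache = {}

    def contribution(person):
        if person in cache:
            return cache[person]
        father, mother = pedigree[person]
        if father is None and mother is None:
            result = {person: 1.0}
        else:
            result = {}
            for parent in (father, mother):
                # A child inherits half of each parent's founder shares.
                for founder, share in contribution(parent).items():
                    result[founder] = result.get(founder, 0.0) + 0.5 * share
        cache[person] = result
        return result

    totals = {}
    for person in individuals:
        for founder, share in contribution(person).items():
            totals[founder] = totals.get(founder, 0.0) + share / len(individuals)
    return totals
```

Sorting the returned shares in descending order and accumulating them gives the kind of cumulative curve from which a statement such as "128 founders accounted for over 95%" could be read off, under the assumptions above.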


Subjects
Founder Effect; Protestantism; Chromosome Mapping; Databases, Genetic; Female; Genealogy and Heraldry; Humans; Male; Marriage; Models, Genetic; Pedigree; Pennsylvania
17.
Genomics ; 93(4): 299-304, 2009 Apr.
Article in English | MEDLINE | ID: mdl-18951970

ABSTRACT

We describe the construction of a high-resolution radiation hybrid (RH) map of the domestic cat genome, which includes 2662 markers, translating to an estimated average intermarker distance of 939 kilobases (kb). Targeted marker selection utilized the recent feline 1.9x genome assembly, concentrating on regions of low marker density on feline autosomes and the X chromosome, in addition to regions flanking interspecies chromosomal breakpoints. Average gap (breakpoint) size between cat-human ordered conserved segments is less than 900 kb. The map was used for a fine-scale comparison of conserved syntenic blocks with the human and canine genomes. Corroborative fluorescence in situ hybridization (FISH) data were generated using 129 domestic cat BAC clones as probes, providing independent confirmation of the long-range correctness of the map. Cross-species hybridization of BAC probes on divergent felids from the genera Profelis (serval) and Panthera (snow leopard) provides further evidence for karyotypic conservation within felids, and demonstrates the utility of such probes for future studies of chromosome evolution within the cat family and in related carnivores. The integrated map constitutes a comprehensive framework for identifying genes controlling feline phenotypes of interest, and to aid in assembly of a higher coverage feline genome sequence.


Subjects
Cats/genetics; Felidae/genetics; Genome; In Situ Hybridization, Fluorescence/methods; Phylogeny; Radiation Hybrid Mapping/methods; Animals; Genetic Markers; Genotype; Synteny/genetics
18.
Gigascience ; 9(4), 2020 Apr 01.
Article in English | MEDLINE | ID: mdl-32315028

ABSTRACT

BACKGROUND: Alignment of sequence reads generated by next-generation sequencing is an integral part of most pipelines analyzing next-generation sequencing data. A number of tools designed to quickly align a large volume of sequences are already available. However, most existing tools lack explicit guarantees about their output. They also do not support searching genome assemblies, such as the human genome assembly GRCh38, that include primary and alternate sequences and placement information for alternate sequences to primary sequences in the assembly. FINDINGS: This paper describes SRPRISM (Single Read Paired Read Indel Substitution Minimizer), an alignment tool for aligning reads without splices. SRPRISM has features not available in most tools, such as (i) support for searching genome assemblies with alternate sequences, (ii) partial alignment of reads with a specified region of reads to be included in the alignment, (iii) choice of ranking schemes for alignments, and (iv) explicit criteria for search sensitivity. We compare the performance of SRPRISM to GEM, Kart, STAR, BWA-MEM, Bowtie2, Hobbes, and Yara using benchmark sets for paired and single reads of lengths 100 and 250 bp generated using DWGSIM. SRPRISM found the best results for most benchmark sets with error rate of up to ∼2.5% and GEM performed best for higher error rates. SRPRISM was also more sensitive than other tools even when sensitivity was reduced to improve run time performance. CONCLUSIONS: We present SRPRISM as a flexible read mapping tool that provides explicit guarantees on results.


Subjects
Genome, Human/genetics , High-Throughput Nucleotide Sequencing/methods , INDEL Mutation/genetics , Sequence Alignment/methods , Algorithms , Humans , Sequence Analysis, DNA , Software
19.
Bioinformatics ; 24(16): 1757-64, 2008 Aug 15.
Article in English | MEDLINE | ID: mdl-18567917

ABSTRACT

MOTIVATION: The BLAST software package for sequence comparison speeds up homology search by preprocessing a query sequence into a lookup table. Numerous research studies have suggested that preprocessing the database instead would give better performance. However, production usage of sequence comparison methods that preprocess the database has been limited to programs such as BLAT and SSAHA that are designed to find matches when query and database subsequences are highly similar. RESULTS: We developed a new version of the MegaBLAST module of BLAST that does the initial phase of finding short seeds for matches by searching a database index. We also developed a program makembindex that preprocesses the database into a data structure for rapid seed searching. We show that the new 'indexed MegaBLAST' is faster than the 'non-indexed' version for most practical uses. We show that indexed MegaBLAST is faster than miBLAST, another implementation of BLAST nucleotide searching with a preprocessed database, for most of the 200 queries we tested. To deploy indexed MegaBLAST as part of NCBI's Web BLAST service, the storage of databases and the queueing mechanism were modified, so that some machines are now dedicated to serving queries for a specific database. The response time for such Web queries is now faster than it was when each computer handled queries for multiple databases. AVAILABILITY: The code for indexed MegaBLAST is part of the blastn program in the NCBI C++ toolkit. The preprocessor program makembindex is also in the toolkit. Indexed MegaBLAST has been used in production on NCBI's Web BLAST service to search one version of the human and mouse genomes since October 2007. The Linux command-line executables for blastn and makembindex, documentation, and some query sets used to carry out the tests described below are available in the directory: ftp://ftp.ncbi.nlm.nih.gov/pub/agarwala/indexed_megablast [corrected] SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.


Subjects
Database Management Systems , Databases, Protein , Information Storage and Retrieval/methods , Proteins/chemistry , Sequence Analysis, Protein/methods , Software , User-Computer Interface , Amino Acid Sequence , Molecular Sequence Data , Sequence Alignment/methods