Results 1 - 20 of 334
1.
Nucleic Acids Res ; 52(D1): D1246-D1252, 2024 Jan 05.
Article in English | MEDLINE | ID: mdl-37956338

ABSTRACT

Advancements in high-throughput technology offer researchers an extensive range of multi-omics data that provide deep insights into the complex landscape of cancer biology. However, traditional statistical models and databases are inadequate to interpret these high-dimensional data within a multi-omics framework. To address this limitation, we introduce DriverDBv4, an updated iteration of the DriverDB cancer driver gene database (http://driverdb.bioinfomics.org/). This updated version offers several significant enhancements: (i) an increase in the number of cohorts from 33 to 70, encompassing approximately 24 000 samples; (ii) inclusion of proteomics data, augmenting the existing types of omics data and thus expanding the analytical scope; (iii) implementation of multiple multi-omics algorithms for identification of cancer drivers; (iv) new visualization features designed to succinctly summarize high-context data and redesigned existing sections to accommodate the increased volume of datasets and (v) two new functions in Customized Analysis, specifically designed for multi-omics driver identification and subgroup expression analysis. DriverDBv4 facilitates comprehensive interpretation of multi-omics data across diverse cancer types, thereby enriching the understanding of cancer heterogeneity and aiding in the development of personalized clinical approaches. The database is designed to foster a more nuanced understanding of the multi-faceted nature of cancer.


Subjects
Databases, Genetic ; Multiomics ; Neoplasms ; Humans ; Algorithms ; Databases, Genetic/standards ; Neoplasms/genetics ; Neoplasms/physiopathology
2.
Nucleic Acids Res ; 52(D1): D174-D182, 2024 Jan 05.
Article in English | MEDLINE | ID: mdl-37962376

ABSTRACT

JASPAR (https://jaspar.elixir.no/) is a widely-used open-access database presenting manually curated high-quality and non-redundant DNA-binding profiles for transcription factors (TFs) across taxa. In this 10th release and 20th-anniversary update, the CORE collection has expanded with 329 new profiles. We updated three existing profiles and provided orthogonal support for 72 profiles from the previous release's UNVALIDATED collection. Altogether, the JASPAR 2024 update provides a 20% increase in CORE profiles from the previous release. A trimming algorithm enhanced profiles by removing low information content flanking base pairs, which were likely uninformative (within the capacity of the PFM models) for TFBS predictions and modelling TF-DNA interactions. This release includes enhanced metadata, featuring a refined classification for plant TFs' structural DNA-binding domains. The new JASPAR collections prompt updates to the genomic tracks of predicted TF binding sites (TFBSs) in 8 organisms, with human and mouse tracks available as native tracks in the UCSC Genome browser. All data are available through the JASPAR web interface and programmatically through its API and the updated Bioconductor and pyJASPAR packages. Finally, a new TFBS extraction tool enables users to retrieve predicted JASPAR TFBSs intersecting their genomic regions of interest.
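
The flank-trimming idea mentioned in this abstract can be sketched as follows. This is only an illustration under stated assumptions, not the JASPAR implementation: it assumes a uniform background, scores each position of a position frequency matrix (PFM) by its information content in bits, and removes leading and trailing positions that fall below a hypothetical cutoff `min_ic` of 0.5 bits. The function names and the example matrix are invented for the illustration.

```python
import math


def information_content(column):
    """Information content (bits) of one PFM column [A, C, G, T],
    assuming a uniform background and a non-empty column."""
    total = sum(column)
    ic = 2.0  # log2(4): maximum for a four-letter alphabet
    for count in column:
        p = count / total
        if p > 0:
            ic += p * math.log2(p)
    return ic


def trim_flanks(pfm, min_ic=0.5):
    """Drop leading and trailing columns whose information content is
    below min_ic (a hypothetical cutoff). Interior columns are kept."""
    ics = [information_content(col) for col in pfm]
    start = 0
    while start < len(pfm) and ics[start] < min_ic:
        start += 1
    end = len(pfm)
    while end > start and ics[end - 1] < min_ic:
        end -= 1
    return pfm[start:end]


# Invented example: one column per position, counts for A, C, G, T.
pfm = [
    [25, 25, 25, 25],  # uninformative left flank
    [100, 0, 0, 0],
    [0, 0, 100, 0],
    [30, 20, 25, 25],  # low-information interior position, kept
    [0, 100, 0, 0],
    [26, 24, 25, 25],  # near-uninformative right flank
]
trimmed = trim_flanks(pfm)
```

Only the flanks are trimmed here, so a weakly conserved position inside the motif survives; whether the published algorithm treats interior positions the same way is not stated in the abstract.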


Subjects
Databases, Genetic ; Protein Binding ; Transcription Factors ; Animals ; Humans ; Mice ; Databases, Genetic/standards ; Databases, Genetic/trends ; Transcription Factors/genetics ; Transcription Factors/metabolism ; Plants/genetics
4.
Genetics ; 220(4)2022 04 04.
Article in English | MEDLINE | ID: mdl-35380658

ABSTRACT

The Alliance of Genome Resources (the Alliance) is a combined effort of 7 knowledgebase projects: Saccharomyces Genome Database, WormBase, FlyBase, Mouse Genome Database, the Zebrafish Information Network, Rat Genome Database, and the Gene Ontology Resource. The Alliance seeks to provide several benefits: better service to the various communities served by these projects; a harmonized view of data for all biomedical researchers, bioinformaticians, clinicians, and students; and a more sustainable infrastructure. The Alliance has harmonized cross-organism data to provide useful comparative views of gene function, gene expression, and human disease relevance. The basis of the comparative views is shared calls of orthology relationships and the use of common ontologies. The key types of data are alleles and variants, gene function based on gene ontology annotations, phenotypes, association to human disease, gene expression, protein-protein and genetic interactions, and participation in pathways. The information is presented on uniform gene pages that allow facile summarization of information about each gene in each of the 7 organisms covered (budding yeast, roundworm Caenorhabditis elegans, fruit fly, house mouse, zebrafish, brown rat, and human). The harmonized knowledge is freely available on the alliancegenome.org portal, as downloadable files, and by APIs. We expect other existing and emerging knowledge bases to join in the effort to provide the union of useful data and features that each knowledge base currently provides.


Subjects
Databases, Genetic ; Alleles ; Animals ; Caenorhabditis elegans/genetics ; Databases, Genetic/standards ; Drosophila/genetics ; Gene Ontology ; Humans ; Internet ; Mice/genetics ; Molecular Sequence Annotation ; Rats/genetics ; Saccharomycetales/genetics ; Zebrafish/genetics
5.
Gene ; 814: 146154, 2022 Mar 10.
Article in English | MEDLINE | ID: mdl-34995735

ABSTRACT

Transfer RNAs (tRNAs) are ancient molecules likely predating the translation machinery. These extremely conserved RNA molecules transfer amino acids to the ribosome for the synthesis of proteins encoded by mRNAs, but canonical tRNAs are not protein-coding RNAs. Surprisingly, when tRNA sequences were virtually translated, I observed that the resulting peptides match thousands of protein entries in databases. The analysis of these sequences indicates that the vast majority of these tRNA-derived proteins are annotated as small hypothetical peptides, likely arising from sequencing, prediction and/or annotation errors. But life often surpasses fiction. Importantly, tRNA-encoded amino acid domains were also found embedded in large functional proteins. Phylogenetic analysis of representative tRNA-derived protein domains may provide new insights into the origin, plasticity, and evolution of protein-coding genes.
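
"Virtual translation" here means reading a non-coding sequence through the genetic code as if it were an mRNA. A minimal sketch, assuming the standard genetic code and the three forward reading frames only; the helper names and the short input sequence are hypothetical (the sequence is not a real tRNA), and this is not the author's pipeline.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order (TTT, TTC, TTA, ...).
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(seq, frame=0):
    """Translate an RNA or DNA string in the given forward frame (0, 1 or 2).
    Stop codons are written as '*'; unrecognized codons as 'X'."""
    dna = seq.upper().replace("U", "T")
    codons = (dna[i:i + 3] for i in range(frame, len(dna) - 2, 3))
    return "".join(CODON_TABLE.get(codon, "X") for codon in codons)


# Hypothetical toy sequence, translated in all three forward frames.
frames = [translate("AUGGCCUAA", frame) for frame in range(3)]
```

The translated strings could then be searched against a protein database; that search step is outside this sketch.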


Subjects
Proteins/genetics ; RNA, Transfer ; Bacteria/genetics ; Databases, Genetic/standards ; Fungi/genetics ; Humans ; Plants/genetics ; Protein Domains/genetics ; RNA, Bacterial ; RNA, Fungal ; RNA, Plant
6.
Genes (Basel) ; 12(10)2021 09 28.
Article in English | MEDLINE | ID: mdl-34680918

ABSTRACT

Gene set analysis has been widely used to gain insight from high-throughput expression studies. Although various tools and methods have been developed for gene set analysis, there is no consensus among researchers regarding best practice(s). Most often, evaluation studies have reported contradictory recommendations of which methods are superior. Therefore, an unbiased quantitative framework for evaluations of gene set analysis methods will be valuable. Such a framework requires gene expression datasets where enrichment status of gene sets is known a priori. In the absence of such gold standard datasets, artificial datasets are commonly used for evaluations of gene set analysis methods; however, they often rely on oversimplifying assumptions that make them biased in favor of or against a given method. In this paper, we propose a quantitative framework for evaluation of gene set analysis methods by synthesizing expression datasets using real data, without relying on oversimplifying or unrealistic assumptions, while preserving complex gene-gene correlations and retaining the distribution of expression values. The utility of the quantitative approach is shown by evaluating ten widely used gene set analysis methods. An implementation of the proposed method is publicly available. We suggest using Silver to evaluate existing and new gene set analysis methods. Evaluation using Silver provides a better understanding of current methods and can aid in the development of gene set analysis methods to achieve higher specificity without sacrificing sensitivity.


Subjects
Databases, Genetic/standards ; Genomics/methods ; Software ; Datasets as Topic/standards
7.
Am J Hum Genet ; 108(10): 1813-1816, 2021 10 07.
Article in English | MEDLINE | ID: mdl-34626580

ABSTRACT

The use of approved nomenclature in publications is vital to enable effective scientific communication and is particularly crucial when discussing genes of clinical relevance. Here, we discuss several examples of cases where the failure of researchers to use a HUGO Gene Nomenclature Committee (HGNC)-approved symbol in publications has led to confusion between unrelated human genes in the literature. We also inform authors of the steps they can take to ensure that they use approved nomenclature in their manuscripts and discuss how referencing HGNC IDs can remove ambiguity when referring to genes that have previously been published with confusing alias symbols.


Subjects
Databases, Genetic/standards ; Genes/genetics ; Genome, Human ; Research Personnel/standards ; Terminology as Topic ; Genomics ; Humans
8.
Eur J Hum Genet ; 29(12): 1796-1803, 2021 12.
Article in English | MEDLINE | ID: mdl-34521998

ABSTRACT

Gene variant databases are the backbone of DNA-based diagnostics. These databases, also called Locus-Specific DataBases (LSDBs), store information on variants in the human genome and the observed phenotypic consequences. The largest collection of public databases uses the free, open-source LOVD software platform. To cope with the current demand for online databases, we have entirely redesigned the LOVD software. LOVD3 is genome-centered and can be used to store summary variant data, as well as full case-level data with information on individuals, phenotypes, screenings, and variants. While built on a standard core, the software is highly flexible and allows personalization to cope with the widely differing demands of gene/disease database curators. LOVD3 follows current standards and includes tools to check variant descriptions, generate HTML files of reference sequences, predict the consequences of exon deletions/duplications on the reading frame, and link to genomic views in different genome browsers. It includes APIs to collect and submit data. The software is used by about 100 databases, of which 56 public LOVD instances are registered on our website and together contain 1,000,000,000 variant observations in 1,500,000 individuals. 42 LOVD instances share data with the federated LOVD data network containing 3,000,000 unique variants in 23,000 genes. This network can be queried directly, quickly identifying LOVD instances containing relevant information on a searched variant.


Subjects
Databases, Genetic/standards ; Polymorphism, Genetic ; Software ; Genetic Predisposition to Disease ; Genome, Human ; Genome-Wide Association Study/methods ; Humans
9.
Genes (Basel) ; 12(6)2021 06 10.
Article in English | MEDLINE | ID: mdl-34200671

ABSTRACT

Technology to generate single-cell RNA-sequencing (scRNA-seq) datasets and tools to annotate them have advanced rapidly in the past several years. Such tools generally rely on existing transcriptomic datasets or curated databases of cell type defining genes, while the application of scalable natural language processing (NLP) methods to enhance analysis workflows has not been adequately explored. Here we deployed an NLP framework to objectively quantify associations between a comprehensive set of over 20,000 human protein-coding genes and over 500 cell type terms across over 26 million biomedical documents. The resultant gene-cell type associations (GCAs) are significantly stronger between a curated set of matched cell type-marker pairs than the complementary set of mismatched pairs (Mann-Whitney p = 6.15 × 10^-76, r = 0.24; Cohen's d = 2.6). Building on this, we developed an augmented annotation algorithm (single cell Annotation via Literature Encoding, or scALE) that leverages GCAs to categorize cell clusters identified in scRNA-seq datasets, and we tested its ability to predict the cellular identity of 133 clusters from nine datasets of human breast, colon, heart, joint, ovary, prostate, skin, and small intestine tissues. With the optimized settings, the true cellular identity matched the top prediction in 59% of tested clusters and was present among the top five predictions for 91% of clusters. scALE slightly outperformed an existing method for reference data driven automated cluster annotation, and we demonstrate that integration of scALE can meaningfully improve the annotations derived from such methods. Further, contextualization of differential expression analyses with these GCAs highlights poorly characterized markers of well-studied cell types, such as CLIC6 and DNASE1L3 in retinal pigment epithelial cells and endothelial cells, respectively. Taken together, this study illustrates for the first time how the systematic application of a literature-derived knowledge graph can expedite and enhance the annotation and interpretation of scRNA-seq data.


Subjects
Databases, Genetic/standards ; Natural Language Processing ; RNA-Seq/methods ; Single-Cell Analysis/methods ; Humans ; Molecular Sequence Annotation/methods ; Organ Specificity
10.
BMC Cancer ; 21(1): 810, 2021 Jul 15.
Article in English | MEDLINE | ID: mdl-34266411

ABSTRACT

BACKGROUND: Bladder cancer (BC) is the ninth most common malignant tumor. We constructed a risk signature using immune-related gene pairs (IRGPs) to predict the prognosis of BC patients. METHODS: The mRNA transcriptome, simple nucleotide variation and clinical data of BC patients were downloaded from The Cancer Genome Atlas (TCGA) database (TCGA-BLCA). The mRNA transcriptome and clinical data were also extracted from Gene Expression Omnibus (GEO) datasets (GSE31684). A risk signature was built based on the IRGPs. The ability of the signature to predict prognosis was analyzed with survival curves and Cox regression. The relationships between immunological parameters [immune cell infiltration, immune checkpoints, tumor microenvironment (TME) and tumor mutation burden (TMB)] and the risk score were investigated. Finally, gene set enrichment analysis (GSEA) was used to explore molecular mechanisms underlying the risk score. RESULTS: The risk signature utilized 30 selected IRGPs. The prognosis of the high-risk group was significantly worse than that of the low-risk group. We used the GSE31684 dataset to validate the signature. Close relationships were found between the risk score and immunological parameters. Finally, GSEA showed that gene sets related to the extracellular matrix (ECM), stromal cells and epithelial-mesenchymal transition (EMT) were enriched in the high-risk group. In the low-risk group, we found a number of immune-related pathways in the enriched pathways and biofunctions. CONCLUSIONS: We used a new tool, IRGPs, to build a risk signature to predict the prognosis of BC. By evaluating immune parameters and molecular mechanisms, we gained a better understanding of the mechanisms underlying the risk signature. This signature can also be used as a tool to predict the effect of immunotherapy in patients with BC.


Subjects
Databases, Genetic/standards ; Gene Expression Regulation, Neoplastic/genetics ; Urinary Bladder Neoplasms/genetics ; Aged ; Humans ; Prognosis ; Survival Analysis ; Urinary Bladder Neoplasms/mortality
11.
Clin Pharmacol Ther ; 110(3): 563-572, 2021 09.
Article in English | MEDLINE | ID: mdl-34216021

ABSTRACT

Clinical annotations are one of the most popular resources available on the Pharmacogenomics Knowledgebase (PharmGKB). Each clinical annotation summarizes the association between variant-drug pairs, shows relevant findings from the curated literature, and is assigned a level of evidence (LOE) to indicate the strength of support for that association. Evidence from the pharmacogenomic literature is curated into PharmGKB as variant annotations, which can be used to create new clinical annotations or added to existing clinical annotations. This means that the same clinical annotation can be worked on by multiple curators over time. As more evidence is curated into PharmGKB, the task of maintaining consistency when assessing all the available evidence and assigning an LOE becomes increasingly difficult. To remedy this, a scoring system has been developed to automate LOE assignment to clinical annotations. Variant annotations are scored according to certain attributes, including study size, reported P value, and whether the variant annotation supports or fails to find an association. Clinical guidelines or US Food and Drug Administration (FDA)-approved drug labels which give variant-specific prescribing guidance are also scored. The scores of all annotations attached to a clinical annotation are summed together to give a total score for the clinical annotation, which is used to calculate an LOE. Overall, the system increases transparency, consistency, and reproducibility in LOE assignment to clinical annotations. In combination with increased standardization of how clinical annotations are written, use of this scoring system helps to ensure that PharmGKB clinical annotations continue to be a robust source of pharmacogenomic information.
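
The scoring flow described in this abstract (score each annotation, sum the scores, map the total to a level of evidence) can be sketched as below. Everything numeric here is a placeholder: the weights, the study-size and P-value cutoffs, the sign convention for annotations that fail to find an association, and the level names are all hypothetical and are not PharmGKB's actual scoring rules.

```python
from dataclasses import dataclass


@dataclass
class VariantAnnotation:
    study_size: int
    p_value: float
    supports_association: bool


def score_annotation(ann):
    """Score one annotation. All weights and cutoffs are HYPOTHETICAL."""
    score = 1.0
    if ann.study_size >= 500:  # hypothetical study-size cutoff
        score += 1.0
    if ann.p_value < 0.01:  # hypothetical P-value cutoff
        score += 1.0
    # Assumption: an annotation that fails to find an association
    # counts against the clinical annotation.
    return score if ann.supports_association else -score


def level_of_evidence(total):
    """Map a summed score to a level. Cutoffs and names are HYPOTHETICAL."""
    if total >= 5:
        return "strong"
    if total >= 2:
        return "moderate"
    return "weak"


annotations = [
    VariantAnnotation(study_size=1000, p_value=0.001, supports_association=True),
    VariantAnnotation(study_size=200, p_value=0.03, supports_association=True),
    VariantAnnotation(study_size=600, p_value=0.2, supports_association=False),
]
total = sum(score_annotation(a) for a in annotations)
level = level_of_evidence(total)
```

Because the level is a pure function of the attached annotations, re-running the calculation after a new annotation is curated gives the same answer regardless of which curator added it, which is the consistency property the abstract emphasizes.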


Subjects
Pharmacogenetics/standards ; Precision Medicine/standards ; Databases, Genetic/standards ; Drug Labeling/standards ; Drug Prescriptions/standards ; Humans ; Knowledge Bases ; Prescription Drugs/standards ; Reproducibility of Results
12.
PLoS Comput Biol ; 17(7): e1009113, 2021 07.
Article in English | MEDLINE | ID: mdl-34228723

ABSTRACT

PCR amplification plays an integral role in the measurement of mixed microbial communities via high-throughput DNA sequencing of the 16S ribosomal RNA (rRNA) gene. Yet PCR is also known to introduce multiple forms of bias in 16S rRNA studies. Here we present a paired modeling and experimental approach to characterize and mitigate PCR NPM-bias (PCR bias from non-primer-mismatch sources) in microbiota surveys. We use experimental data from mock bacterial communities to validate our approach and human gut microbiota samples to characterize PCR NPM-bias under real-world conditions. Our results suggest that PCR NPM-bias can skew estimates of microbial relative abundances by a factor of 4 or more, but that this bias can be mitigated using log-ratio linear models.
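
One way to picture a log-ratio linear model for PCR bias is sketched below. It rests on an assumption not spelled out in the abstract: that two taxa amplify with slightly different per-cycle efficiencies, so their abundance ratio is multiplied by a constant factor each cycle and its logarithm is linear in the cycle number. Fitting a line and reading off the intercept then estimates the ratio before amplification. The data are simulated and noise-free, and the function name is invented; this is not the authors' model or code.

```python
import math


def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


# Simulated example: true taxon A / taxon B ratio of 2.0, with A
# amplifying 5% more efficiently than B at every cycle (assumed values).
true_ratio = 2.0
per_cycle_bias = 1.05
cycles = [10, 20, 30]
observed_ratios = [true_ratio * per_cycle_bias ** c for c in cycles]

# Linear model on the log-ratio scale; the intercept is the cycle-0 value.
intercept, slope = fit_line(cycles, [math.log(r) for r in observed_ratios])
estimated_ratio = math.exp(intercept)
estimated_bias = math.exp(slope)
```

With these assumed values the observed ratio after 30 cycles is more than four times the true ratio, which is the scale of distortion the abstract reports, while the intercept recovers the starting ratio.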


Subjects
Bacteria/genetics ; Databases, Genetic/standards ; Gastrointestinal Microbiome/genetics ; Polymerase Chain Reaction/standards ; Bias ; DNA, Bacterial/genetics ; Humans
13.
Nat Microbiol ; 6(7): 946-959, 2021 07.
Article in English | MEDLINE | ID: mdl-34155373

ABSTRACT

The accrual of genomic data from both cultured and uncultured microorganisms provides new opportunities to develop systematic taxonomies based on evolutionary relationships. Previously, we established a bacterial taxonomy through the Genome Taxonomy Database. Here, we propose a standardized archaeal taxonomy that is derived from a 122-concatenated-protein phylogeny that resolves polyphyletic groups and normalizes ranks based on relative evolutionary divergence. The resulting archaeal taxonomy, which forms part of the Genome Taxonomy Database, is stable for a range of phylogenetic variables including marker gene selection, inference methods, corrections for rate heterogeneity and compositional bias, tree rooting scenarios and expansion of the genome database. Rank normalization is shown to robustly correct for substitution rates varying up to 30-fold using simulated datasets. Taxonomic curation follows the rules of the International Code of Nomenclature of Prokaryotes while taking into account proposals to formally recognize the rank of phylum and to use genome sequences as type material. This taxonomy is based on 2,392 archaeal genomes, 93.3% of which required one or more changes to their existing taxonomy, mainly owing to incomplete classification. We identify 16 archaeal phyla and reclassify 3 major monophyletic units from the former Euryarchaeota and one phylum that unites the Thaumarchaeota-Aigarchaeota-Crenarchaeota-Korarchaeota (TACK) superphylum into a single phylum.


Subjects
Archaea/classification ; Databases, Genetic ; Genome, Archaeal ; Archaea/genetics ; Databases, Genetic/standards ; Evolution, Molecular ; Genomics ; Phylogeny ; Reference Standards
15.
Cancer Biomark ; 30(4): 417-428, 2021.
Article in English | MEDLINE | ID: mdl-33492284

ABSTRACT

BACKGROUND: Invasive breast cancer is a highly heterogeneous tumor. Although many methods exist for predicting invasive breast cancer risk, their predictive performance is not satisfactory. There is an urgent need to develop a more accurate method to predict the prognosis of patients with invasive breast cancer. OBJECTIVE: To identify potential mRNAs and construct risk prediction models for invasive breast cancer based on bioinformatics. METHODS: In this study, we investigated the differences in mRNA expression profiles between invasive breast cancer and normal breast samples, and constructed a risk model for the prediction of prognosis of invasive breast cancer with univariate and multivariate Cox analyses. RESULTS: We constructed a risk model comprising 8 mRNAs (PAX7, ZIC2, APOA5, TP53AIP1, MYBPH, USP41, DACT2, and POU3F2) for the prediction of invasive breast cancer prognosis. We used the 8-mRNA risk prediction model to divide 1076 samples into high-risk and low-risk groups. The Kaplan-Meier curve showed that the high-risk group was closely associated with poor overall survival in patients with invasive breast cancer. The receiver operating characteristic curve revealed an area under the curve of 0.773 for the 8-mRNA model at 3-year overall survival, indicating that this model showed good specificity and sensitivity for prediction of prognosis of invasive breast cancer. CONCLUSIONS: The study provides an effective bioinformatic analysis for the better understanding of the molecular pathogenesis and prognosis risk assessment of invasive breast cancer.


Subjects
Breast Neoplasms/genetics ; Computational Biology/methods ; Databases, Genetic/standards ; Gene Expression Profiling/methods ; RNA, Messenger/genetics ; Breast Neoplasms/mortality ; Female ; Humans ; Middle Aged ; Prognosis ; Survival Analysis
16.
Brief Bioinform ; 22(1): 463-473, 2021 01 18.
Article in English | MEDLINE | ID: mdl-31885040

ABSTRACT

Small noncoding RNAs (sRNA/sncRNAs) are generated from different genomic loci and play important roles in biological processes, such as cell proliferation and the regulation of gene expression. Next-generation sequencing (NGS) has provided an unprecedented opportunity to discover and quantify diverse kinds of sncRNA, such as tRFs (tRNA-derived small RNA fragments), phasiRNAs (phased, secondary, small-interfering RNAs), Piwi-interacting RNAs (piRNAs) and plant-specific 24-nt short interfering RNAs (siRNAs). However, currently available web-based tools do not provide approaches to comprehensively analyze all of these diverse sncRNAs. This study presents a novel integrated platform, sRNAtools (https://bioinformatics.caf.ac.cn/sRNAtools), that can be used in conjunction with high-throughput sequencing to identify and functionally annotate sncRNAs, including profiling microRNAs, piRNAs, tRNAs, small nuclear RNAs, small nucleolar RNAs and rRNAs and discovering isomiRs, tRFs, phasiRNAs and plant-specific 24-nt siRNAs for up to 21 model organisms. Different modules, including single case, batch case, group case and target case, are developed to provide users with flexible ways of studying sncRNA. In addition, sRNAtools supports different ways of uploading small RNA sequencing data in an interactive queue system, while local versions based on the program package/Docker/VirtualBox are also available. We believe that sRNAtools will greatly benefit the scientific community as an integrated tool for studying sncRNAs.


Subjects
Genomics/methods ; High-Throughput Nucleotide Sequencing/methods ; RNA, Small Untranslated/genetics ; Software ; Animals ; Databases, Genetic/standards ; Humans ; RNA, Small Untranslated/chemistry
17.
Brief Bioinform ; 22(1): 288-297, 2021 01 18.
Article in English | MEDLINE | ID: mdl-31998941

ABSTRACT

Circular RNAs (circRNAs) are covalently closed RNA molecules that have been linked to various diseases, including cancer. However, a precise function and working mechanism are lacking for the large majority. As many different experimental and computational approaches to identify circRNAs emerged, multiple circRNA databases were developed as well. Unfortunately, there are several major issues with the current circRNA databases, which substantially hamper progress in the field. First, as the overlap in content is limited, a true reference set of circRNAs is lacking. This results from the low abundance and highly specific expression of circRNAs, and varying sequencing methods, data-analysis pipelines, and circRNA detection tools. A second major issue is the use of ambiguous nomenclature. Thus, redundant or even conflicting names for circRNAs across different databases contribute to the reproducibility crisis. Third, circRNA databases, in essence, rely on the position of the circRNA back-splice junction, whereas alternative splicing could result in circRNAs with different length and sequence. To uniquely identify a circRNA molecule, the full circular sequence is required. Fourth, circRNA databases annotate circRNAs' microRNA binding and protein-coding potential, but these annotations are generally based on presumed circRNA sequences. Finally, several databases are not regularly updated, contain incomplete data or suffer from connectivity issues. In this review, we present a comprehensive overview of the current circRNA databases and their content, features, and usability. In addition to discussing the current issues regarding circRNA databases, we offer suggestions to streamline further research in this growing field.


Subjects
Databases, Genetic/standards ; RNA, Circular/genetics ; Animals ; Databases, Genetic/trends ; Genomics/methods ; Humans ; RNA, Circular/chemistry
18.
Brief Bioinform ; 22(1): 545-556, 2021 01 18.
Article in English | MEDLINE | ID: mdl-32026945

ABSTRACT

MOTIVATION: Although gene set enrichment analysis has become an integral part of high-throughput gene expression data analysis, the assessment of enrichment methods remains rudimentary and ad hoc. In the absence of suitable gold standards, evaluations are commonly restricted to selected datasets and biological reasoning on the relevance of resulting enriched gene sets. RESULTS: We develop an extensible framework for reproducible benchmarking of enrichment methods based on defined criteria for applicability, gene set prioritization and detection of relevant processes. This framework incorporates a curated compendium of 75 expression datasets investigating 42 human diseases. The compendium features microarray and RNA-seq measurements, and each dataset is associated with a precompiled GO/KEGG relevance ranking for the corresponding disease under investigation. We perform a comprehensive assessment of 10 major enrichment methods, identifying significant differences in runtime and applicability to RNA-seq data, fraction of enriched gene sets depending on the null hypothesis tested and recovery of the predefined relevance rankings. We make practical recommendations on how methods originally developed for microarray data can efficiently be applied to RNA-seq data, how to interpret results depending on the type of gene set test conducted and which methods are best suited to effectively prioritize gene sets with high phenotype relevance. AVAILABILITY: http://bioconductor.org/packages/GSEABenchmarkeR. CONTACT: ludwig.geistlinger@sph.cuny.edu.


Subjects
Gene Expression Profiling/methods ; Genomics/methods ; RNA-Seq/methods ; Animals ; Benchmarking ; Databases, Genetic/standards ; Gene Expression Profiling/standards ; Genomics/standards ; Humans ; RNA-Seq/standards ; Software
19.
Nucleic Acids Res ; 49(D1): D743-D750, 2021 01 08.
Article in English | MEDLINE | ID: mdl-33221926

ABSTRACT

Metagenomics has become a standard strategy to comprehend the functional potential of microbial communities, including the human microbiome. Currently, the number of metagenomes in public repositories is increasing exponentially. The Sequence Read Archive (SRA) and MG-RAST are the two main repositories for metagenomic data. These databases allow scientists to reanalyze samples and explore new hypotheses. However, mining samples from them can be a limiting factor, since the metadata available in these repositories is often misannotated, misleading, and decentralized, creating an overly complex environment for sample reanalysis. The main goal of the HumanMetagenomeDB is to simplify the identification and use of public human metagenomes of interest. HumanMetagenomeDB version 1.0 contains metadata of 69 822 metagenomes. We standardized 203 attributes, based on standardized ontologies, describing host characteristics (e.g. sex, age and body mass index), diagnosis information (e.g. cancer, Crohn's disease and Parkinson's disease), location (e.g. country, longitude and latitude), sampling site (e.g. gut, lung and skin) and sequencing attributes (e.g. sequencing platform, average length and sequence quality). Further, HumanMetagenomeDB version 1.0 metagenomes encompass 58 countries, 9 main sample sites (i.e. body parts), 58 diagnoses and multiple ages, ranging from newborn to 91 years old. The HumanMetagenomeDB is publicly available at https://webapp.ufz.de/hmgdb/.


Subjects
Data Curation ; Databases, Genetic/standards ; Metadata/standards ; Metagenome ; Humans ; Metagenomics ; Reference Standards ; User-Computer Interface
20.
mSphere ; 5(6)2020 11 04.
Article in English | MEDLINE | ID: mdl-33148820

ABSTRACT

Continued influx of metagenome-derived proteins with misannotated taxonomy into conventional databases, including RefSeq, threatens to eliminate the value of taxonomy identifiers. To prevent this, urgent efforts should be undertaken by submitters of metagenomic data sets as well as by database managers.


Subjects
Databases, Genetic/standards ; Metagenome ; Proteins/genetics ; Algorithms ; Databases, Genetic/statistics & numerical data ; Metagenomics/methods ; Metagenomics/standards