1.
Nucleic Acids Res ; 52(W1): W45-W53, 2024 Jul 05.
Article in English | MEDLINE | ID: mdl-38749504

ABSTRACT

ChIP-Atlas (https://chip-atlas.org/) presents a suite of data-mining tools for analyzing epigenomic landscapes, powered by the comprehensive integration of over 376 000 public ChIP-seq, ATAC-seq, DNase-seq and Bisulfite-seq experiments from six representative model organisms. To unravel the intricacies of chromatin architecture that mediates the regulome-initiated generation of transcriptional and phenotypic diversity within cells, we report ChIP-Atlas 3.0 that enhances clarity by incorporating additional tracks for genomic and epigenomic features within a newly consolidated 'annotation track' section. The tracks include chromosomal conformation (Hi-C and eQTL datasets), transcriptional regulatory elements (ChromHMM and FANTOM5 enhancers), and genomic variants associated with diseases and phenotypes (GWAS SNPs and ClinVar variants). These annotation tracks are easily accessible alongside other experimental tracks, facilitating better elucidation of chromatin architecture underlying the diversification of transcriptional and phenotypic traits. Furthermore, 'Diff Analysis,' a new online tool, compares the query epigenome data to identify differentially bound, accessible, and methylated regions using ChIP-seq, ATAC-seq and DNase-seq, and Bisulfite-seq datasets, respectively. The integration of annotation tracks and the Diff Analysis tool, coupled with continuous data expansion, renders ChIP-Atlas 3.0 a robust resource for mining the landscape of transcriptional regulatory mechanisms, thereby offering valuable perspectives, particularly for genetic disease research and drug discovery.


Subjects
Chromatin Immunoprecipitation Sequencing , Data Mining , Software , Humans , Data Mining/methods , Chromatin Immunoprecipitation Sequencing/methods , Animals , Chromatin/genetics , Chromatin/metabolism , Chromosomes/genetics , Epigenomics/methods , Polymorphism, Single Nucleotide , Mice , Quantitative Trait Loci , Molecular Sequence Annotation , Regulatory Elements, Transcriptional/genetics , Genomics/methods
2.
Nucleic Acids Res ; 50(W1): W175-W182, 2022 07 05.
Article in English | MEDLINE | ID: mdl-35325188

ABSTRACT

ChIP-Atlas (https://chip-atlas.org) is a web service providing both GUI- and API-based data-mining tools to reveal the architecture of the transcription regulatory landscape. ChIP-Atlas is powered by comprehensively integrating all data sets from high-throughput ChIP-seq and DNase-seq, a method for profiling chromatin regions accessible to DNase. In this update, we further collected all the ATAC-seq and whole-genome bisulfite-seq data for six model organisms (human, mouse, rat, fruit fly, nematode, and budding yeast) with the latest genome assemblies. These together with ChIP-seq data can be visualized with the Peak Browser tool and a genome browser to explore the epigenomic landscape of a query genomic locus, such as its chromatin accessibility, DNA methylation status, and protein-genome interactions. This epigenomic landscape can also be characterized for multiple genes and genomic loci by querying with the Enrichment Analysis tool, which, for example, revealed that inflammatory bowel disease-associated SNPs are the most significantly hypo-methylated in neutrophils. Therefore, ChIP-Atlas provides a panoramic view of the whole epigenomic landscape. All datasets are free to download via either a simple button on the web page or an API.


Subjects
Chromatin Immunoprecipitation Sequencing , Epigenomics , Animals , Humans , Data Mining , Epigenomics/methods , High-Throughput Nucleotide Sequencing/methods , Sequence Analysis, DNA/methods , Models, Animal , Atlases as Topic , Databases as Topic
3.
Allergol Int ; 73(2): 255-263, 2024 Apr.
Article in English | MEDLINE | ID: mdl-38102028

ABSTRACT

BACKGROUND: In clinical research on multifactorial diseases such as atopic dermatitis, data-driven medical research has become more widely used as means to clarify diverse pathological conditions and to realize precision medicine. However, modern clinical data, characterized as large-scale, multimodal, and multi-center, causes difficulties in data integration and management, which limits productivity in clinical data science. METHODS: We designed a generic data management flow to collect, cleanse, and integrate data to handle different types of data generated at multiple institutions by 10 types of clinical studies. We developed MeDIA (Medical Data Integration Assistant), a software to browse the data in an integrated manner and extract subsets for analysis. RESULTS: MeDIA integrates and visualizes data and information on research participants obtained from multiple studies. It then provides a sophisticated interface that supports data management and helps data scientists retrieve the data sets they need. Furthermore, the system promotes the use of unified terms such as identifiers or sampling dates to reduce the cost of pre-processing by data analysts. We also propose best practices in clinical data management flow, which we learned from the development and implementation of MeDIA. CONCLUSIONS: The MeDIA system solves the problem of multimodal clinical data integration, from complex text data such as medical records to big data such as omics data from a large number of patients. The system and the proposed best practices can be applied not only to allergic diseases but also to other diseases to promote data-driven medical research.


Subjects
Biomedical Research , Dermatitis, Atopic , Humans , Dermatitis, Atopic/diagnosis , Dermatitis, Atopic/therapy , Data Management , Precision Medicine
4.
Bioinformatics ; 38(17): 4194-4199, 2022 09 02.
Article in English | MEDLINE | ID: mdl-35801937

ABSTRACT

MOTIVATION: Understanding life cannot be accomplished without making full use of biological data, which are scattered across databases of diverse categories in life sciences. To connect such data seamlessly, identifier (ID) conversion plays a key role. However, existing ID conversion services have disadvantages, such as covering only a limited range of biological categories of databases, not keeping up with the updates of the original databases and outputs being hard to interpret in the context of biological relations, especially when converting IDs in multiple steps. RESULTS: TogoID is an ID conversion service implementing unique features with an intuitive web interface and an application programming interface (API) for programmatic access. TogoID currently supports 65 datasets covering various biological categories. TogoID users can perform exploratory multistep conversions to find a path among IDs. To guide the interpretation of biological meanings in the conversions, we crafted an ontology that defines the semantics of the dataset relations. AVAILABILITY AND IMPLEMENTATION: The TogoID service is freely available on the TogoID website (https://togoid.dbcls.jp/) and the API is also provided to allow programmatic access. To encourage developers to add new dataset pairs, the system stores the configurations of pairs at the GitHub repository (https://github.com/togoid/togoid-config) and accepts the request of additional pairs. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
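
The exploratory multistep conversion described in this abstract can be pictured as a shortest-path search over a graph whose nodes are datasets and whose edges are supported dataset pairs. The following is a minimal sketch under that assumption; the pair list, the dataset labels, and the function name `find_route` are illustrative and are not taken from TogoID itself.

```python
from collections import deque

# Hypothetical dataset pairs for illustration only; the real pair
# configurations are kept in the togoid-config repository.
PAIRS = [
    ("ncbigene", "ensembl_gene"),
    ("ensembl_gene", "ensembl_transcript"),
    ("ensembl_transcript", "uniprot"),
    ("uniprot", "pdb"),
]


def find_route(source, target, pairs=PAIRS):
    """Return the shortest chain of datasets linking source to target, or None."""
    graph = {}
    for a, b in pairs:
        # Assumption: every pair can be traversed in both directions.
        graph.setdefault(a, []).append(b)
        graph.setdefault(b, []).append(a)

    queue = deque([[source]])
    seen = {source}
    while queue:
        path = queue.popleft()
        if path[-1] == target:
            return path
        for neighbour in graph.get(path[-1], []):
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append(path + [neighbour])
    return None
```

Breadth-first search is used here because it returns a route with the fewest conversion steps, which keeps each intermediate relation easy to interpret.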


Subjects
Data Management , Software , Databases, Factual
5.
EMBO Rep ; 19(12), 2018 12.
Article in English | MEDLINE | ID: mdl-30413482

ABSTRACT

We have fully integrated public chromatin immunoprecipitation sequencing (ChIP-seq) and DNase-seq data (n > 70,000) derived from six representative model organisms (human, mouse, rat, fruit fly, nematode, and budding yeast), and have devised a data-mining platform designated ChIP-Atlas (http://chip-atlas.org). ChIP-Atlas is able to show alignment and peak-call results for all public ChIP-seq and DNase-seq data archived in the NCBI Sequence Read Archive (SRA), which encompasses data derived from GEO, ArrayExpress, DDBJ, ENCODE, Roadmap Epigenomics, and the scientific literature. All peak-call data are integrated to visualize multiple histone modifications and binding sites of transcriptional regulators (TRs) at given genomic loci. The integrated data can be further analyzed to show TR-gene and TR-TR interactions, as well as to examine enrichment of protein binding for given multiple genomic coordinates or gene names. ChIP-Atlas is superior to other platforms in terms of data number and functionality for data mining across thousands of ChIP-seq experiments, and it provides insight into gene regulatory networks and epigenetic mechanisms.


Subjects
Chromatin Immunoprecipitation , Data Mining , Sequence Analysis, DNA , Animals , Enhancer Elements, Genetic/genetics , Genetic Loci , Humans , Internet , Transcription Factors/metabolism
6.
J Plant Res ; 131(4): 709-717, 2018 Jul.
Article in English | MEDLINE | ID: mdl-29460198

ABSTRACT

Recent studies have shown that environmental DNA is found almost everywhere. Flower petal surfaces are an attractive tissue to use for investigation of the dispersal of environmental DNA in nature as they are isolated from the external environment until the bud opens and only then can the petal surface accumulate environmental DNA. Here, we performed a crowdsourced experiment, the "Ohanami Project", to obtain environmental DNA samples from petal surfaces of Cerasus × yedoensis 'Somei-yoshino' across the Japanese archipelago during spring 2015. C. × yedoensis is the most popular garden cherry species in Japan and clones of this cultivar bloom simultaneously every spring. Data collection spanned almost every prefecture and totaled 577 DNA samples from 149 collaborators. Preliminary amplicon-sequencing analysis showed the rapid attachment of environmental DNA onto the petal surfaces. Notably, we found DNA of other common plant species in samples obtained from a wide distribution; this DNA likely originated from the pollen of the Japanese cedar. Our analysis supports our belief that petal surfaces after blossoming are a promising target to reveal the dynamics of environmental DNA in nature. The success of our experiment also shows that crowdsourced environmental DNA analyses have considerable value in ecological studies.


Subjects
DNA, Plant/genetics , DNA/genetics , Environment , Flowers/genetics , Prunus/genetics , Chloroplasts/genetics , Cyanobacteria/genetics , Flowers/microbiology , Japan , Proteobacteria/genetics , Prunus/microbiology , Sequence Alignment , Sequence Analysis, DNA
7.
Nucleic Acids Res ; 44(11): 5010-21, 2016 06 20.
Article in English | MEDLINE | ID: mdl-27131787

ABSTRACT

Predicting responsible transcription regulators on the basis of transcriptome data is one of the most promising computational approaches to understanding cellular processes and characteristics. Here, we present a novel method employing vast amounts of chromatin immunoprecipitation (ChIP) experimental data to address this issue. Global high-throughput ChIP data was collected to construct a comprehensive database, containing 8 578 738 binding interactions of 454 transcription regulators. To incorporate information about heterogeneous frequencies of transcription factor (TF)-binding events, we developed a flexible framework for gene set analysis employing the weighted t-test procedure, namely weighted parametric gene set analysis (wPGSA). Using transcriptome data as an input, wPGSA predicts the activities of transcription regulators responsible for observed gene expression. Validation of wPGSA with published transcriptome data, including that from over-expressed TFs, showed that the method can predict activities of various TFs, regardless of cell type and conditions, with results totally consistent with biological observations. We also applied wPGSA to other published transcriptome data and identified potential key regulators of cell reprogramming and influenza virus pathogenesis, generating compelling hypotheses regarding underlying regulatory mechanisms. This flexible framework will contribute to uncovering the dynamic and robust architectures of biological regulation, by incorporating high-throughput experimental data in the form of weights.
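
The abstract does not give the wPGSA formula, so the following is only a generic sketch of what a weighted one-sample t-statistic can look like. It assumes each gene contributes an expression change (`values`) and a non-negative weight (`weights`, for example a binding frequency), and it uses the Kish effective sample size in place of the gene count. The function name and this particular variance correction are assumptions for illustration, not the published method.

```python
import math


def weighted_t(values, weights, reference_mean=0.0):
    """Weighted one-sample t-statistic (illustrative, not the wPGSA definition)."""
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must sum to a positive number")

    # Weighted mean and (biased) weighted variance.
    mean = sum(w * v for v, w in zip(values, weights)) / total
    var = sum(w * (v - mean) ** 2 for v, w in zip(values, weights)) / total

    # Kish effective sample size: equals len(values) when all weights are equal.
    n_eff = total ** 2 / sum(w * w for w in weights)
    if n_eff <= 1 or var == 0:
        raise ValueError("not enough weighted variation to compute a statistic")

    # Bessel-style correction using the effective sample size.
    var_unbiased = var * n_eff / (n_eff - 1)
    return (mean - reference_mean) / math.sqrt(var_unbiased / n_eff)
```

With equal weights this reduces to the ordinary one-sample t-statistic, and a gene with weight zero has no influence on the result.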


Subjects
Binding Sites , Chromatin Immunoprecipitation , Computational Biology/methods , High-Throughput Nucleotide Sequencing , Transcription Factors/metabolism , Transcriptome , Algorithms , Animals , Cluster Analysis , Databases, Genetic , Humans , Mice , Protein Binding , Reproducibility of Results
8.
Biosci Microbiota Food Health ; 43(4): 336-341, 2024.
Article in English | MEDLINE | ID: mdl-39364125

ABSTRACT

Depression is a prevalent mental health disorder, and its incidence has increased further because of the coronavirus disease 2019 (COVID-19) pandemic. The gut microbiome has been suggested as a potential target for mental health treatment because of the bidirectional communication system between the brain and gastrointestinal tract, known as the gut-brain axis. We aimed to investigate the relationship between the human gut microbiome and depression screening by analyzing the abundance and types of microbiomes among individuals living in Japan, where mental health awareness and support may differ from those in other countries owing to cultural factors. We used a data-driven approach to evaluate the gut microbiome of participants who underwent commercial gut microbiota testing services and completed a questionnaire survey that included a test for scoring depressive tendencies. Our data analysis results indicated that no significant differences in gut microbiome composition were found among the groups based on their depression screening scores. However, the results also indicated the potential existence of a few differentially abundant bacterial taxa. Specifically, the detected bacterial changes in abundance suggest that the Bifidobacteriaceae, Streptococcaceae, and Veillonellaceae families are candidates for differentially abundant bacteria. Our findings should contribute to the growing body of research on the relationship between gut microbiome and mental health, highlighting the potential of microbiome-based interventions for depression treatment. The limitations of this study include the lack of clear medical information on the participants' diagnoses. Future research could benefit from a larger sample size and more detailed clinical information.

9.
PLoS One ; 19(9): e0309210, 2024.
Article in English | MEDLINE | ID: mdl-39255315

ABSTRACT

Recording the provenance of scientific computation results is key to the support of traceability, reproducibility and quality assessment of data products. Several data models have been explored to address this need, providing representations of workflow plans and their executions as well as means of packaging the resulting information for archiving and sharing. However, existing approaches tend to lack interoperable adoption across workflow management systems. In this work we present Workflow Run RO-Crate, an extension of RO-Crate (Research Object Crate) and Schema.org to capture the provenance of the execution of computational workflows at different levels of granularity and bundle together all their associated objects (inputs, outputs, code, etc.). The model is supported by a diverse, open community that runs regular meetings, discussing development, maintenance and adoption aspects. Workflow Run RO-Crate is already implemented by several workflow management systems, allowing interoperable comparisons between workflow runs from heterogeneous systems. We describe the model, its alignment to standards such as W3C PROV, and its implementation in six workflow systems. Finally, we illustrate the application of Workflow Run RO-Crate in two use cases of machine learning in the digital image analysis domain.


Subjects
Workflow , Software , Machine Learning , Reproducibility of Results
10.
Methods Mol Biol ; 2632: 15-30, 2023.
Article in English | MEDLINE | ID: mdl-36781718

ABSTRACT

Galaxy is a web browser-based data analysis platform that is widely used in biology. Public Galaxy instances allow the analysis of data and interpretation of results without requiring software installation. NanoGalaxy is a public Galaxy instance with tools and workflows for nanopore data analysis. This chapter describes the steps involved in performing genome assembly using short and long reads in NanoGalaxy.


Subjects
Nanopores , Software , Web Browser , Workflow , High-Throughput Nucleotide Sequencing/methods , Sequence Analysis, DNA/methods
11.
Gigascience ; 12, 2022 12 28.
Article in English | MEDLINE | ID: mdl-37150537

ABSTRACT

BACKGROUND: Reproducibility of data analysis workflow is a key issue in the field of bioinformatics. Recent computing technologies, such as virtualization, have made it possible to reproduce workflow execution with ease. However, the reproducibility of results is not well discussed; that is, there is no standard way to verify whether the biological interpretation of reproduced results is the same. Therefore, it still remains a challenge to automatically evaluate the reproducibility of results. RESULTS: We propose a new metric, a reproducibility scale of workflow execution results, to evaluate the reproducibility of results. This metric is based on the idea of evaluating the reproducibility of results using biological feature values (e.g., number of reads, mapping rate, and variant frequency) representing their biological interpretation. We also implemented a prototype system that automatically evaluates the reproducibility of results using the proposed metric. To demonstrate our approach, we conducted an experiment using workflows used by researchers in real research projects and the use cases that are frequently encountered in the field of bioinformatics. CONCLUSIONS: Our approach enables automatic evaluation of the reproducibility of results using a fine-grained scale. By introducing our approach, it is possible to evolve from a binary view of whether the results are superficially identical or not to a more graduated view. We believe that our approach will contribute to more informed discussion on reproducibility in bioinformatics.
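
The idea of grading reproducibility by comparing biological feature values, rather than checking whether output files are byte-identical, might be sketched as follows. The level names, the 5% relative tolerance, and the function name are assumptions chosen for illustration; the abstract does not specify the actual scale.

```python
def reproducibility_level(original, reproduced, tolerance=0.05):
    """Grade a re-run by comparing numeric feature values (illustrative scale).

    Both arguments map feature names (e.g. "mapping_rate") to numbers.
    """
    if original == reproduced:
        return "identical"
    if original.keys() != reproduced.keys():
        return "different"

    for name, ref in original.items():
        new = reproduced[name]
        scale = max(abs(ref), abs(new))
        # Relative difference; when both values are zero they already agree.
        if scale and abs(ref - new) / scale > tolerance:
            return "different"
    return "acceptable"
```

A three-level result like this is the simplest step away from a binary same-or-not check; a finer scale would add more levels or per-feature tolerances.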


Subjects
Computational Biology , Research Personnel , Humans , Workflow , Reproducibility of Results , Computational Biology/methods , Software
12.
Gigascience ; 12, 2022 12 28.
Article in English | MEDLINE | ID: mdl-36810800

ABSTRACT

BACKGROUND: Many open-source workflow systems have made bioinformatics data analysis procedures portable. Sharing these workflows provides researchers easy access to high-quality analysis methods without the requirement of computational expertise. However, published workflows are not always guaranteed to be reliably reusable. Therefore, a system is needed to lower the cost of sharing workflows in a reusable form. RESULTS: We introduce Yevis, a system to build a workflow registry that automatically validates and tests workflows to be published. The validation and test are based on the requirements we defined for a workflow being reusable with confidence. Yevis runs on GitHub and Zenodo and allows workflow hosting without the need of dedicated computing resources. A Yevis registry accepts workflow registration via a GitHub pull request, followed by an automatic validation and test process for the submitted workflow. As a proof of concept, we built a registry using Yevis to host workflows from a community to demonstrate how a workflow can be shared while fulfilling the defined requirements. CONCLUSIONS: Yevis helps in the building of a workflow registry to share reusable workflows without requiring extensive human resources. By following Yevis's workflow-sharing procedure, one can operate a registry while satisfying the reusable workflow criteria. This system is particularly useful to individuals or communities that want to share workflows but lack the specific technical expertise to build and maintain a workflow registry from scratch.


Subjects
Metadata , Software , Humans , Workflow , Computational Biology/methods
13.
F1000Res ; 11: 889, 2022.
Article in English | MEDLINE | ID: mdl-39070189

ABSTRACT

The increased demand for efficient computation in data analysis encourages researchers in biomedical science to use workflow systems. Workflow systems, or so-called workflow languages, are used for the description and execution of a set of data analysis steps. Workflow systems increase the productivity of researchers, specifically in fields that use high-throughput DNA sequencing applications, where scalable computation is required. As systems have improved the portability of data analysis workflows, research communities are able to share workflows to reduce the cost of building ordinary analysis procedures. However, having multiple workflow systems in a research field has resulted in the distribution of efforts across different workflow system communities. As each workflow system has its unique characteristics, it is not feasible to learn every single system in order to use publicly shared workflows. Thus, we developed Sapporo, an application to provide a unified layer of workflow execution upon the differences of various workflow systems. Sapporo has two components: an application programming interface (API) that receives the request of a workflow run and a browser-based client for the API. The API follows the Workflow Execution Service API standard proposed by the Global Alliance for Genomics and Health. The current implementation supports the execution of workflows in four languages: Common Workflow Language, Workflow Description Language, Snakemake, and Nextflow. With its extensible and scalable design, Sapporo can support the research community in utilizing valuable resources for data analysis.


Subjects
Computational Biology , Software , Workflow , Computational Biology/methods , Programming Languages
14.
F1000Res ; 11: 1077, 2022.
Article in English | MEDLINE | ID: mdl-36262334

ABSTRACT

The taxon Elasmobranchii (sharks and rays) contains one of the long-established evolutionary lineages of vertebrates with a tantalizing collection of species occupying critical aquatic habitats. To overcome the current limitation in molecular resources, we launched the Squalomix Consortium in 2020 to promote a genome-wide array of molecular approaches, specifically targeting shark and ray species. Among the various bottlenecks in working with elasmobranchs are their elusiveness and low fecundity as well as the large and highly repetitive genomes. Their peculiar body fluid composition has also hindered the establishment of methods to perform routine cell culturing required for their karyotyping. In the Squalomix consortium, these obstacles are expected to be solved through a combination of in-house cytological techniques including karyotyping of cultured cells, chromatin preparation for Hi-C data acquisition, and high fidelity long-read sequencing. The resources and products obtained in this consortium, including genome and transcriptome sequences, a genome browser powered by JBrowse2 to visualize sequence alignments, and comprehensive matrices of gene expression profiles for selected species are accessible through https://github.com/Squalomix/info.


Subjects
Sharks , Animals , Sharks/genetics , Genome , Vertebrates , Chromatin , Information Dissemination
15.
Nat Genet ; 52(12): 1346-1354, 2020 12.
Article in English | MEDLINE | ID: mdl-33257898

ABSTRACT

Poor trans-ancestry portability of polygenic risk scores is a consequence of Eurocentric genetic studies and limited knowledge of shared causal variants. Leveraging regulatory annotations may improve portability by prioritizing functional over tagging variants. We constructed a resource of 707 cell-type-specific IMPACT regulatory annotations by aggregating 5,345 epigenetic datasets to predict binding patterns of 142 transcription factors across 245 cell types. We then partitioned the common SNP heritability of 111 genome-wide association study summary statistics of European (average n ≈ 189,000) and East Asian (average n ≈ 157,000) origin. IMPACT annotations captured consistent SNP heritability between populations, suggesting prioritization of shared functional variants. Variant prioritization using IMPACT resulted in increased trans-ancestry portability of polygenic risk scores from Europeans to East Asians across all 21 phenotypes analyzed (49.9% mean relative increase in R²). Our study identifies a crucial role for functional annotations such as IMPACT to improve the trans-ancestry portability of genetic data.


Subjects
Asian People/genetics , Enhancer Elements, Genetic/genetics , Genetic Predisposition to Disease/genetics , Polymorphism, Single Nucleotide/genetics , White People/genetics , Base Sequence , Computational Biology/methods , Gene Expression Regulation/genetics , Genome-Wide Association Study , Humans , Models, Genetic , Molecular Sequence Annotation , Multifactorial Inheritance/genetics
16.
Gigascience ; 8(4), 2019 04 01.
Article in English | MEDLINE | ID: mdl-31222199

ABSTRACT

BACKGROUND: Container virtualization technologies such as Docker are popular in the bioinformatics domain because they improve the portability and reproducibility of software deployment. Along with software packaged in containers, the standardized workflow descriptors Common Workflow Language (CWL) enable data to be easily analyzed on multiple computing environments. These technologies accelerate the use of on-demand cloud computing platforms, which can be scaled according to the quantity of data. However, to optimize the time and budgetary restraints of cloud usage, users must select a suitable instance type that corresponds to the resource requirements of their workflows. RESULTS: We developed CWL-metrics, a utility tool for cwltool (the reference implementation of CWL), to collect runtime metrics of Docker containers and workflow metadata to analyze workflow resource requirements. To demonstrate the use of this tool, we analyzed 7 transcriptome quantification workflows on 6 instance types. The results revealed that choice of instance type can deliver lower financial costs and faster execution times using the required amount of computational resources. CONCLUSIONS: CWL-metrics can generate a summary of resource requirements for workflow executions, which can help users to optimize their use of cloud computing by selecting appropriate instances. The runtime metrics data generated by CWL-metrics can also help users to share workflows between different workflow management frameworks.
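
Once runtime metrics such as peak memory are available for a workflow, picking an instance type can be reduced to a small filtering step. The sketch below assumes a made-up instance catalogue and a 20% memory headroom; the instance names, prices, and the function `cheapest_fit` are hypothetical and do not come from CWL-metrics or from any cloud provider.

```python
# Hypothetical instance catalogue for illustration only.
INSTANCES = [
    {"name": "small", "vcpus": 2, "memory_gb": 8, "usd_per_hour": 0.10},
    {"name": "medium", "vcpus": 4, "memory_gb": 16, "usd_per_hour": 0.20},
    {"name": "large", "vcpus": 8, "memory_gb": 64, "usd_per_hour": 0.50},
]


def cheapest_fit(peak_memory_gb, vcpus_needed, instances=INSTANCES, headroom=1.2):
    """Return the name of the cheapest instance that satisfies the measured needs."""
    candidates = [
        inst
        for inst in instances
        # Keep some memory headroom above the observed peak.
        if inst["memory_gb"] >= peak_memory_gb * headroom
        and inst["vcpus"] >= vcpus_needed
    ]
    if not candidates:
        return None
    return min(candidates, key=lambda inst: inst["usd_per_hour"])["name"]
```

Hourly price alone ignores run time, so a fuller comparison would multiply each price by the measured execution time on that instance type.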


Subjects
Cloud Computing , Computational Biology/methods , Genomics/methods , Software , High-Throughput Nucleotide Sequencing , Workflow
17.
Nat Commun ; 10(1): 4719, 2019 10 17.
Article in English | MEDLINE | ID: mdl-31624269

ABSTRACT

Mosaic loss of chromosome Y (mLOY) is frequently observed in the leukocytes of ageing men. However, the genetic architecture and biological mechanisms underlying mLOY are not fully understood. In a cohort of 95,380 Japanese men, we identify 50 independent genetic markers in 46 loci associated with mLOY at a genome-wide significant level, 35 of which are unreported. Lead markers overlap enhancer marks in hematopoietic stem cells (HSCs, P ≤ 1.0 × 10⁻⁶). mLOY genome-wide association study signals exhibit polygenic architecture and demonstrate strong heritability enrichment in regions surrounding genes specifically expressed in multipotent progenitor (MPP) cells and HSCs (P ≤ 3.5 × 10⁻⁶). ChIP-seq data demonstrate that binding sites of FLI1, a fate-determining factor promoting HSC differentiation into platelets rather than red blood cells (RBCs), show a strong heritability enrichment (P = 1.5 × 10⁻⁶). Consistent with these findings, platelet and RBC counts are positively and negatively associated with mLOY, respectively. Collectively, our observations improve our understanding of the mechanisms underlying mLOY.


Subjects
Cell Differentiation/genetics , Chromosome Deletion , Chromosomes, Human, Y/genetics , Genome-Wide Association Study/methods , Hematopoietic Stem Cells/metabolism , Aged , Aged, 80 and over , Asian People/genetics , Blood Platelets/cytology , Blood Platelets/metabolism , Cohort Studies , Erythrocytes/cytology , Erythrocytes/metabolism , Genetic Predisposition to Disease/ethnology , Genetic Predisposition to Disease/genetics , Genotype , Hematopoietic Stem Cells/cytology , Humans , Japan , Male , Mosaicism , Polymorphism, Single Nucleotide
18.
Gigascience ; 6(6): 1-8, 2017 06 01.
Article in English | MEDLINE | ID: mdl-28449062

ABSTRACT

It is important for public data repositories to promote the reuse of archived data. In the growing field of omics science, however, the increasing number of submissions of high-throughput sequencing (HTSeq) data to public repositories prevents users from choosing a suitable data set from among the large number of search results. Repository users need to be able to set a threshold to reduce the number of results to obtain a suitable subset of high-quality data for reanalysis. We calculated the quality of sequencing data archived in a public data repository, the Sequence Read Archive (SRA), by using the quality control software FastQC. We obtained quality values for 1 171 313 experiments, which can be used to evaluate the suitability of data for reuse. We also visualized the data distribution in SRA by integrating the quality information and metadata of experiments and samples. We provide quality information of all of the archived sequencing data, which enable users to obtain sufficient quality sequencing data for reanalyses. The calculated quality data are available to the public in various formats. Our data also provide an example of enhancing the reuse of public data by adding metadata to published research data by a third party.


Subjects
Databases, Nucleic Acid , High-Throughput Nucleotide Sequencing/methods , Sequence Analysis, DNA/methods , Humans , Internet , Software , User-Computer Interface
19.
PLoS One ; 12(2): e0172269, 2017.
Article in English | MEDLINE | ID: mdl-28234924

ABSTRACT

With the rapid advances in next-generation sequencing (NGS), datasets for DNA polymorphisms among various species and strains have been produced, stored, and distributed. However, reliability varies among these datasets because the experimental and analytical conditions used differ among assays. Furthermore, such datasets have been frequently distributed from the websites of individual sequencing projects. It is desirable to integrate DNA polymorphism data into one database featuring uniform quality control that is distributed from a single platform at a single place. DNA polymorphism annotation database (DNApod; http://tga.nig.ac.jp/dnapod/) is an integrated database that stores genome-wide DNA polymorphism datasets acquired under uniform analytical conditions, and this includes uniformity in the quality of the raw data, the reference genome version, and evaluation algorithms. DNApod genotypic data are re-analyzed whole-genome shotgun datasets extracted from sequence read archives, and DNApod distributes genome-wide DNA polymorphism datasets and known-gene annotations for each DNA polymorphism. This new database was developed for storing genome-wide DNA polymorphism datasets of plants, with crops being the first priority. Here, we describe our analyzed data for 679, 404, and 66 strains of rice, maize, and sorghum, respectively. The analytical methods are available as a DNApod workflow in an NGS annotation system of the DNA Data Bank of Japan and a virtual machine image. Furthermore, DNApod provides tables of links of identifiers between DNApod genotypic data and public phenotypic data. To advance the sharing of organism knowledge, DNApod offers basic and ubiquitous functions for multiple alignment and phylogenetic tree construction by using orthologous gene information.


Subjects
DNA/genetics , Databases, Nucleic Acid , High-Throughput Nucleotide Sequencing/methods , Polymorphism, Genetic , Crops, Agricultural/genetics , DNA, Plant , Genes, Plant , Homozygote , Molecular Sequence Annotation , Oryza/genetics , Phenotype , Phylogeny , Reference Values , Reproducibility of Results , Software , Sorghum/genetics , Zea mays/genetics
20.
PLoS One ; 8(10): e77910, 2013.
Article in English | MEDLINE | ID: mdl-24167589

ABSTRACT

High-throughput sequencing technology, also called next-generation sequencing (NGS), has the potential to revolutionize the whole process of genome sequencing, transcriptomics, and epigenetics. Sequencing data is captured in a public primary data archive, the Sequence Read Archive (SRA). As of January 2013, data from more than 14,000 projects have been submitted to SRA, which is double that of the previous year. Researchers can download raw sequence data from the SRA website to perform further analyses and to compare with their own data. However, it is extremely difficult to search entries and download raw sequences of interest with SRA because the data structure is complicated, and experimental conditions along with raw sequences are partly described in natural language. Additionally, some sequences are of inconsistent quality because anyone can submit sequencing data to SRA with no quality check. Therefore, as a criterion of data quality, we focused on SRA entries that were cited in journal articles. We extracted SRA IDs and PubMed IDs (PMIDs) from SRA and full-text versions of journal articles and retrieved 2748 SRA ID-PMID pairs. We constructed a publication list referring to SRA entries. Since one of the main themes of -omics analyses is clarification of disease mechanisms, we also characterized SRA entries by disease keywords, according to the Medical Subject Headings (MeSH) extracted from articles assigned to each SRA entry. We obtained 989 SRA ID-MeSH disease term pairs, and constructed a disease list referring to SRA data. We previously developed feature profiles of diseases in a system called "Gendoo". We generated hyperlinks between the diseases extracted from SRA and their feature profiles. The developed project, publication and disease lists resulting from this study are available at our web service, called "DBCLS SRA" (http://sra.dbcls.jp/). This service will improve accessibility to high-quality data from SRA.
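
Extracting SRA IDs from article full text, as described above, is commonly done with a regular expression over the accession prefixes. The sketch below assumes the usual accession shape (a source letter S, E, or D, then R, then a type letter, then at least six digits); the function name and the sample PMID are hypothetical, and the abstract does not say how the authors implemented their extraction.

```python
import re

# Assumed accession shape: [SED]R + type letter (A, P, R, S, X, Z) + 6 or more digits,
# e.g. SRP012345 (study), SRX000001 (experiment), DRR000001 (run).
SRA_ID = re.compile(r"\b[SED]R[APRSXZ]\d{6,}\b")


def extract_pairs(pmid, full_text):
    """Return sorted, de-duplicated (SRA ID, PMID) pairs found in one article."""
    return sorted({(accession, pmid) for accession in SRA_ID.findall(full_text)})
```

For example, calling `extract_pairs` on a sentence that mentions the same accession twice yields that accession once, paired with the article's PMID.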


Subjects
Data Mining/methods , Databases, Genetic , Medical Subject Headings , PubMed , Search Engine/methods , Animals , Humans