Results 1 - 20 of 54
1.
Bioengineering (Basel) ; 11(3)2024 Mar 08.
Article in English | MEDLINE | ID: mdl-38534537

ABSTRACT

As available genomic interval data increase in scale, we require fast systems to search them. A common approach is simple string matching to compare a search term to metadata, but this is limited by incomplete or inaccurate annotations. An alternative is to compare data directly through genomic region overlap analysis, but this approach leads to challenges like sparsity, high dimensionality, and computational expense. We require novel methods to quickly and flexibly query large, messy genomic interval databases. Here, we develop a genomic interval search system using representation learning. We train numerical embeddings for a collection of region sets simultaneously with their metadata labels, capturing similarity between region sets and their metadata in a low-dimensional space. Using these learned co-embeddings, we develop a system that solves three related information retrieval tasks using embedding distance computations: retrieving region sets related to a user query string, suggesting new labels for database region sets, and retrieving database region sets similar to a query region set. We evaluate these use cases and show that jointly learned representations of region sets and metadata are a promising approach for fast, flexible, and accurate genomic region information retrieval.
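
The three retrieval tasks described in this abstract all reduce to nearest-neighbor lookups in one shared vector space. The following is a minimal sketch of that idea, not the authors' implementation: the vectors, the names `region_sets` and `labels`, and the choice of cosine similarity are all assumptions made for illustration.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def nearest(query_vec, candidates, k=1):
    """Return the names of the k candidates most similar to the query."""
    ranked = sorted(
        candidates,
        key=lambda name: cosine_similarity(query_vec, candidates[name]),
        reverse=True,
    )
    return ranked[:k]


# Hypothetical toy co-embedding: region sets and metadata labels
# live in the same 3-dimensional space.
region_sets = {
    "setA": [0.9, 0.1, 0.0],
    "setB": [0.1, 0.8, 0.2],
    "setC": [0.0, 0.2, 0.9],
}
labels = {
    "liver": [1.0, 0.0, 0.1],
    "k562": [0.0, 1.0, 0.1],
}

# Task 1: query string (label) -> related region sets.
hits_for_label = nearest(labels["liver"], region_sets)

# Task 2: region set -> suggested labels.
suggested_labels = nearest(region_sets["setB"], labels)

# Task 3: region set -> similar region sets (excluding the query itself).
others = {name: vec for name, vec in region_sets.items() if name != "setB"}
similar_sets = nearest(region_sets["setB"], others)
```

Because labels and region sets share one space, the same `nearest` function serves all three tasks; only the query and the candidate pool change.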

2.
Cell Rep ; 42(11): 113380, 2023 11 28.
Article in English | MEDLINE | ID: mdl-37950869

ABSTRACT

Coronary artery disease (CAD) is characterized by atherosclerotic plaque formation in the arterial wall. CAD progression involves complex interactions and phenotypic plasticity among vascular and immune cell lineages. Single-cell RNA-seq (scRNA-seq) studies have highlighted lineage-specific transcriptomic signatures, but human cell phenotypes remain controversial. Here, we perform an integrated meta-analysis of 22 scRNA-seq libraries to generate a comprehensive map of human atherosclerosis with 118,578 cells. Besides characterizing granular cell-type diversity and communication, we leverage this atlas to provide insights into smooth muscle cell (SMC) modulation. We integrate genome-wide association study data and uncover a critical role for modulated SMC phenotypes in CAD, myocardial infarction, and coronary calcification. Finally, we identify fibromyocyte/fibrochondrogenic SMC markers (LTBP1 and CRTAC1) as proxies of atherosclerosis progression and validate these through omics and spatial imaging analyses. Altogether, we create a unified atlas of human atherosclerosis informing cell state-specific mechanistic and translational studies of cardiovascular diseases.


Subjects
Atherosclerosis , Coronary Artery Disease , Myocardial Infarction , Plaque, Atherosclerotic , Humans , Genome-Wide Association Study , Atherosclerosis/genetics , Coronary Artery Disease/genetics , Myocytes, Smooth Muscle , Calcium-Binding Proteins/genetics
3.
bioRxiv ; 2023 Aug 18.
Article in English | MEDLINE | ID: mdl-37645717

ABSTRACT

Background: As biological data increases, we need additional infrastructure to share it and promote interoperability. While major effort has been put into sharing data, less emphasis has been placed on sharing metadata. Yet sharing metadata is also important, and in some ways has a wider scope than sharing data itself. Results: Here, we present PEPhub, an approach to improve sharing and interoperability of biological metadata. PEPhub provides an API, natural language search, and user-friendly web-based sharing and editing of sample metadata tables. We used PEPhub to process more than 100,000 published biological research projects and index them with fast semantic natural language search. PEPhub thus provides a fast and user-friendly way to find existing biological research data or to share new data. Availability: https://pephub.databio.org.

6.
Bioinformatics ; 39(4)2023 04 03.
Article in English | MEDLINE | ID: mdl-37067481

ABSTRACT

SUMMARY: Exclusion regions are sections of reference genomes with abnormal pileups of short sequencing reads. Removing reads overlapping them improves biological signal, and these benefits are most pronounced in differential analysis settings. Several labs created exclusion region sets, available primarily through ENCODE and GitHub. However, the variety of exclusion sets creates uncertainty about which sets to use. Furthermore, gap regions (e.g. centromeres, telomeres, short arms) create additional considerations in generating exclusion sets. We generated exclusion sets for the latest human T2T-CHM13 and mouse GRCm39 genomes and systematically assembled and annotated these and other sets in the excluderanges R/Bioconductor data package, also accessible via the BEDbase.org API. The package provides unified access to 82 GenomicRanges objects covering six organisms, multiple genome assemblies, and types of exclusion regions. For the human hg38 genome assembly, we recommend hg38.Kundaje.GRCh38_unified_blacklist as the most well-curated and annotated set, and for other organisms we recommend sets generated by the Blacklist tool. AVAILABILITY AND IMPLEMENTATION: https://bioconductor.org/packages/excluderanges/. Package website: https://dozmorovlab.github.io/excluderanges/.


Subjects
Genome, Human , Software , Animals , Humans , Mice , Uncertainty
8.
Bioinformatics ; 39(3)2023 03 01.
Article in English | MEDLINE | ID: mdl-36857584

ABSTRACT

MOTIVATION: The Gene Expression Omnibus (GEO) has become an important source of biological data for secondary analysis. However, there is no simple, programmatic way to download data and metadata from GEO in a standardized annotation format. RESULTS: To address this, we present GEOfetch, a command-line tool that downloads and organizes data and metadata from GEO and SRA. GEOfetch formats the downloaded metadata as a Portable Encapsulated Project, providing a universal format for the reanalysis of public data. AVAILABILITY AND IMPLEMENTATION: GEOfetch is available on Bioconda and the Python Package Index (PyPI).


Subjects
Gene Expression , Metadata , Computational Biology
9.
bioRxiv ; 2023 Jan 20.
Article in English | MEDLINE | ID: mdl-36711565

ABSTRACT

Rationale: Renin cells are essential for survival. They control the morphogenesis of the kidney arterioles, and the composition and volume of our extracellular fluid, arterial blood pressure, tissue perfusion, and oxygen delivery. It is known that renin cells and associated arteriolar cells descend from FoxD1+ progenitor cells, yet renin cells remain challenging to study, due in no small part to their rarity within the kidney. As such, the molecular mechanisms underlying the differentiation and maintenance of these cells remain insufficiently understood. Objective: We sought to comprehensively evaluate the chromatin states and transcription factors (TFs) that drive the differentiation of FoxD1+ progenitor cells into those that compose the kidney vasculature, with a focus on renin cells. Methods and Results: We isolated single nuclei of FoxD1+ progenitor cells and their descendants from FoxD1cre/+;R26R-mTmG mice at embryonic day 12 (E12) (n = 1,234 cells), embryonic day 18 (E18) (n = 3,696 cells), postnatal day 5 (P5) (n = 1,986 cells), and postnatal day 30 (P30) (n = 1,196 cells). Using integrated scRNA-seq and scATAC-seq, we established the developmental trajectory that leads to the mosaic of cells that compose the kidney arterioles, and specifically identified the factors that determine the elusive, myo-endocrine adult renin-secreting juxtaglomerular (JG) cell. We confirmed the role of Nfix in JG cell development and renin expression, and identified the myocyte enhancer factor-2 (MEF2) family of TFs as putative drivers of JG cell differentiation. Conclusions: We provide the first developmental trajectory of renin cell differentiation as they become JG cells in a single-cell atlas of kidney vascular open chromatin, and highlight novel factors important for their stage-specific differentiation. This improved understanding of the regulatory landscape of renin-expressing JG cells is necessary to better understand the control and function of this rare cell population, as overactivation or aberrant activity of the RAS is a key factor in cardiovascular and kidney pathologies.

11.
Bioinform Adv ; 2(1): vbac030, 2022.
Article in English | MEDLINE | ID: mdl-35669346

ABSTRACT

Summary: Properly and effectively managing reference datasets is an important task for many bioinformatics analyses. Refgenie is a reference asset management system that allows users to easily organize, retrieve and share such datasets. Here, we describe the integration of refgenie into the Galaxy platform. Server administrators are able to configure Galaxy to make use of reference datasets made available on a refgenie instance. In addition, a Galaxy Data Manager tool has been developed to provide a graphical interface to refgenie's remote reference retrieval functionality. A large collection of reference datasets has also been made available using the CVMFS (CernVM File System) repository from GalaxyProject.org, with mirrors across the USA, Canada, Europe and Australia, enabling easy use outside of Galaxy. Availability and implementation: The ability of Galaxy to use refgenie assets was added to the core Galaxy framework in version 22.01, which is available from https://github.com/galaxyproject/galaxy under the Academic Free License version 3.0. The refgenie Data Manager tool can be installed via the Galaxy ToolShed, with source code managed at https://github.com/BlankenbergLab/galaxy-tools-blankenberg/tree/main/data_managers/data_manager_refgenie_pull and released using an MIT license. Access to existing data is also available through CVMFS, with instructions at https://galaxyproject.org/admin/reference-data-repo/. No new data were generated or analyzed in support of this research.

12.
Proc Natl Acad Sci U S A ; 119(26): e2201267119, 2022 06 28.
Article in English | MEDLINE | ID: mdl-35733248

ABSTRACT

Delineating gene regulatory networks that orchestrate cell-type specification is a continuing challenge for developmental biologists. Single-cell analyses offer opportunities to address these challenges and accelerate discovery of rare cell lineage relationships and mechanisms underlying hierarchical lineage decisions. Here, we describe the molecular analysis of mouse pancreatic endocrine cell differentiation using single-cell transcriptomics, chromatin accessibility assays coupled to genetic labeling, and cytometry-based cell purification. We uncover transcription factor networks that delineate β-, α-, and δ-cell lineages. Through genomic footprint analysis, we identify transcription factor-regulatory DNA interactions governing pancreatic cell development at unprecedented resolution. Our analysis suggests that the transcription factor Neurog3 may act as a pioneer transcription factor to specify the pancreatic endocrine lineage. These findings could improve protocols to generate replacement endocrine cells from renewable sources, like stem cells, for diabetes therapy.


Subjects
Basic Helix-Loop-Helix Transcription Factors , Chromatin , Islets of Langerhans , Nerve Tissue Proteins , Transcriptome , Animals , Basic Helix-Loop-Helix Transcription Factors/genetics , Basic Helix-Loop-Helix Transcription Factors/metabolism , Cell Differentiation/genetics , Cell Lineage/genetics , Chromatin/genetics , Chromatin/metabolism , Gene Expression Regulation, Developmental , Islets of Langerhans/growth & development , Islets of Langerhans/metabolism , Mice , Nerve Tissue Proteins/genetics , Nerve Tissue Proteins/metabolism , Single-Cell Analysis
13.
BMC Genomics ; 23(1): 299, 2022 Apr 12.
Article in English | MEDLINE | ID: mdl-35413804

ABSTRACT

BACKGROUND: Epigenome analysis relies on defined sets of genomic regions output by widely used assays such as ChIP-seq and ATAC-seq. Statistical analysis and visualization of genomic region sets are essential to answer biological questions in gene regulation. As the epigenomics community continues generating data, there will be an increasing need for software tools that can efficiently deal with more abundant and larger genomic region sets. Here, we introduce GenomicDistributions, an R package for fast and easy summarization and visualization of genomic region data. RESULTS: GenomicDistributions offers a broad selection of functions to calculate properties of genomic region sets, such as feature distances, genomic partition overlaps, and more. GenomicDistributions functions are meticulously optimized for best-in-class speed and generally outperform comparable functions in existing R packages. GenomicDistributions also offers plotting functions that produce editable ggplot objects. All GenomicDistributions functions follow a uniform naming scheme and can handle either single or multiple region set inputs. CONCLUSIONS: GenomicDistributions offers a fast and scalable tool for exploratory genomic region set analysis and visualization. GenomicDistributions excels in user-friendliness, flexibility of outputs, breadth of functions, and computational performance. GenomicDistributions is available from Bioconductor (https://bioconductor.org/packages/release/bioc/html/GenomicDistributions.html).


Subjects
Genomics , Software , Chromatin Immunoprecipitation Sequencing , Epigenomics , Genome
14.
Cell Rep Methods ; 2(1)2022 01 24.
Article in English | MEDLINE | ID: mdl-35211690

ABSTRACT

We present a data integration framework that uses non-negative matrix factorization of patient-similarity networks to integrate continuous multi-omics datasets for molecular subtyping. It is demonstrated to have the capability to handle missing data without using imputation and to be consistently among the best in detecting subtypes with differential prognosis and enrichment of clinical associations in a large number of cancers. When applying the approach to data from individuals with lower-grade gliomas, we identify a subtype with a significantly worse prognosis. Tumors assigned to this subtype are hypomethylated genome wide with a gain of AP-1 occupancy in demethylated distal enhancers. The tumors are also enriched for somatic chromosome 7 (chr7) gain, chr10 loss, and other molecular events that have been suggested as diagnostic markers for "IDH wild type, with molecular features of glioblastoma" by the cIMPACT-NOW consortium but have yet to be included in the World Health Organization (WHO) guidelines.


Subjects
Glioblastoma , Glioma , Humans , Multiomics , Glioma/diagnosis , Glioblastoma/diagnosis , Prognosis , Chromosome Aberrations
15.
Gigascience ; 10(12)2021 12 06.
Article in English | MEDLINE | ID: mdl-34890448

ABSTRACT

BACKGROUND: Organizing and annotating biological sample data is critical in data-intensive bioinformatics. Unfortunately, metadata formats from a data provider are often incompatible with requirements of a processing tool. There is no broadly accepted standard to organize metadata across biological projects and bioinformatics tools, restricting the portability and reusability of both annotated datasets and analysis software. RESULTS: To address this, we present the Portable Encapsulated Project (PEP) specification, a formal specification for biological sample metadata structure. The PEP specification accommodates typical features of data-intensive bioinformatics projects with many biological samples. In addition to standardization, the PEP specification provides descriptors and modifiers for project-level and sample-level metadata, which improve portability across both computing environments and data processing tools. PEPs include a schema validator framework, allowing formal definition of required metadata attributes for data analysis broadly. We have implemented packages for reading PEPs in both Python and R to provide a language-agnostic interface for organizing project metadata. CONCLUSIONS: The PEP specification is an important step toward unifying data annotation and processing tools in data-intensive biological research projects. Links to tools and documentation are available at http://pep.databio.org/.


Subjects
Metadata , Software , Computational Biology , Documentation
16.
NAR Genom Bioinform ; 3(4): lqab101, 2021 Dec.
Article in English | MEDLINE | ID: mdl-34859208

ABSTRACT

As chromatin accessibility data from ATAC-seq experiments continue to expand, there is a continuing need for standardized analysis pipelines. Here, we present PEPATAC, an ATAC-seq pipeline that is easily applied to ATAC-seq projects of any size, from one-off experiments to large-scale sequencing projects. PEPATAC leverages unique features of ATAC-seq data to optimize for speed and accuracy, and it provides several unique analytical approaches. Output includes convenient quality control plots, summary statistics, and a variety of generally useful data formats to set the groundwork for subsequent project-specific data analysis. Downstream analysis is simplified by a standard definition format, modularity of components, and metadata APIs in R and Python. It is restartable, fault-tolerant, and can be run on local hardware, using any cluster resource manager, or in provided Linux containers. We also demonstrate the advantage of aligning to the mitochondrial genome serially, which improves the accuracy of alignment statistics and quality control metrics. PEPATAC is a robust and portable first step for any ATAC-seq project. BSD2-licensed code and documentation are available at https://pepatac.databio.org.

17.
Genome Biol ; 22(1): 238, 2021 08 20.
Article in English | MEDLINE | ID: mdl-34416909

ABSTRACT

Functional genomics experiments, like ChIP-Seq or ATAC-Seq, produce results that are summarized as a region set. There is no way to objectively evaluate the effectiveness of region set similarity metrics. We present Bedshift, a tool for perturbing BED files by randomly shifting, adding, and dropping regions from a reference file. The perturbed files can be used to benchmark similarity metrics, as well as for other applications. We highlight differences in behavior between metrics, such as that the Jaccard score is most sensitive to added or dropped regions, while coverage score is most sensitive to shifted regions.


Subjects
Genome , Genomics/methods , Software , Chromatin Immunoprecipitation Sequencing , HCT116 Cells , Humans
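
To make the Bedshift benchmarking idea concrete, the sketch below perturbs a reference region set and scores each perturbed copy against the original. It does not use Bedshift itself. The base-pair Jaccard definition, the single-chromosome half-open intervals, and the deterministic shift and drop helpers are assumptions for illustration; the paper's exact score definitions may differ.

```python
def merge(intervals):
    """Merge overlapping or touching half-open intervals (start, end)."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return [(s, e) for s, e in merged]


def total_bp(intervals):
    """Number of base pairs covered by a set of intervals."""
    return sum(end - start for start, end in merge(intervals))


def jaccard(a, b):
    """Base-pair Jaccard score: covered intersection over covered union."""
    union = total_bp(list(a) + list(b))
    if union == 0:
        return 0.0
    # Inclusion-exclusion: |A and B| = |A| + |B| - |A or B|.
    intersection = total_bp(a) + total_bp(b) - union
    return intersection / union


def shift_all(intervals, offset):
    """Shift every region by a fixed offset (a deterministic stand-in
    for random shifting)."""
    return [(start + offset, end + offset) for start, end in intervals]


def drop_last(intervals, n):
    """Drop the last n regions (a deterministic stand-in for random drops)."""
    return list(intervals[: len(intervals) - n])


# Hypothetical reference region set on one chromosome.
reference = [(100, 200), (300, 400), (500, 600)]
shifted = shift_all(reference, 50)
dropped = drop_last(reference, 1)

score_identical = jaccard(reference, reference)
score_shifted = jaccard(reference, shifted)
score_dropped = jaccard(reference, dropped)
```

Running several metrics over the same set of perturbed files is what lets their sensitivities to shifts, additions, and drops be compared side by side.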
18.
Bioinformatics ; 38(1): 299-300, 2021 12 22.
Article in English | MEDLINE | ID: mdl-34260694

ABSTRACT

MOTIVATION: Reference sequences are essential in creating a baseline of knowledge for many common bioinformatics methods, especially those using genomic sequencing. RESULTS: We have created refget, a Global Alliance for Genomics and Health API specification to access reference sequences and sub-sequences using an identifier derived from the sequence itself. We present four reference implementations across in-house and cloud infrastructure, a compliance suite and a web report used to ensure specification conformity across implementations. AVAILABILITY AND IMPLEMENTATION: The refget specification can be found at: https://w3id.org/ga4gh/refget. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.


Subjects
Genomics , Software
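
The core idea in the refget abstract, an identifier derived from the sequence itself, can be sketched with a content hash. The digest scheme below (SHA-512 truncated to 24 bytes, base64url-encoded) and the in-memory `SequenceStore` class are assumptions made for illustration; consult the specification for the normative algorithm and API.

```python
import base64
import hashlib


def sequence_digest(sequence: str) -> str:
    """Derive an identifier from sequence content alone.

    Assumption: uppercase the sequence, hash with SHA-512, keep the
    first 24 bytes, and base64url-encode them.
    """
    normalized = sequence.upper().encode("ascii")
    truncated = hashlib.sha512(normalized).digest()[:24]
    return base64.urlsafe_b64encode(truncated).decode("ascii")


class SequenceStore:
    """Hypothetical in-memory store keyed by content-derived identifiers."""

    def __init__(self):
        self._sequences = {}

    def add(self, sequence: str) -> str:
        """Store a sequence and return its identifier."""
        identifier = sequence_digest(sequence)
        self._sequences[identifier] = sequence.upper()
        return identifier

    def get(self, identifier: str, start=None, end=None) -> str:
        """Return the full sequence, or the half-open slice [start, end)."""
        return self._sequences[identifier][start:end]


store = SequenceStore()
seq_id = store.add("ACGTACGTTTGA")
sub_sequence = store.get(seq_id, start=4, end=8)
```

Because the identifier depends only on the sequence content, two servers holding the same sequence produce the same key without coordinating on names.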
19.
Bioinformatics ; 37(23): 4299-4306, 2021 12 07.
Article in English | MEDLINE | ID: mdl-34156475

ABSTRACT

MOTIVATION: Genomic region sets summarize functional genomics data and define locations of interest in the genome such as regulatory regions or transcription factor binding sites. The number of publicly available region sets has increased dramatically, leading to challenges in data analysis. RESULTS: We propose a new method to represent genomic region sets as vectors, or embeddings, using an adapted word2vec approach. We compared our approach to two simpler methods based on interval unions or term frequency-inverse document frequency and evaluated the methods in three ways: First, by classifying the cell line, antibody or tissue type of the region set; second, by assessing whether similarity among embeddings can reflect simulated random perturbations of genomic regions; and third, by testing robustness of the proposed representations to different signal thresholds for calling peaks. Our word2vec-based region set embeddings reduce dimensionality from more than a hundred thousand to 100 without significant loss in classification performance. The vector representation could identify cell line, antibody and tissue type with over 90% accuracy. We also found that the vectors could quantitatively summarize simulated random perturbations to region sets and are more robust to subsampling the data derived from different peak calling thresholds. Our evaluations demonstrate that the vectors retain useful biological information in relatively lower-dimensional spaces. We propose that vector representation of region sets is a promising approach for efficient analysis of genomic region data. AVAILABILITY AND IMPLEMENTATION: https://github.com/databio/regionset-embedding. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.


Subjects
Genomics , Protein Binding
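
One of the simpler baselines mentioned in the region set embedding abstract, term frequency-inverse document frequency (TF-IDF), can be sketched by treating each region set as a document whose words are region tokens. This is an illustrative sketch only: the token strings, the assumption that regions are already mapped onto a shared universe, and the unsmoothed weighting formula are not taken from the paper.

```python
import math
from collections import Counter


def tfidf(documents):
    """Compute TF-IDF weights.

    documents: dict mapping a region set name to its list of region tokens.
    Returns a dict mapping each name to a {token: weight} dictionary.
    """
    n_docs = len(documents)

    # Document frequency: in how many region sets each token appears.
    doc_freq = Counter()
    for tokens in documents.values():
        doc_freq.update(set(tokens))

    vectors = {}
    for name, tokens in documents.items():
        counts = Counter(tokens)
        total = len(tokens)
        vectors[name] = {
            token: (count / total) * math.log(n_docs / doc_freq[token])
            for token, count in counts.items()
        }
    return vectors


# Hypothetical region sets, already tokenized against a shared universe.
region_sets = {
    "set1": ["chr1:0-1000", "chr1:1000-2000"],
    "set2": ["chr1:0-1000", "chr2:0-1000"],
    "set3": ["chr1:0-1000", "chr2:0-1000", "chr3:0-1000"],
}

weights = tfidf(region_sets)
```

A region present in every set receives zero weight, while a region unique to one set dominates that set's vector. The resulting vectors have one dimension per universe region, which is the high dimensionality that the word2vec-based embeddings in the abstract are meant to reduce.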
20.
Nat Commun ; 12(1): 3230, 2021 05 28.
Article in English | MEDLINE | ID: mdl-34050156

ABSTRACT

Sequencing of cell-free DNA in the blood of cancer patients (liquid biopsy) provides attractive opportunities for early diagnosis, assessment of treatment response, and minimally invasive disease monitoring. To unlock liquid biopsy analysis for pediatric tumors with few genetic aberrations, we introduce an integrated genetic/epigenetic analysis method and demonstrate its utility on 241 deep whole-genome sequencing profiles of 95 patients with Ewing sarcoma and 31 patients with other pediatric sarcomas. Our method achieves sensitive detection and classification of circulating tumor DNA in peripheral blood independent of any genetic alterations. Moreover, we benchmark different metrics for cell-free DNA fragmentation analysis, and we introduce the LIQUORICE algorithm for detecting circulating tumor DNA based on cancer-specific chromatin signatures. Finally, we combine several fragmentation-based metrics into an integrated machine learning classifier for liquid biopsy analysis that exploits widespread epigenetic deregulation and is tailored to cancers with low mutation rates. Clinical associations highlight the potential value of cfDNA fragmentation patterns as prognostic biomarkers in Ewing sarcoma. In summary, our study provides a comprehensive analysis of circulating tumor DNA beyond recurrent genetic aberrations, and it renders the benefits of liquid biopsy more readily accessible for childhood cancers.


Subjects
Biomarkers, Tumor/blood , Bone Neoplasms/diagnosis , Circulating Tumor DNA/blood , Sarcoma, Ewing/diagnosis , Adolescent , Adult , Biomarkers, Tumor/genetics , Bone Neoplasms/blood , Bone Neoplasms/genetics , Bone Neoplasms/pathology , Case-Control Studies , Child , Child, Preschool , Circulating Tumor DNA/genetics , DNA Mutational Analysis , Female , Humans , Infant , Liquid Biopsy/methods , Male , Middle Aged , Mutation , Sarcoma, Ewing/blood , Sarcoma, Ewing/genetics , Sarcoma, Ewing/pathology , Whole Genome Sequencing , Young Adult