1.
Cell ; 176(1-2): 377-390.e19, 2019 01 10.
Article in English | MEDLINE | ID: mdl-30612741

ABSTRACT

Over one million candidate regulatory elements have been identified across the human genome, but nearly all are unvalidated and their target genes uncertain. Approaches based on human genetics are limited in scope to common variants and in resolution by linkage disequilibrium. We present a multiplex, expression quantitative trait locus (eQTL)-inspired framework for mapping enhancer-gene pairs by introducing random combinations of CRISPR/Cas9-mediated perturbations to each of many cells, followed by single-cell RNA sequencing (RNA-seq). Across two experiments, we used dCas9-KRAB to perturb 5,920 candidate enhancers with no strong a priori hypothesis as to their target gene(s), measuring effects by profiling 254,974 single-cell transcriptomes. We identified 664 (470 high-confidence) cis enhancer-gene pairs, which were enriched for specific transcription factors, non-housekeeping status, and genomic and 3D conformational proximity to their target genes. This framework will facilitate the large-scale mapping of enhancer-gene regulatory interactions, a critical yet largely uncharted component of the cis-regulatory landscape of the human genome.
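
The core comparison behind the eQTL-inspired framework described above can be sketched roughly as follows: because each cell carries a random combination of perturbations, cells can be split by whether they carry a given enhancer perturbation, and target-gene expression compared between the two groups. This is only an illustration under stated assumptions. The function name, the toy cells, and the use of a plain ratio of means are all hypothetical; a real analysis would need a proper statistical test and multiple-testing correction.

```python
from statistics import mean

def expression_ratio(cells, enhancer):
    """Ratio of mean target-gene expression: perturbed cells / unperturbed cells.

    cells: list of (set_of_perturbed_enhancers, target_gene_count) pairs.
    """
    perturbed = [count for guides, count in cells if enhancer in guides]
    unperturbed = [count for guides, count in cells if enhancer not in guides]
    if not perturbed or not unperturbed:
        raise ValueError("need at least one cell in each group")
    return mean(perturbed) / mean(unperturbed)

# Toy data (hypothetical): each cell carries a random combination of perturbations.
cells = [
    ({"e1", "e2"}, 2),
    ({"e1"}, 1),
    ({"e3"}, 6),
    ({"e2", "e3"}, 4),
    (set(), 5),
]

ratio_e1 = expression_ratio(cells, "e1")  # cells with e1 vs. cells without e1
ratio_e2 = expression_ratio(cells, "e2")
print(ratio_e1, ratio_e2)
```

A ratio well below 1 for a candidate enhancer would be the kind of signal worth testing formally; the same cells are reused for every enhancer, which is what makes the design multiplexed.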


Subject(s)
Chromosome Mapping/methods; Enhancer Elements, Genetic/genetics; Gene Expression Regulation/genetics; CRISPR-Cas Systems/genetics; Clustered Regularly Interspaced Short Palindromic Repeats/genetics; Gene Expression Profiling; Gene Regulatory Networks/genetics; Genome, Human; Genome-Wide Association Study; Genomics; Humans; Quantitative Trait Loci; Transcription Factors/genetics
2.
Mol Cell ; 83(15): 2624-2640, 2023 08 03.
Article in English | MEDLINE | ID: mdl-37419111

ABSTRACT

The four-dimensional nucleome (4DN) consortium studies the architecture of the genome and the nucleus in space and time. We summarize progress by the consortium and highlight the development of technologies for (1) mapping genome folding and identifying roles of nuclear components and bodies, proteins, and RNA, (2) characterizing nuclear organization with time or single-cell resolution, and (3) imaging of nuclear organization. With these tools, the consortium has provided over 2,000 public datasets. Integrative computational models based on these data are starting to reveal connections between genome structure and function. We then present a forward-looking perspective and outline current aims to (1) delineate dynamics of nuclear architecture at different timescales, from minutes to weeks as cells differentiate, in populations and in single cells, (2) characterize cis-determinants and trans-modulators of genome organization, (3) test functional consequences of changes in cis- and trans-regulators, and (4) develop predictive models of genome structure and function.


Subject(s)
Cell Nucleus; Genome; Genome/genetics; Cell Nucleus/genetics; Cell Nucleus/metabolism; Chromatin/metabolism
4.
Nat Rev Genet ; 23(3): 169-181, 2022 03.
Article in English | MEDLINE | ID: mdl-34837041

ABSTRACT

The scale of genetic, epigenomic, transcriptomic, cheminformatic and proteomic data available today, coupled with easy-to-use machine learning (ML) toolkits, has propelled the application of supervised learning in genomics research. However, the assumptions behind the statistical models and performance evaluations in ML software frequently are not met in biological systems. In this Review, we illustrate the impact of several common pitfalls encountered when applying supervised ML in genomics. We explore how the structure of genomics data can bias performance evaluations and predictions. To address the challenges associated with applying cutting-edge ML methods to genomics, we describe solutions and appropriate use cases where ML modelling shows great potential.


Subject(s)
Genomics/methods; Machine Learning; Animals; Genomics/standards; Genomics/trends; Humans; Machine Learning/standards; Models, Statistical; Software
5.
Mol Cell ; 78(5): 890-902.e6, 2020 06 04.
Article in English | MEDLINE | ID: mdl-32416068

ABSTRACT

Acidic transcription activation domains (ADs) are encoded by a wide range of seemingly unrelated amino acid sequences, making it difficult to recognize features that promote their dynamic behavior, "fuzzy" interactions, and target specificity. We screened a large set of random 30-mer peptides for AD function in yeast and trained a deep neural network (ADpred) on the AD-positive and -negative sequences. ADpred identifies known acidic ADs within transcription factors and accurately predicts the consequences of mutations. Our work reveals that strong acidic ADs contain multiple clusters of hydrophobic residues near acidic side chains, explaining why ADs often have a biased amino acid composition. ADs likely use a binding mechanism similar to avidity where a minimum number of weak dynamic interactions are required between activator and target to generate biologically relevant affinity and in vivo function. This mechanism explains the basis for fuzzy binding observed between acidic ADs and targets.


Subject(s)
High-Throughput Screening Assays/methods; Transcription Factors/genetics; Transcriptional Activation/genetics; Amino Acid Sequence/genetics; Basic-Leucine Zipper Transcription Factors/genetics; DNA-Binding Proteins/metabolism; Deep Learning; Protein Binding; Protein Domains/genetics; Saccharomyces cerevisiae/genetics; Saccharomyces cerevisiae Proteins/genetics; Saccharomyces cerevisiae Proteins/metabolism; Trans-Activators/genetics; Trans-Activators/metabolism; Transcription Factors/metabolism; Transcriptional Activation/physiology
6.
Mol Cell ; 76(4): 676-690.e10, 2019 11 21.
Article in English | MEDLINE | ID: mdl-31495564

ABSTRACT

Conventional methods for single-cell genome sequencing are limited with respect to uniformity and throughput. Here, we describe sci-L3, a single-cell sequencing method that combines combinatorial indexing (sci-) and linear (L) amplification. The sci-L3 method adopts a 3-level (3) indexing scheme that minimizes amplification biases while enabling exponential gains in throughput. We demonstrate the generalizability of sci-L3 with proof-of-concept demonstrations of single-cell whole-genome sequencing (sci-L3-WGS), targeted sequencing (sci-L3-target-seq), and a co-assay of the genome and transcriptome (sci-L3-RNA/DNA). We apply sci-L3-WGS to profile the genomes of >10,000 sperm and sperm precursors from F1 hybrid mice, mapping 86,786 crossovers and characterizing rare chromosome mis-segregation events in meiosis, including instances of whole-genome equational chromosome segregation. We anticipate that sci-L3 assays can be applied to fully characterize recombination landscapes, to couple CRISPR perturbations and measurements of genome stability, and to other goals requiring high-throughput, high-coverage single-cell sequencing.
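
To see why adding an indexing level gives exponential gains in throughput, a rough arithmetic sketch may help: the number of distinguishable barcode combinations is the product of the number of barcodes used in each round. The per-round barcode counts below are hypothetical illustration values, not figures taken from the abstract.

```python
from math import prod

def barcode_combinations(barcodes_per_round):
    """Number of distinct barcode combinations across all indexing rounds."""
    return prod(barcodes_per_round)

# Hypothetical barcode counts per round, for illustration only.
two_level = barcode_combinations([96, 96])
three_level = barcode_combinations([96, 96, 96])
print(two_level, three_level)
```

With these assumed values, a third round multiplies the combination space by another factor of 96, which is what allows many more cells to be processed while keeping barcode collisions rare.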


Subject(s)
Gene Expression Profiling; High-Throughput Nucleotide Sequencing; Nucleic Acid Amplification Techniques; Sequence Analysis, DNA; Sequence Analysis, RNA; Single-Cell Analysis/methods; Whole Genome Sequencing; Animals; Chromosome Segregation; Male; Meiosis/genetics; Mice; Proof of Concept Study; Spermatozoa/physiology; Transcriptome; Workflow
7.
Nature ; 583(7818): 699-710, 2020 07.
Article in English | MEDLINE | ID: mdl-32728249

ABSTRACT

The human and mouse genomes contain instructions that specify RNAs and proteins and govern the timing, magnitude, and cellular context of their production. To better delineate these elements, phase III of the Encyclopedia of DNA Elements (ENCODE) Project has expanded analysis of the cell and tissue repertoires of RNA transcription, chromatin structure and modification, DNA methylation, chromatin looping, and occupancy by transcription factors and RNA-binding proteins. Here we summarize these efforts, which have produced 5,992 new experimental datasets, including systematic determinations across mouse fetal development. All data are available through the ENCODE data portal (https://www.encodeproject.org), including phase II ENCODE and Roadmap Epigenomics data. We have developed a registry of 926,535 human and 339,815 mouse candidate cis-regulatory elements, covering 7.9 and 3.4% of their respective genomes, by integrating selected datatypes associated with gene regulation, and constructed a web-based server (SCREEN; http://screen.encodeproject.org) to provide flexible, user-defined access to this resource. Collectively, the ENCODE data and registry provide an expansive resource for the scientific community to build a better understanding of the organization and function of the human and mouse genomes.


Subject(s)
DNA/genetics; Databases, Genetic; Genome/genetics; Genomics; Molecular Sequence Annotation; Registries; Regulatory Sequences, Nucleic Acid/genetics; Animals; Chromatin/genetics; Chromatin/metabolism; DNA/chemistry; DNA Footprinting; DNA Methylation/genetics; DNA Replication Timing; Deoxyribonuclease I/metabolism; Genome, Human; Histones/metabolism; Humans; Mice; Mice, Transgenic; RNA-Binding Proteins/genetics; Transcription, Genetic/genetics; Transposases/metabolism
8.
Bioinformatics ; 40(Supplement_1): i471-i480, 2024 Jun 28.
Article in English | MEDLINE | ID: mdl-38940142

ABSTRACT

MOTIVATION: High-resolution Hi-C contact matrices reveal the detailed three-dimensional architecture of the genome, but high-coverage experimental Hi-C data are expensive to generate. Simultaneously, chromatin structure analyses struggle with extremely sparse contact matrices. To address this problem, computational methods to enhance low-coverage contact matrices have been developed, but existing methods are largely based on resolution enhancement methods for natural images and hence often employ models that do not distinguish between biologically meaningful contacts, such as loops and other stochastic contacts. RESULTS: We present Capricorn, a machine learning model for Hi-C resolution enhancement that incorporates small-scale chromatin features as additional views of the input Hi-C contact matrix and leverages a diffusion probability model backbone to generate a high-coverage matrix. We show that Capricorn outperforms the state of the art in a cross-cell-line setting, improving on existing methods by 17% in mean squared error and 26% in F1 score for chromatin loop identification from the generated high-coverage data. We also demonstrate that Capricorn performs well in the cross-chromosome setting and cross-chromosome, cross-cell-line setting, improving the downstream loop F1 score by 14% relative to existing methods. We further show that our multiview idea can also be used to improve several existing methods, HiCARN and HiCNN, indicating the wide applicability of this approach. Finally, we use DNA sequence to validate discovered loops and find that the fraction of CTCF-supported loops from Capricorn is similar to those identified from the high-coverage data. Capricorn is a powerful Hi-C resolution enhancement method that enables scientists to find chromatin features that cannot be identified in the low-coverage contact matrix. 
AVAILABILITY AND IMPLEMENTATION: Implementation of Capricorn and source code for reproducing all figures in this paper are available at https://github.com/CHNFTQ/Capricorn.
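
The loop-identification metric quoted above, the F1 score, can be sketched as follows. This is a minimal illustration that assumes loops are represented as (start_bin, end_bin) pairs and count as matches only when their coordinates are identical; published evaluations may use a positional tolerance instead. The function name and toy loops are hypothetical.

```python
def loop_f1(predicted, reference):
    """F1 score between two collections of loops, using exact coordinate matches."""
    predicted, reference = set(predicted), set(reference)
    total = len(predicted) + len(reference)
    if total == 0:
        return 0.0  # no loops on either side: treat the score as undefined/zero
    true_positives = len(predicted & reference)
    # Equivalent to 2 * precision * recall / (precision + recall).
    return 2 * true_positives / total

# Toy loop calls (hypothetical bin coordinates).
predicted_loops = [(10, 50), (20, 80), (30, 90)]
reference_loops = [(10, 50), (20, 80), (40, 100), (60, 120)]
f1 = loop_f1(predicted_loops, reference_loops)
print(round(f1, 4))
```

Here two of three predicted loops are found among four reference loops, so precision is 2/3, recall is 1/2, and the F1 score is 4/7.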


Subject(s)
Chromatin; Machine Learning; Chromatin/chemistry; Chromatin/metabolism; Humans; Computational Biology/methods; Algorithms; Software
9.
J Proteome Res ; 23(6): 1894-1906, 2024 Jun 07.
Article in English | MEDLINE | ID: mdl-38652578

ABSTRACT

Searching for tandem mass spectrometry proteomics data against a database is a well-established method for assigning peptide sequences to observed spectra but typically cannot identify peptides harboring unexpected post-translational modifications (PTMs). Open modification searching aims to address this problem by allowing a spectrum to match a peptide even if the spectrum's precursor mass differs from the peptide mass. However, expanding the search space in this way can lead to a loss of statistical power to detect peptides. We therefore developed a method, called CONGA (combining open and narrow searches with group-wise analysis), that takes into account results from both types of searches (a traditional "narrow window" search and an open modification search) while carrying out rigorous false discovery rate control. The result is an algorithm that provides the best of both worlds: the ability to detect unexpected PTMs without a concomitant loss of power to detect unmodified peptides.
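
The difference between the two search types can be sketched as a simple mass-window check. This only illustrates the precursor-mass criterion; it is not CONGA itself and performs no false discovery rate control. The function name and both tolerance values are hypothetical.

```python
def match_type(precursor_mass, peptide_mass, narrow_tol=0.05, open_window=500.0):
    """Classify a candidate spectrum-peptide pairing by precursor mass difference.

    Masses and tolerances are in daltons; the tolerance defaults are
    illustrative assumptions, not recommended settings.
    """
    delta = abs(precursor_mass - peptide_mass)
    if delta <= narrow_tol:
        return "narrow"  # a traditional narrow-window search would consider it
    if delta <= open_window:
        return "open"    # only an open modification search would consider it
    return None          # outside both windows

print(match_type(1000.50, 1000.50))  # unmodified peptide
print(match_type(1080.47, 1000.50))  # mass shift of about 80 Da, e.g. a phosphorylation
print(match_type(2000.00, 1000.00))  # too far apart for either search
```

The wider window is what lets an open search reach unexpected modifications, and it is also what enlarges the candidate set and reduces statistical power.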


Subject(s)
Algorithms; Databases, Protein; Protein Processing, Post-Translational; Proteomics; Tandem Mass Spectrometry; Tandem Mass Spectrometry/methods; Proteomics/methods; Peptides/analysis; Peptides/chemistry; Humans; Software; Amino Acid Sequence
10.
Bioinformatics ; 39(1)2023 01 01.
Article in English | MEDLINE | ID: mdl-36594573

ABSTRACT

MOTIVATION: We address the challenge of inferring a consensus 3D model of genome architecture from Hi-C data. Existing approaches most often rely on a two-step algorithm: first, convert the contact counts into distances, then optimize an objective function akin to multidimensional scaling (MDS) to infer a 3D model. Other approaches use a maximum likelihood approach, modeling the contact counts between two loci as a Poisson random variable whose intensity is a decreasing function of the distance between them. However, a Poisson model of contact counts implies that the variance of the data is equal to the mean, a relationship that is often too restrictive to properly model count data. RESULTS: We first confirm the presence of overdispersion in several real Hi-C datasets, and we show that the overdispersion arises even in simulated datasets. We then propose a new model, called Pastis-NB, where we replace the Poisson model of contact counts by a negative binomial one, which is parametrized by a mean and a separate dispersion parameter. The dispersion parameter allows the variance to be adjusted independently from the mean, thus better modeling overdispersed data. We compare the results of Pastis-NB to those of several previously published algorithms, both MDS-based and statistical methods. We show that the negative binomial inference yields more accurate structures on simulated data, and more robust structures than other models across real Hi-C replicates and across different resolutions. AVAILABILITY AND IMPLEMENTATION: A Python implementation of Pastis-NB is available at https://github.com/hiclib/pastis under the BSD license. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
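
The variance argument above can be made concrete with a short sketch. It assumes one common negative binomial parametrization, variance = mean + dispersion * mean**2; the exact form used by Pastis-NB is not given in the abstract and may differ. The function names are hypothetical.

```python
def poisson_variance(mean):
    """Under a Poisson model the variance is forced to equal the mean."""
    return mean

def negative_binomial_variance(mean, dispersion):
    """Assumed parametrization: variance = mean + dispersion * mean**2.

    dispersion = 0 recovers the Poisson case; larger values allow the
    variance to exceed the mean (overdispersion).
    """
    return mean + dispersion * mean ** 2

mean_count = 10.0
for dispersion in (0.0, 0.1, 0.5):
    print(dispersion, negative_binomial_variance(mean_count, dispersion))
```

With a mean contact count of 10, the Poisson model fixes the variance at 10, whereas under this assumed parametrization a dispersion of 0.5 allows a variance of 60.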


Subject(s)
Algorithms; Genome; Likelihood Functions
11.
J Proteome Res ; 22(11): 3427-3438, 2023 11 03.
Article in English | MEDLINE | ID: mdl-37861703

ABSTRACT

Quantitative measurements produced by tandem mass spectrometry proteomics experiments typically contain a large proportion of missing values. Missing values hinder reproducibility, reduce statistical power, and make it difficult to compare across samples or experiments. Although many methods exist for imputing missing values, in practice, the most commonly used methods are among the worst performing. Furthermore, previous benchmarking studies have focused on relatively simple measurements of error such as the mean-squared error between imputed and held-out values. Here we evaluate the performance of commonly used imputation methods using three practical, "downstream-centric" criteria. These criteria measure the ability to identify differentially expressed peptides, generate new quantitative peptides, and improve the peptide lower limit of quantification. Our evaluation comprises several experiment types and acquisition strategies, including data-dependent and data-independent acquisition. We find that imputation does not necessarily improve the ability to identify differentially expressed peptides but that it can identify new quantitative peptides and improve the peptide lower limit of quantification. We find that MissForest is generally the best performing method per our downstream-centric criteria. We also argue that existing imputation methods do not properly account for the variance of peptide quantifications and highlight the need for methods that do.
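
The simple error measure mentioned above, the mean-squared error between imputed and held-out values, can be sketched as follows. Mean imputation is used here only as a stand-in imputer, and the function names and quantification values are hypothetical.

```python
from statistics import mean

def mean_impute(observed, n_missing):
    """Fill every missing entry with the mean of the observed values."""
    fill = mean(observed)
    return [fill] * n_missing

def mean_squared_error(imputed, held_out):
    """Average squared difference between imputed and held-out values."""
    if len(imputed) != len(held_out):
        raise ValueError("imputed and held_out must have the same length")
    return sum((a - b) ** 2 for a, b in zip(imputed, held_out)) / len(held_out)

# Toy peptide quantifications (hypothetical, log scale).
observed = [22.0, 24.0, 26.0]  # values the imputer is allowed to see
held_out = [20.0, 26.0]        # values masked on purpose for evaluation
imputed = mean_impute(observed, len(held_out))
mse = mean_squared_error(imputed, held_out)
print(mse)
```

A low error on held-out values does not by itself show that imputation helps downstream analyses, which is the motivation for the downstream-centric criteria described in the abstract.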


Subject(s)
Algorithms; Proteomics; Proteomics/methods; Reproducibility of Results; Tandem Mass Spectrometry; Peptides/analysis
12.
J Proteome Res ; 22(7): 2172-2178, 2023 07 07.
Article in English | MEDLINE | ID: mdl-37261867

ABSTRACT

Controlling the false discovery rate (FDR) among discoveries from a tandem mass spectrometry proteomics experiment using target decoy competition (TDC) controls only the proportion of false discoveries in an average sense. Thus, for any particular analysis, even with a valid FDR control procedure, the proportion of false discoveries (the FDP) may be higher than the specified FDR threshold. We demonstrate this phenomenon using real data and describe two recently developed methods that help bridge the gap between controlling the expected or average rate of false discoveries and the empirical rate (FDP). The FDP Stepdown method controls the FDP at any desired confidence level, and the TDC Uniform Band provides a confidence, or upper prediction bound, on the FDP in TDC's list of discoveries.
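
A minimal sketch of how an FDR estimate is typically obtained from target-decoy competition is given below. It assumes the commonly used estimator (decoys + 1) / targets among discoveries scoring at or above a threshold; the abstract does not spell out a formula, and the function name and scores are hypothetical. The estimate concerns the average rate, so the actual proportion of false discoveries in any single analysis may be higher or lower.

```python
def tdc_fdr_estimate(psms, threshold):
    """Estimate the FDR among target discoveries scoring >= threshold.

    psms: list of (score, is_decoy) pairs, one per spectrum, holding the
    winner of the target-decoy competition for that spectrum.
    """
    targets = sum(1 for score, is_decoy in psms if score >= threshold and not is_decoy)
    decoys = sum(1 for score, is_decoy in psms if score >= threshold and is_decoy)
    if targets == 0:
        return 1.0  # no discoveries: report the most conservative value
    return min(1.0, (decoys + 1) / targets)

# Toy competition winners (hypothetical scores).
psms = [
    (9.0, False), (8.5, False), (8.0, False), (7.5, True),
    (7.0, False), (6.0, False), (5.5, True), (5.0, False),
]
fdr_at_6 = tdc_fdr_estimate(psms, 6.0)
fdr_at_8 = tdc_fdr_estimate(psms, 8.0)
print(fdr_at_6, fdr_at_8)
```

At a threshold of 6.0 there are five targets and one decoy, giving an estimate of 2/5; raising the threshold to 8.0 leaves three targets and no decoys, giving 1/3.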


Subject(s)
Algorithms; Proteomics; Databases, Protein; Proteomics/methods; Tandem Mass Spectrometry
15.
J Proteome Res ; 21(6): 1382-1391, 2022 06 03.
Article in English | MEDLINE | ID: mdl-35549345

ABSTRACT

Advances in library-based methods for peptide detection from data-independent acquisition (DIA) mass spectrometry have made it possible to detect and quantify tens of thousands of peptides in a single mass spectrometry run. However, many of these methods rely on a comprehensive, high-quality spectral library containing information about the expected retention time and fragmentation patterns of peptides in the sample. Empirical spectral libraries are often generated through data-dependent acquisition and may suffer from biases as a result. Spectral libraries can be generated in silico, but these models are not trained to handle all possible post-translational modifications. Here, we propose a false discovery rate-controlled spectrum-centric search workflow to generate spectral libraries directly from gas-phase fractionated DIA tandem mass spectrometry data. We demonstrate that this strategy is able to detect phosphorylated peptides and can be used to generate a spectral library for accurate peptide detection and quantitation in wide-window DIA data. We compare the results of this search workflow to other library-free approaches and demonstrate that our search is competitive in terms of accuracy and sensitivity. These results demonstrate that the proposed workflow has the capacity to generate spectral libraries while avoiding the limitations of other methods.


Subject(s)
Peptides; Tandem Mass Spectrometry; Peptide Library; Peptides/analysis; Protein Processing, Post-Translational; Proteome/analysis; Tandem Mass Spectrometry/methods; Workflow
16.
J Proteome Res ; 20(4): 1966-1971, 2021 04 02.
Article in English | MEDLINE | ID: mdl-33596079

ABSTRACT

Proteomics studies rely on the accurate assignment of peptides to the acquired tandem mass spectra, a task where machine learning algorithms have proven invaluable. We describe mokapot, which provides a flexible semisupervised learning algorithm that allows for highly customized analyses. We demonstrate some of the unique features of mokapot by improving the detection of RNA-cross-linked peptides from an analysis of RNA-binding proteins and increasing the consistency of peptide detection in a single-cell proteomics study.


Subject(s)
Peptides; Proteomics; Algorithms; Databases, Protein; Tandem Mass Spectrometry
17.
J Proteome Res ; 20(9): 4621-4624, 2021 09 03.
Article in English | MEDLINE | ID: mdl-34342226

ABSTRACT

The volume of proteomics and mass spectrometry data available in public repositories continues to grow at a rapid pace as more researchers embrace open science practices. Open access to the data behind scientific discoveries has become critical to validate published findings and develop new computational tools. Here, we present ppx, a Python package that provides easy, programmatic access to the data stored in ProteomeXchange repositories, such as PRIDE and MassIVE. The ppx package can be used as either a command line tool or a Python package to retrieve the files and metadata associated with a project when provided its identifier. To demonstrate how ppx enhances reproducible research, we used ppx within a Snakemake workflow to reanalyze a published data set with the open modification search tool ANN-SoLo and compared our reanalysis to the original results. We show that ppx readily integrates into workflows, and our reanalysis produced results consistent with the original analysis. We envision that ppx will be a valuable tool for creating reproducible analyses, providing tool developers easy access to data for development, testing, and benchmarking, and enabling the use of mass spectrometry data in data-intensive analyses. The ppx package is freely available and open source under the MIT license at https://github.com/wfondrie/ppx.


Subject(s)
Proteomics; Software; Mass Spectrometry; Metadata; Search Engine
18.
J Proteome Res ; 20(8): 4153-4164, 2021 08 06.
Article in English | MEDLINE | ID: mdl-34236864

ABSTRACT

The standard proteomics database search strategy involves searching spectra against a peptide database and estimating the false discovery rate (FDR) of the resulting set of peptide-spectrum matches. One assumption of this protocol is that all the peptides in the database are relevant to the hypothesis being investigated. However, in settings where researchers are interested in a subset of peptides, alternative search and FDR control strategies are needed. Recently, two methods were proposed to address this problem: subset-search and all-sub. We show that both methods fail to control the FDR. For subset-search, this failure is due to the presence of "neighbor" peptides, which are defined as irrelevant peptides with a similar precursor mass and fragmentation spectrum as a relevant peptide. Not considering neighbors compromises the FDR estimate because a spectrum generated by an irrelevant peptide can incorrectly match well to a relevant peptide. Therefore, we have developed a new method, "subset-neighbor search" (SNS), that accounts for neighbor peptides. We show evidence that SNS controls the FDR when neighbors are present and that SNS outperforms group-FDR, the only other method that appears to control the FDR relative to a subset of relevant peptides.


Subject(s)
Algorithms; Tandem Mass Spectrometry; Databases, Protein; Humans; Peptides; Proteomics
19.
Environ Microbiol ; 23(7): 3840-3866, 2021 07.
Article in English | MEDLINE | ID: mdl-33760340

ABSTRACT

Colwellia psychrerythraea is a marine psychrophilic bacterium known for its remarkable ability to maintain activity during long-term exposure to extreme subzero temperatures and correspondingly high salinities in sea ice. These microorganisms must have adaptations to both high salinity and low temperature to survive, be metabolically active, or grow in the ice. Here, we report on an experimental design that allowed us to monitor culturability, cell abundance, activity and proteomic signatures of C. psychrerythraea strain 34H (Cp34H) in subzero brines and supercooled sea water through long-term incubations under eight conditions with varying subzero temperatures, salinities and nutrient additions. Shotgun proteomics found novel metabolic strategies used to maintain culturability in response to each independent experimental variable, particularly in pathways regulating carbon, nitrogen and fatty acid metabolism. Statistical analysis of abundances of proteins uniquely identified in isolated conditions provides metabolism-specific protein biosignatures indicative of growth or survival in either increased salinity, decreased temperature, or nutrient limitation. Additionally, to aid in the search for extant life on other icy worlds, analysis of detected short peptides in -10°C incubations after 4 months identified over 500 potential biosignatures that could indicate the presence of terrestrial-like cold-active or halophilic metabolisms on other icy worlds.


Subject(s)
Alteromonadaceae; Proteomics; Alteromonadaceae/genetics; Biomarkers; Cold Temperature
20.
Methods ; 170: 61-68, 2020 01 01.
Article in English | MEDLINE | ID: mdl-31536770

ABSTRACT

The highly dynamic nature of chromosome conformation and three-dimensional (3D) genome organization leads to cell-to-cell variability in chromatin interactions within a cell population, even if the cells of the population appear to be functionally homogeneous. Hence, although Hi-C is a powerful tool for mapping 3D genome organization, this heterogeneity of chromosome higher order structure among individual cells limits the interpretive power of population-based bulk Hi-C assays. Moreover, single-cell studies have the potential to enable the identification and characterization of rare cell populations or cell subtypes in a heterogeneous population. However, it may require surveying relatively large numbers of single cells to achieve statistically meaningful observations in single-cell studies. By applying combinatorial cellular indexing to chromosome conformation capture, we developed single-cell combinatorial indexed Hi-C (sci-Hi-C), a high-throughput method that enables mapping chromatin interactomes in large numbers of single cells. We demonstrated the use of sci-Hi-C data to separate cells by karyotypic and cell-cycle state differences and to identify cellular variability in mammalian chromosomal conformation. Here, we provide a detailed description of method design and step-by-step working protocols for sci-Hi-C.


Subject(s)
Chromosome Mapping/methods; High-Throughput Nucleotide Sequencing/methods; Single-Cell Analysis/methods; Animals; Cell Line; Cell Nucleus/genetics; Cell Separation/methods; Chromatin/genetics; Chromatin/isolation & purification; Chromatin/metabolism; Computer Simulation; Gene Library; Humans; Mice; Nucleic Acid Conformation