Search | PAHO/WHO Uruguay

1.

A Genome-wide Framework for Mapping Gene Regulation via Cellular Genetic Screens.

Gasperini, Molly; Hill, Andrew J; McFaline-Figueroa, José L; Martin, Beth; Kim, Seungsoo; Zhang, Melissa D; Jackson, Dana; Leith, Anh; Schreiber, Jacob; Noble, William S; Trapnell, Cole; Ahituv, Nadav; Shendure, Jay.

Cell ; 176(1-2): 377-390.e19, 2019 01 10.

Article in English | MEDLINE | ID: mdl-30612741

ABSTRACT

Over one million candidate regulatory elements have been identified across the human genome, but nearly all are unvalidated and their target genes uncertain. Approaches based on human genetics are limited in scope to common variants and in resolution by linkage disequilibrium. We present a multiplex, expression quantitative trait locus (eQTL)-inspired framework for mapping enhancer-gene pairs by introducing random combinations of CRISPR/Cas9-mediated perturbations to each of many cells, followed by single-cell RNA sequencing (RNA-seq). Across two experiments, we used dCas9-KRAB to perturb 5,920 candidate enhancers with no strong a priori hypothesis as to their target gene(s), measuring effects by profiling 254,974 single-cell transcriptomes. We identified 664 (470 high-confidence) cis enhancer-gene pairs, which were enriched for specific transcription factors, non-housekeeping status, and genomic and 3D conformational proximity to their target genes. This framework will facilitate the large-scale mapping of enhancer-gene regulatory interactions, a critical yet largely uncharted component of the cis-regulatory landscape of the human genome.

Subject(s)

Chromosome Mapping/methods; Enhancer Elements, Genetic/genetics; Gene Expression Regulation/genetics; CRISPR-Cas Systems/genetics; Clustered Regularly Interspaced Short Palindromic Repeats/genetics; Gene Expression Profiling; Gene Regulatory Networks/genetics; Genome, Human; Genome-Wide Association Study; Genomics; Humans; Quantitative Trait Loci; Transcription Factors/genetics

2.

Spatial and temporal organization of the genome: Current state and future aims of the 4D nucleome project.

Dekker, Job; Alber, Frank; Aufmkolk, Sarah; Beliveau, Brian J; Bruneau, Benoit G; Belmont, Andrew S; Bintu, Lacramioara; Boettiger, Alistair; Calandrelli, Riccardo; Disteche, Christine M; Gilbert, David M; Gregor, Thomas; Hansen, Anders S; Huang, Bo; Huangfu, Danwei; Kalhor, Reza; Leslie, Christina S; Li, Wenbo; Li, Yun; Ma, Jian; Noble, William S; Park, Peter J; Phillips-Cremins, Jennifer E; Pollard, Katherine S; Rafelski, Susanne M; Ren, Bing; Ruan, Yijun; Shav-Tal, Yaron; Shen, Yin; Shendure, Jay; Shu, Xiaokun; Strambio-De-Castillia, Caterina; Vertii, Anastassiia; Zhang, Huaiying; Zhong, Sheng.

Mol Cell ; 83(15): 2624-2640, 2023 08 03.

Article in English | MEDLINE | ID: mdl-37419111

ABSTRACT

The four-dimensional nucleome (4DN) consortium studies the architecture of the genome and the nucleus in space and time. We summarize progress by the consortium and highlight the development of technologies for (1) mapping genome folding and identifying roles of nuclear components and bodies, proteins, and RNA, (2) characterizing nuclear organization with time or single-cell resolution, and (3) imaging of nuclear organization. With these tools, the consortium has provided over 2,000 public datasets. Integrative computational models based on these data are starting to reveal connections between genome structure and function. We then present a forward-looking perspective and outline current aims to (1) delineate dynamics of nuclear architecture at different timescales, from minutes to weeks as cells differentiate, in populations and in single cells, (2) characterize cis-determinants and trans-modulators of genome organization, (3) test functional consequences of changes in cis- and trans-regulators, and (4) develop predictive models of genome structure and function.

Subject(s)

Cell Nucleus; Genome; Genome/genetics; Cell Nucleus/genetics; Cell Nucleus/metabolism; Chromatin/metabolism

3.

A single-cell time-lapse of mouse prenatal development from gastrula to birth.

Qiu, Chengxiang; Martin, Beth K; Welsh, Ian C; Daza, Riza M; Le, Truc-Mai; Huang, Xingfan; Nichols, Eva K; Taylor, Megan L; Fulton, Olivia; O'Day, Diana R; Gomes, Anne Roshella; Ilcisin, Saskia; Srivatsan, Sanjay; Deng, Xinxian; Disteche, Christine M; Noble, William Stafford; Hamazaki, Nobuhiko; Moens, Cecilia B; Kimelman, David; Cao, Junyue; Schier, Alexander F; Spielmann, Malte; Murray, Stephen A; Trapnell, Cole; Shendure, Jay.

Nature ; 626(8001): 1084-1093, 2024 Feb.

Article in English | MEDLINE | ID: mdl-38355799

ABSTRACT

The house mouse (Mus musculus) is an exceptional model system, combining genetic tractability with close evolutionary affinity to humans [1,2]. Mouse gestation lasts only 3 weeks, during which the genome orchestrates the astonishing transformation of a single-cell zygote into a free-living pup composed of more than 500 million cells. Here, to establish a global framework for exploring mammalian development, we applied optimized single-cell combinatorial indexing [3] to profile the transcriptional states of 12.4 million nuclei from 83 embryos, precisely staged at 2- to 6-hour intervals spanning late gastrulation (embryonic day 8) to birth (postnatal day 0). From these data, we annotate hundreds of cell types and explore the ontogenesis of the posterior embryo during somitogenesis and of kidney, mesenchyme, retina and early neurons. We leverage the temporal resolution and sampling depth of these whole-embryo snapshots, together with published data [4-8] from earlier timepoints, to construct a rooted tree of cell-type relationships that spans the entirety of prenatal development, from zygote to birth. Throughout this tree, we systematically nominate genes encoding transcription factors and other proteins as candidate drivers of the in vivo differentiation of hundreds of cell types. Remarkably, the most marked temporal shifts in cell states are observed within one hour of birth and presumably underlie the massive physiological adaptations that must accompany the successful transition of a mammalian fetus to life outside the womb.

Subject(s)

Animals, Newborn; Embryo, Mammalian; Embryonic Development; Gastrula; Single-Cell Analysis; Time-Lapse Imaging; Animals; Female; Mice; Pregnancy; Animals, Newborn/embryology; Animals, Newborn/genetics; Cell Differentiation/genetics; Embryo, Mammalian/cytology; Embryo, Mammalian/embryology; Embryonic Development/genetics; Gastrula/cytology; Gastrula/embryology; Gastrulation/genetics; Kidney/cytology; Kidney/embryology; Mesoderm/cytology; Mesoderm/enzymology; Neurons/cytology; Neurons/metabolism; Retina/cytology; Retina/embryology; Somites/cytology; Somites/embryology; Time Factors; Transcription Factors/genetics; Transcription, Genetic; Organ Specificity/genetics

4.

A Genome-wide Framework for Mapping Gene Regulation via Cellular Genetic Screens.

Gasperini, Molly; Hill, Andrew J; McFaline-Figueroa, José L; Martin, Beth; Kim, Seungsoo; Zhang, Melissa D; Jackson, Dana; Leith, Anh; Schreiber, Jacob; Noble, William S; Trapnell, Cole; Ahituv, Nadav; Shendure, Jay.

Cell ; 176(6): 1516, 2019 Mar 07.

Article in English | MEDLINE | ID: mdl-30849375

5.

Navigating the pitfalls of applying machine learning in genomics.

Whalen, Sean; Schreiber, Jacob; Noble, William S; Pollard, Katherine S.

Nat Rev Genet ; 23(3): 169-181, 2022 03.

Article in English | MEDLINE | ID: mdl-34837041

ABSTRACT

The scale of genetic, epigenomic, transcriptomic, cheminformatic and proteomic data available today, coupled with easy-to-use machine learning (ML) toolkits, has propelled the application of supervised learning in genomics research. However, the assumptions behind the statistical models and performance evaluations in ML software frequently are not met in biological systems. In this Review, we illustrate the impact of several common pitfalls encountered when applying supervised ML in genomics. We explore how the structure of genomics data can bias performance evaluations and predictions. To address the challenges associated with applying cutting-edge ML methods to genomics, we describe solutions and appropriate use cases where ML modelling shows great potential.

Subject(s)

Genomics/methods; Machine Learning; Animals; Genomics/standards; Genomics/trends; Humans; Machine Learning/standards; Models, Statistical; Software

6.

A High-Throughput Screen for Transcription Activation Domains Reveals Their Sequence Features and Permits Prediction by Deep Learning.

Erijman, Ariel; Kozlowski, Lukasz; Sohrabi-Jahromi, Salma; Fishburn, James; Warfield, Linda; Schreiber, Jacob; Noble, William S; Söding, Johannes; Hahn, Steven.

Mol Cell ; 78(5): 890-902.e6, 2020 06 04.

Article in English | MEDLINE | ID: mdl-32416068

ABSTRACT

Acidic transcription activation domains (ADs) are encoded by a wide range of seemingly unrelated amino acid sequences, making it difficult to recognize features that promote their dynamic behavior, "fuzzy" interactions, and target specificity. We screened a large set of random 30-mer peptides for AD function in yeast and trained a deep neural network (ADpred) on the AD-positive and -negative sequences. ADpred identifies known acidic ADs within transcription factors and accurately predicts the consequences of mutations. Our work reveals that strong acidic ADs contain multiple clusters of hydrophobic residues near acidic side chains, explaining why ADs often have a biased amino acid composition. ADs likely use a binding mechanism similar to avidity where a minimum number of weak dynamic interactions are required between activator and target to generate biologically relevant affinity and in vivo function. This mechanism explains the basis for fuzzy binding observed between acidic ADs and targets.

Subject(s)

High-Throughput Screening Assays/methods; Transcription Factors/genetics; Transcriptional Activation/genetics; Amino Acid Sequence/genetics; Basic-Leucine Zipper Transcription Factors/genetics; DNA-Binding Proteins/metabolism; Deep Learning; Protein Binding; Protein Domains/genetics; Saccharomyces cerevisiae/genetics; Saccharomyces cerevisiae Proteins/genetics; Saccharomyces cerevisiae Proteins/metabolism; Trans-Activators/genetics; Trans-Activators/metabolism; Transcription Factors/metabolism; Transcriptional Activation/physiology

7.

Systematic identification of interchromosomal interaction networks supports the existence of specialized RNA factories.

Hristov, Borislav H; Noble, William Stafford; Bertero, Alessandro.

Genome Res ; 2024 Sep 25.

Article in English | MEDLINE | ID: mdl-39322282

ABSTRACT

Most studies of genome organization have focused on intrachromosomal (cis) contacts because they harbor key features such as DNA loops and topologically associating domains. Interchromosomal (trans) contacts have received much less attention, and tools for interrogating potential biologically relevant trans structures are lacking. Here, we develop a computational framework that uses Hi-C data to identify sets of loci that jointly interact in trans. This method, trans-C, initiates probabilistic random walks with restarts from a set of seed loci to traverse an input Hi-C contact network, thereby identifying sets of trans-contacting loci. We validate trans-C in three increasingly complex models of established trans contacts: the Plasmodium falciparum var genes, the mouse olfactory receptor "Greek islands", and the human RBM20 cardiac splicing factory. We then apply trans-C to systematically test the hypothesis that genes coregulated by the same trans-acting element (i.e., a transcription or splicing factor) colocalize in three dimensions to form "RNA factories" that maximize the efficiency and accuracy of RNA biogenesis. We find that many loci with multiple binding sites of the same DNA binding proteins interact with one another in trans, especially those bound by factors with intrinsically disordered domains. Similarly, clustered binding of a subset of RNA-binding proteins correlates with trans interaction of the encoding loci. Intriguingly, we observe that these trans-interacting loci are close to nuclear speckles. Our findings support the existence of trans interacting chromatin domains (TIDs) driven by RNA biogenesis. Trans-C provides an efficient computational framework for studying these and other types of trans interactions, empowering studies of a poorly understood aspect of genome architecture.
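
Illustrative sketch (not part of the record): the abstract above describes random walks with restarts from seed loci over a Hi-C contact network. The following minimal Python example shows the generic random-walk-with-restart technique on a toy contact matrix. It is not the trans-C implementation; the function name, the restart probability of 0.5, and the 5-locus matrix are all hypothetical choices made for illustration.

```python
def random_walk_with_restart(contacts, seeds, restart=0.5, tol=1e-10, max_iter=1000):
    """Return steady-state visit probabilities for a walk that restarts at the seeds.

    contacts: square list-of-lists of non-negative contact counts (toy stand-in
              for a Hi-C matrix; hypothetical data, not from the paper).
    seeds:    set of locus indices where the walk starts and restarts.
    """
    n = len(contacts)
    # Column-normalize so each column is a probability distribution over next loci.
    col_sums = [sum(contacts[i][j] for i in range(n)) for j in range(n)]
    transition = [
        [contacts[i][j] / col_sums[j] if col_sums[j] > 0 else 0.0 for j in range(n)]
        for i in range(n)
    ]
    # Restart distribution: uniform over the seed loci.
    start = [1.0 / len(seeds) if i in seeds else 0.0 for i in range(n)]
    p = start[:]
    for _ in range(max_iter):
        nxt = [
            (1.0 - restart) * sum(transition[i][j] * p[j] for j in range(n))
            + restart * start[i]
            for i in range(n)
        ]
        delta = sum(abs(a - b) for a, b in zip(nxt, p))
        p = nxt
        if delta < tol:  # stop once the distribution no longer changes
            break
    return p


# Toy network: loci 0-2 contact each other strongly; loci 3-4 are weakly attached.
toy_contacts = [
    [0, 9, 8, 1, 0],
    [9, 0, 7, 0, 1],
    [8, 7, 0, 1, 0],
    [1, 0, 1, 0, 2],
    [0, 1, 0, 2, 0],
]
scores = random_walk_with_restart(toy_contacts, seeds={0})
ranked_loci = sorted(range(len(scores)), key=lambda i: -scores[i])
```

In this toy case the loci that contact the seed strongly (1 and 2) receive higher scores than the weakly attached loci (3 and 4), which is the sense in which such a walk can nominate a set of jointly interacting loci. A real analysis would restrict the matrix to interchromosomal contacts and normalize it first; both steps are omitted here.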

8.

DNA-m6A calling and integrated long-read epigenetic and genetic analysis with fibertools.

Jha, Anupama; Bohaczuk, Stephanie C; Mao, Yizi; Ranchalis, Jane; Mallory, Benjamin J; Min, Alan T; Hamm, Morgan O; Swanson, Elliott; Dubocanin, Danilo; Finkbeiner, Connor; Li, Tony; Whittington, Dale; Noble, William Stafford; Stergachis, Andrew Ben; Vollger, Mitchell R.

Genome Res ; 2024 Jun 07.

Article in English | MEDLINE | ID: mdl-38849157

ABSTRACT

Long-read DNA sequencing has recently emerged as a powerful tool for studying both genetic and epigenetic architectures at single-molecule and single-nucleotide resolution. Long-read epigenetic studies encompass both the direct identification of native cytosine methylation as well as the identification of exogenously placed DNA N6-methyladenine (DNA-m6A). However, detecting DNA-m6A modifications using single-molecule sequencing, as well as coprocessing single-molecule genetic and epigenetic architectures, is limited by computational demands and a lack of supporting tools. Here, we introduce fibertools, a state-of-the-art toolkit that features a semisupervised convolutional neural network for fast and accurate identification of m6A-marked bases using PacBio single-molecule long-read sequencing, as well as the coprocessing of long-read genetic and epigenetic data produced using either PacBio or Oxford Nanopore sequencing platforms. We demonstrate accurate DNA-m6A identification (>90% precision and recall) along >20 kilobase long DNA molecules with a ~1,000-fold improvement in speed. In addition, we demonstrate that fibertools can readily integrate genetic and epigenetic data at single-molecule resolution, including the seamless conversion between molecular and reference coordinate systems, allowing for accurate genetic and epigenetic analyses of long-read data within structurally and somatically variable genomic regions.

9.

High-Throughput Single-Cell Sequencing with Linear Amplification.

Yin, Yi; Jiang, Yue; Lam, Kwan-Wood Gabriel; Berletch, Joel B; Disteche, Christine M; Noble, William S; Steemers, Frank J; Camerini-Otero, R Daniel; Adey, Andrew C; Shendure, Jay.

Mol Cell ; 76(4): 676-690.e10, 2019 11 21.

Article in English | MEDLINE | ID: mdl-31495564

ABSTRACT

Conventional methods for single-cell genome sequencing are limited with respect to uniformity and throughput. Here, we describe sci-L3, a single-cell sequencing method that combines combinatorial indexing (sci-) and linear (L) amplification. The sci-L3 method adopts a 3-level (3) indexing scheme that minimizes amplification biases while enabling exponential gains in throughput. We demonstrate the generalizability of sci-L3 with proof-of-concept demonstrations of single-cell whole-genome sequencing (sci-L3-WGS), targeted sequencing (sci-L3-target-seq), and a co-assay of the genome and transcriptome (sci-L3-RNA/DNA). We apply sci-L3-WGS to profile the genomes of >10,000 sperm and sperm precursors from F1 hybrid mice, mapping 86,786 crossovers and characterizing rare chromosome mis-segregation events in meiosis, including instances of whole-genome equational chromosome segregation. We anticipate that sci-L3 assays can be applied to fully characterize recombination landscapes, to couple CRISPR perturbations and measurements of genome stability, and to other goals requiring high-throughput, high-coverage single-cell sequencing.

Subject(s)

Gene Expression Profiling; High-Throughput Nucleotide Sequencing; Nucleic Acid Amplification Techniques; Sequence Analysis, DNA; Sequence Analysis, RNA; Single-Cell Analysis/methods; Whole Genome Sequencing; Animals; Chromosome Segregation; Male; Meiosis/genetics; Mice; Proof of Concept Study; Spermatozoa/physiology; Transcriptome; Workflow

10.

Expanded encyclopaedias of DNA elements in the human and mouse genomes.

Moore, Jill E; Purcaro, Michael J; Pratt, Henry E; Epstein, Charles B; Shoresh, Noam; Adrian, Jessika; Kawli, Trupti; Davis, Carrie A; Dobin, Alexander; Kaul, Rajinder; Halow, Jessica; Van Nostrand, Eric L; Freese, Peter; Gorkin, David U; Shen, Yin; He, Yupeng; Mackiewicz, Mark; Pauli-Behn, Florencia; Williams, Brian A; Mortazavi, Ali; Keller, Cheryl A; Zhang, Xiao-Ou; Elhajjajy, Shaimae I; Huey, Jack; Dickel, Diane E; Snetkova, Valentina; Wei, Xintao; Wang, Xiaofeng; Rivera-Mulia, Juan Carlos; Rozowsky, Joel; Zhang, Jing; Chhetri, Surya B; Zhang, Jialing; Victorsen, Alec; White, Kevin P; Visel, Axel; Yeo, Gene W; Burge, Christopher B; Lécuyer, Eric; Gilbert, David M; Dekker, Job; Rinn, John; Mendenhall, Eric M; Ecker, Joseph R; Kellis, Manolis; Klein, Robert J; Noble, William S; Kundaje, Anshul; Guigó, Roderic; Farnham, Peggy J.

Nature ; 583(7818): 699-710, 2020 07.

Article in English | MEDLINE | ID: mdl-32728249

ABSTRACT

The human and mouse genomes contain instructions that specify RNAs and proteins and govern the timing, magnitude, and cellular context of their production. To better delineate these elements, phase III of the Encyclopedia of DNA Elements (ENCODE) Project has expanded analysis of the cell and tissue repertoires of RNA transcription, chromatin structure and modification, DNA methylation, chromatin looping, and occupancy by transcription factors and RNA-binding proteins. Here we summarize these efforts, which have produced 5,992 new experimental datasets, including systematic determinations across mouse fetal development. All data are available through the ENCODE data portal (https://www.encodeproject.org), including phase II ENCODE [1] and Roadmap Epigenomics [2] data. We have developed a registry of 926,535 human and 339,815 mouse candidate cis-regulatory elements, covering 7.9 and 3.4% of their respective genomes, by integrating selected datatypes associated with gene regulation, and constructed a web-based server (SCREEN; http://screen.encodeproject.org) to provide flexible, user-defined access to this resource. Collectively, the ENCODE data and registry provide an expansive resource for the scientific community to build a better understanding of the organization and function of the human and mouse genomes.

Subject(s)

DNA/genetics; Databases, Genetic; Genome/genetics; Genomics; Molecular Sequence Annotation; Registries; Regulatory Sequences, Nucleic Acid/genetics; Animals; Chromatin/genetics; Chromatin/metabolism; DNA/chemistry; DNA Footprinting; DNA Methylation/genetics; DNA Replication Timing; Deoxyribonuclease I/metabolism; Genome, Human; Histones/metabolism; Humans; Mice; Mice, Transgenic; RNA-Binding Proteins/genetics; Transcription, Genetic/genetics; Transposases/metabolism

11.

A learned embedding for efficient joint analysis of millions of mass spectra.

Bittremieux, Wout; May, Damon H; Bilmes, Jeffrey; Noble, William Stafford.

Nat Methods ; 19(6): 675-678, 2022 06.

Article in English | MEDLINE | ID: mdl-35637305

ABSTRACT

Computational methods that aim to exploit publicly available mass spectrometry repositories rely primarily on unsupervised clustering of spectra. Here we trained a deep neural network in a supervised fashion on the basis of previous assignments of peptides to spectra. The network, called 'GLEAMS', learns to embed spectra in a low-dimensional space in which spectra generated by the same peptide are close to one another. We applied GLEAMS for large-scale spectrum clustering, detecting groups of unidentified, proximal spectra representing the same peptide. We used these clusters to explore the dark proteome of repeatedly observed yet consistently unidentified mass spectra.

Subject(s)

Peptides; Tandem Mass Spectrometry; Algorithms; Cluster Analysis; Neural Networks, Computer; Peptides/chemistry; Proteome/analysis; Tandem Mass Spectrometry/methods

12.

A learned score function improves the power of mass spectrometry database search.

Ananth, Varun; Sanders, Justin; Yilmaz, Melih; Wen, Bo; Oh, Sewoong; Noble, William Stafford.

Bioinformatics ; 40(Suppl 1): i410-i417, 2024 06 28.

Article in English | MEDLINE | ID: mdl-38940129

ABSTRACT

MOTIVATION: One of the core problems in the analysis of protein tandem mass spectrometry data is the peptide assignment problem: determining, for each observed spectrum, the peptide sequence that was responsible for generating the spectrum. Two primary classes of methods are used to solve this problem: database search and de novo peptide sequencing. State-of-the-art methods for de novo sequencing use machine learning methods, whereas most database search engines use hand-designed score functions to evaluate the quality of a match between an observed spectrum and a candidate peptide from the database. We hypothesized that machine learning models for de novo sequencing implicitly learn a score function that captures the relationship between peptides and spectra, and thus may be re-purposed as a score function for database search. Because this score function is trained from massive amounts of mass spectrometry data, it could potentially outperform existing, hand-designed database search tools. RESULTS: To test this hypothesis, we re-engineered Casanovo, which has been shown to provide state-of-the-art de novo sequencing capabilities, to assign scores to given peptide-spectrum pairs. We then evaluated the statistical power of this Casanovo score function, Casanovo-DB, to detect peptides on a benchmark of three mass spectrometry runs from three different species. In addition, we show that re-scoring with the Percolator post-processor benefits Casanovo-DB more than other score functions, further increasing the number of detected peptides.

Subject(s)

Databases, Protein; Peptides; Peptides/chemistry; Machine Learning; Mass Spectrometry/methods; Algorithms; Sequence Analysis, Protein/methods; Tandem Mass Spectrometry/methods

13.

Enhancing Hi-C contact matrices for loop detection with Capricorn: a multiview diffusion model.

Fang, Tangqi; Liu, Yifeng; Woicik, Addie; Lu, Minsi; Jha, Anupama; Wang, Xiao; Li, Gang; Hristov, Borislav; Liu, Zixuan; Xu, Hanwen; Noble, William S; Wang, Sheng.

Bioinformatics ; 40(Suppl 1): i471-i480, 2024 06 28.

Article in English | MEDLINE | ID: mdl-38940142

ABSTRACT

MOTIVATION: High-resolution Hi-C contact matrices reveal the detailed three-dimensional architecture of the genome, but high-coverage experimental Hi-C data are expensive to generate. Simultaneously, chromatin structure analyses struggle with extremely sparse contact matrices. To address this problem, computational methods to enhance low-coverage contact matrices have been developed, but existing methods are largely based on resolution enhancement methods for natural images and hence often employ models that do not distinguish between biologically meaningful contacts, such as loops, and other stochastic contacts. RESULTS: We present Capricorn, a machine learning model for Hi-C resolution enhancement that incorporates small-scale chromatin features as additional views of the input Hi-C contact matrix and leverages a diffusion probability model backbone to generate a high-coverage matrix. We show that Capricorn outperforms the state of the art in a cross-cell-line setting, improving on existing methods by 17% in mean squared error and 26% in F1 score for chromatin loop identification from the generated high-coverage data. We also demonstrate that Capricorn performs well in the cross-chromosome setting and cross-chromosome, cross-cell-line setting, improving the downstream loop F1 score by 14% relative to existing methods. We further show that our multiview idea can also be used to improve several existing methods, HiCARN and HiCNN, indicating the wide applicability of this approach. Finally, we use DNA sequence to validate discovered loops and find that the fraction of CTCF-supported loops from Capricorn is similar to those identified from the high-coverage data. Capricorn is a powerful Hi-C resolution enhancement method that enables scientists to find chromatin features that cannot be identified in the low-coverage contact matrix.
AVAILABILITY AND IMPLEMENTATION: Implementation of Capricorn and source code for reproducing all figures in this paper are available at https://github.com/CHNFTQ/Capricorn.

Subject(s)

Chromatin; Machine Learning; Chromatin/chemistry; Chromatin/metabolism; Humans; Computational Biology/methods; Algorithms; Software

14.

Target-decoy false discovery rate estimation using Crema.

Lin, Andy; See, Donavan; Fondrie, William E; Keich, Uri; Noble, William Stafford.

Proteomics ; 24(8): e2300084, 2024 Apr.

Article in English | MEDLINE | ID: mdl-38380501

ABSTRACT

Assigning statistical confidence estimates to discoveries produced by a tandem mass spectrometry proteomics experiment is critical to enabling principled interpretation of the results and assessing the cost/benefit ratio of experimental follow-up. The most common technique for computing such estimates is to use target-decoy competition (TDC), in which observed spectra are searched against a database of real (target) peptides and a database of shuffled or reversed (decoy) peptides. TDC procedures for estimating the false discovery rate (FDR) at a given score threshold have been developed for application at the level of spectra, peptides, or proteins. Although these techniques are relatively straightforward to implement, it is common in the literature to skip over the implementation details or even to make mistakes in how the TDC procedures are applied in practice. Here we present Crema, an open-source Python tool that implements several TDC methods of spectrum-, peptide- and protein-level FDR estimation. Crema is compatible with a variety of existing database search tools and provides a straightforward way to obtain robust FDR estimates.

Subject(s)

Algorithms; Peptides; Databases, Protein; Peptides/chemistry; Proteins/analysis; Proteomics/methods
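
Illustrative sketch (not part of the record): the abstract above summarizes target-decoy competition (TDC) for estimating the false discovery rate at a score threshold. The minimal Python example below shows one common form of spectrum-level TDC, with the estimate (decoys + 1) / targets and q-values taken as a running minimum. It does not use or reproduce Crema's API; the function names, the tie-handling rule, and the scores are hypothetical.

```python
def compete(target_scores, decoy_scores):
    """For each spectrum, keep the better of its best target and best decoy match.

    Returns a list of (score, is_target) winners. Ties are dropped in this
    sketch (an assumption; real tools typically break ties at random).
    """
    winners = []
    for t, d in zip(target_scores, decoy_scores):
        if t > d:
            winners.append((t, True))
        elif d > t:
            winners.append((d, False))
    return winners


def tdc_qvalues(winners):
    """Return (score, is_target, q_value) tuples sorted by descending score."""
    ranked = sorted(winners, key=lambda w: -w[0])
    fdrs = []
    targets = decoys = 0
    for _, is_target in ranked:
        if is_target:
            targets += 1
        else:
            decoys += 1
        # Estimated FDR among targets scoring at or above this threshold,
        # with the +1 correction and a cap at 1.
        fdrs.append(min(1.0, (decoys + 1) / max(targets, 1)))
    # q-value: smallest FDR estimate at this threshold or any more permissive one.
    qvals = []
    running = 1.0
    for fdr in reversed(fdrs):
        running = min(running, fdr)
        qvals.append(running)
    qvals.reverse()
    return [(s, t, q) for (s, t), q in zip(ranked, qvals)]


# Hypothetical best target and decoy scores for six spectra.
target_scores = [9.0, 8.0, 7.0, 6.0, 2.0, 5.0]
decoy_scores = [1.0, 2.0, 3.0, 1.5, 4.0, 0.5]
results = tdc_qvalues(compete(target_scores, decoy_scores))
accepted = [s for s, is_target, q in results if is_target and q <= 0.25]
```

With only six spectra the estimates are coarse: the +1 correction alone keeps the smallest attainable q-value at 0.2 here. Peptide- and protein-level estimation, which the abstract also mentions, need additional aggregation steps that this sketch omits.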

15.

Analysis of Tandem Mass Spectrometry Data with CONGA: Combining Open and Narrow Searches with Group-Wise Analysis.

Freestone, Jack; Noble, William S; Keich, Uri.

J Proteome Res ; 23(6): 1894-1906, 2024 Jun 07.

Article in English | MEDLINE | ID: mdl-38652578

ABSTRACT

Searching for tandem mass spectrometry proteomics data against a database is a well-established method for assigning peptide sequences to observed spectra but typically cannot identify peptides harboring unexpected post-translational modifications (PTMs). Open modification searching aims to address this problem by allowing a spectrum to match a peptide even if the spectrum's precursor mass differs from the peptide mass. However, expanding the search space in this way can lead to a loss of statistical power to detect peptides. We therefore developed a method, called CONGA (combining open and narrow searches with group-wise analysis), that takes into account results from both types of searches (a traditional "narrow window" search and an open modification search) while carrying out rigorous false discovery rate control. The result is an algorithm that provides the best of both worlds: the ability to detect unexpected PTMs without a concomitant loss of power to detect unmodified peptides.

Subject(s)

Algorithms; Databases, Protein; Protein Processing, Post-Translational; Proteomics; Tandem Mass Spectrometry; Tandem Mass Spectrometry/methods; Proteomics/methods; Peptides/analysis; Peptides/chemistry; Humans; Software; Amino Acid Sequence

16.

Reinvestigating the Correctness of Decoy-Based False Discovery Rate Control in Proteomics Tandem Mass Spectrometry.

Freestone, Jack; Noble, William Stafford; Keich, Uri.

J Proteome Res ; 23(6): 1907-1914, 2024 Jun 07.

Article in English | MEDLINE | ID: mdl-38687997

ABSTRACT

Traditional database search methods for the analysis of bottom-up proteomics tandem mass spectrometry (MS/MS) data are limited in their ability to detect peptides with post-translational modifications (PTMs). Recently, "open modification" database search strategies, in which the requirement that the mass of the database peptide closely matches the observed precursor mass is relaxed, have become popular as ways to find a wider variety of types of PTMs. Indeed, in one study, Kong et al. reported that the open modification search tool MSFragger can achieve higher statistical power to detect peptides than a traditional "narrow window" database search. We investigated this claim empirically and, in the process, uncovered a potential general problem with false discovery rate (FDR) control in the machine learning postprocessors Percolator and PeptideProphet. This problem might have contributed to Kong et al.'s report that their empirical results suggest that FDR control in the narrow window setting might generally be compromised. Indeed, reanalyzing the same data while using a more standard form of target-decoy competition-based FDR control, we found that, after accounting for chimeric spectra as well as for the inherent difference in the number of candidates in open and narrow searches, the data does not provide sufficient evidence that FDR control in proteomics MS/MS database search is inherently problematic.

Subject(s)

Databases, Protein; Protein Processing, Post-Translational; Proteomics; Tandem Mass Spectrometry; Tandem Mass Spectrometry/methods; Proteomics/methods; Peptides/analysis; Peptides/chemistry; Machine Learning; Humans; Algorithms; Software

17.

Accounting for Digestion Enzyme Bias in Casanovo.

Melendez, Carlo; Sanders, Justin; Yilmaz, Melih; Bittremieux, Wout; Fondrie, William E; Oh, Sewoong; Noble, William Stafford.

J Proteome Res ; 23(10): 4761-4769, 2024 Oct 04.

Article in English | MEDLINE | ID: mdl-39213590

ABSTRACT

A key parameter of any bottom-up proteomics mass spectrometry experiment is the identity of the enzyme that is used to digest proteins in the sample into peptides. The Casanovo de novo sequencing model was trained using data that was generated with trypsin digestion; consequently, the model prefers to predict peptides that end with the amino acids "K" or "R". This bias is desirable when Casanovo is used to analyze data that was also generated using trypsin but can be problematic if the data was generated using some other digestion enzyme. In this work, we modify Casanovo to take as input the identity of the digestion enzyme alongside each observed spectrum. We then train Casanovo with data generated by using several different enzymes, and we demonstrate that the resulting model successfully learns to capture enzyme-specific behavior. However, we find, surprisingly, that this new model does not yield a significant improvement in sequencing accuracy relative to a model trained without enzyme information but using the same training set. This observation may have important implications for future attempts to make use of experimental metadata in de novo sequencing models.

Subject(s)

Proteomics, Trypsin, Proteomics/methods, Trypsin/metabolism, Trypsin/chemistry, Mass Spectrometry/methods, Peptides/metabolism, Peptides/chemistry, Proteolysis

18.

A High-Throughput PIXUL-Matrix-Based Toolbox to Profile Frozen and Formalin-Fixed Paraffin-Embedded Tissues Multiomes.

Mar, Daniel; Babenko, Ilona M; Zhang, Ran; Noble, William Stafford; Denisenko, Oleg; Vaisar, Tomas; Bomsztyk, Karol.

Lab Invest ; 104(1): 100282, 2024 01.

Article in English | MEDLINE | ID: mdl-37924947

ABSTRACT

Large-scale high-dimensional multiomics studies are essential to unravel molecular complexity in health and disease. We developed an integrated system for tissue sampling (CryoGrid), analyte preparation (PIXUL), and downstream multiomic analysis in a 96-well plate format (Matrix), MultiomicsTracks96, which we used to interrogate matched frozen and formalin-fixed paraffin-embedded (FFPE) mouse organs. Using this system, we generated 8-dimensional omics data sets encompassing 4 molecular layers of intracellular organization: epigenome (H3K27Ac, H3K4me3, RNA polymerase II, and 5mC levels), transcriptome (messenger RNA levels), epitranscriptome (m6A levels), and proteome (protein levels) in brain, heart, kidney, and liver. There was a high correlation between data from matched frozen and FFPE organs. The Segway genome segmentation algorithm applied to epigenomic profiles confirmed known organ-specific superenhancers in both FFPE and frozen samples. Linear regression analysis showed that proteomic profiles, known to be poorly correlated with transcriptomic data, can be more accurately predicted by the full suite of multiomics data than by epigenomic, transcriptomic, or epitranscriptomic measurements individually.

Subject(s)

Formaldehyde, Proteomics, Mice, Animals, Fixatives, Tissue Fixation/methods, Proteomics/methods, Paraffin Embedding/methods

19.

Comprehensive characterization of tissue-specific chromatin accessibility in L2 Caenorhabditis elegans nematodes.

Durham, Timothy J; Daza, Riza M; Gevirtzman, Louis; Cusanovich, Darren A; Bolonduro, Olubusayo; Noble, William Stafford; Shendure, Jay; Waterston, Robert H.

Genome Res ; 31(10): 1952-1969, 2021 10.

Article in English | MEDLINE | ID: mdl-33888511

ABSTRACT

Recently developed single-cell technologies allow researchers to characterize cell states at ever greater resolution and scale. Caenorhabditis elegans is a particularly tractable system for studying development, and recent single-cell RNA-seq studies characterized the gene expression patterns for nearly every cell type in the embryo and at the second larval stage (L2). Gene expression patterns give insight into gene function and the biochemical state of different cell types; recent advances in other single-cell genomics technologies can now also characterize, at single-cell resolution, the regulatory context of the genome that gives rise to these gene expression levels. To explore the regulatory DNA of individual cell types in C. elegans, we collected single-cell chromatin accessibility data using the sci-ATAC-seq assay in L2 larvae to match the available single-cell RNA-seq data set. By using a novel implementation of the latent Dirichlet allocation algorithm, we identify 37 clusters of cells that correspond to different cell types in the worm, providing new maps of putative cell type-specific gene regulatory sites, with promise for better understanding of cellular differentiation and gene regulation.

Subject(s)

Caenorhabditis elegans, Chromatin, Animals, Caenorhabditis elegans/genetics, Chromatin/genetics, Chromatin Immunoprecipitation Sequencing, DNA/genetics, Gene Expression Regulation

20.

Inference of 3D genome architecture by modeling overdispersion of Hi-C data.

Varoquaux, Nelle; Noble, William S; Vert, Jean-Philippe.

Bioinformatics ; 39(1), 2023 01 01.

Article in English | MEDLINE | ID: mdl-36594573

ABSTRACT

MOTIVATION: We address the challenge of inferring a consensus 3D model of genome architecture from Hi-C data. Existing approaches most often rely on a two-step algorithm: first, convert the contact counts into distances, then optimize an objective function akin to multidimensional scaling (MDS) to infer a 3D model. Other approaches use a maximum likelihood approach, modeling the contact counts between two loci as a Poisson random variable whose intensity is a decreasing function of the distance between them. However, a Poisson model of contact counts implies that the variance of the data is equal to the mean, a relationship that is often too restrictive to properly model count data.

RESULTS: We first confirm the presence of overdispersion in several real Hi-C datasets, and we show that the overdispersion arises even in simulated datasets. We then propose a new model, called Pastis-NB, where we replace the Poisson model of contact counts by a negative binomial one, which is parametrized by a mean and a separate dispersion parameter. The dispersion parameter allows the variance to be adjusted independently from the mean, thus better modeling overdispersed data. We compare the results of Pastis-NB to those of several previously published algorithms, both MDS-based and statistical methods. We show that the negative binomial inference yields more accurate structures on simulated data, and more robust structures than other models across real Hi-C replicates and across different resolutions.

AVAILABILITY AND IMPLEMENTATION: A Python implementation of Pastis-NB is available at https://github.com/hiclib/pastis under the BSD license.

SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
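The difference between the two count models described in the abstract above can be made concrete with a small sketch. It assumes one common negative binomial parametrization, in which the variance equals the mean plus the dispersion times the squared mean; the parametrization used in Pastis-NB may differ, and the function names below are hypothetical and not part of the Pastis package.

```python
def poisson_variance(mean):
    """Under a Poisson model the variance is forced to equal the mean."""
    return mean


def negative_binomial_variance(mean, dispersion):
    """Variance under the assumed parametrization: mean + dispersion * mean**2.

    A dispersion of 0 recovers the Poisson case; larger values allow the
    variance to exceed the mean (overdispersion).
    """
    return mean + dispersion * mean ** 2


# Hypothetical contact count with mean 10 and dispersion 0.5.
poisson_var = poisson_variance(10.0)
nb_var = negative_binomial_variance(10.0, 0.5)
```

With a mean of 10, the Poisson model fixes the variance at 10, whereas the negative binomial model with dispersion 0.5 allows a variance of 60, which is the extra freedom the abstract refers to.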

Subject(s)

Algorithms, Genome, Likelihood Functions
