1.
Bioinformatics ; 39(2), 2023 02 03.
Article in English | MEDLINE | ID: mdl-36702456

ABSTRACT

MOTIVATION: Interpretation of newly acquired mass spectrometry data can be improved by identifying, from an online repository, previous mass spectrometry runs that resemble the new data. However, this retrieval task requires computing the similarity between an arbitrary pair of mass spectrometry runs. This is particularly challenging for runs acquired using different experimental protocols. RESULTS: We propose a method, MS1Connect, that calculates the similarity between a pair of runs by examining only the intact peptide (MS1) scans, and we show evidence that the MS1Connect score is accurate. Specifically, we show that MS1Connect outperforms several baseline methods on the task of predicting the species from which a given proteomics sample originated. In addition, we show that MS1Connect scores are highly correlated with similarities computed from fragment (MS2) scans, even though these data are not used by MS1Connect. AVAILABILITY AND IMPLEMENTATION: The MS1Connect software is available at https://github.com/bmx8177/MS1Connect. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.


Subject(s)
Peptides , Software , Mass Spectrometry , Peptides/chemistry , Proteomics/methods
2.
Bioinformatics ; 38(Suppl_2): ii148-ii154, 2022 09 16.
Article in English | MEDLINE | ID: mdl-36124797

ABSTRACT

MOTIVATION: A wide variety of experimental methods are available to characterize different properties of single cells in a complex biosample. However, because these measurement techniques are typically destructive, researchers are often presented with complementary measurements from disjoint subsets of cells, providing a fragmented view of the cell's biological processes. This creates a need for computational tools capable of integrating disjoint multi-omics data. Because different measurements typically do not share any features, the problem requires the integration to be done in unsupervised fashion. Recently, several methods have been proposed that project the cell measurements into a common latent space and attempt to align the corresponding low-dimensional manifolds. RESULTS: In this study, we present an approach, Synmatch, which produces a direct matching of the cells between modalities by exploiting information about neighborhood structure in each modality. Synmatch relies on the intuition that cells which are close in one measurement space should be close in the other as well. This allows us to formulate the matching problem as a constrained supermodular optimization problem over neighborhood structures that can be solved efficiently. We show that our approach successfully matches cells in small real multi-omics datasets and performs favorably when compared with recently published state-of-the-art methods. Further, we demonstrate that Synmatch is capable of scaling to large datasets of thousands of cells. AVAILABILITY AND IMPLEMENTATION: The Synmatch code and data used in this manuscript are available at https://github.com/Noble-Lab/synmatch.


Subject(s)
Cells
3.
Bioinformatics ; 34(4): 669-671, 2018 02 15.
Article in English | MEDLINE | ID: mdl-29028889

ABSTRACT

Summary: Segway performs semi-automated genome annotation, discovering joint patterns across multiple genomic signal datasets. We discuss a major new version of Segway and highlight its ability to model data with substantially greater accuracy. Major enhancements in Segway 2.0 include the ability to model data with a mixture of Gaussians, enabling capture of arbitrarily complex signal distributions, and minibatch training, leading to better learned parameters. Availability and implementation: Segway and its source code are freely available for download at http://segway.hoffmanlab.org. We have made available scripts (https://doi.org/10.5281/zenodo.802939) and datasets (https://doi.org/10.5281/zenodo.802906) for this paper's analysis. Contact: michael.hoffman@utoronto.ca. Supplementary information: Supplementary data are available at Bioinformatics online.
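The mixture-of-Gaussians idea mentioned above can be sketched as follows. This is a minimal illustration of a one-dimensional Gaussian mixture log-density, not Segway's implementation; the function names and the parameter layout (`weights`, `means`, `variances`) are assumptions made for this example.

```python
import math
from typing import Sequence


def gaussian_log_pdf(x: float, mean: float, var: float) -> float:
    """Log-density of a univariate Gaussian with the given mean and variance."""
    return -0.5 * (math.log(2.0 * math.pi * var) + (x - mean) ** 2 / var)


def mixture_log_pdf(
    x: float,
    weights: Sequence[float],
    means: Sequence[float],
    variances: Sequence[float],
) -> float:
    """Log-density of a Gaussian mixture, computed with the log-sum-exp trick.

    Hypothetical helper for illustration: weights are assumed to be positive
    and to sum to one.
    """
    logs = [
        math.log(w) + gaussian_log_pdf(x, m, v)
        for w, m, v in zip(weights, means, variances)
    ]
    top = max(logs)
    # Subtracting the maximum before exponentiating avoids underflow.
    return top + math.log(sum(math.exp(value - top) for value in logs))
```

With a single component the mixture reduces to an ordinary Gaussian; adding components lets the density take on multimodal or skewed shapes that one Gaussian cannot represent.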


Subject(s)
Genomics/methods , Molecular Sequence Annotation/methods , Sequence Analysis, DNA/methods , Software , Eukaryota/genetics
4.
Proteins ; 86(4): 454-466, 2018 04.
Article in English | MEDLINE | ID: mdl-29345009

ABSTRACT

Selecting a non-redundant representative subset of sequences is a common step in many bioinformatics workflows, such as the creation of non-redundant training sets for sequence and structural models or selection of "operational taxonomic units" from metagenomics data. Previous methods for this task, such as CD-HIT, PISCES, and UCLUST, apply a heuristic threshold-based algorithm that has no theoretical guarantees. We propose a new approach based on submodular optimization. Submodular optimization, a discrete analogue to continuous convex optimization, has been used with great success for other representative set selection problems. We demonstrate that the submodular optimization approach results in representative protein sequence subsets with greater structural diversity than sets chosen by existing methods, using as a gold standard the SCOPe library of protein domain structures. In this setting, submodular optimization consistently yields protein sequence subsets that include more SCOPe domain families than sets of the same size selected by competing approaches. We also show how the optimization framework allows us to design a mixture objective function that performs well for both large and small representative sets. The framework we describe is the best possible in polynomial time (under some assumptions), and it is flexible and intuitive because it applies a suite of generic methods to optimize one of a variety of objective functions.
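As a rough illustration of representative subset selection by submodular optimization, the sketch below greedily maximizes a facility-location objective. The choice of objective, the function names, and the toy similarity matrix are assumptions made for this example; they are not taken from the paper or its software.

```python
from typing import List, Sequence


def facility_location(sim: Sequence[Sequence[float]], chosen: List[int]) -> float:
    """Sum over all items of their similarity to the closest chosen representative."""
    if not chosen:
        return 0.0
    return sum(max(row[j] for j in chosen) for row in sim)


def greedy_select(sim: Sequence[Sequence[float]], k: int) -> List[int]:
    """Repeatedly add the item with the largest marginal gain until k are chosen."""
    chosen: List[int] = []
    for _ in range(min(k, len(sim))):
        current = facility_location(sim, chosen)
        best_item, best_gain = -1, float("-inf")
        for cand in range(len(sim)):
            if cand in chosen:
                continue
            gain = facility_location(sim, chosen + [cand]) - current
            if gain > best_gain:
                best_item, best_gain = cand, gain
        chosen.append(best_item)
    return chosen


# Toy similarity matrix (made-up values): items 0-1 and items 2-3 form two tight groups.
SIM = [
    [1.0, 0.9, 0.1, 0.1],
    [0.9, 1.0, 0.1, 0.2],
    [0.1, 0.1, 1.0, 0.8],
    [0.1, 0.2, 0.8, 1.0],
]
representatives = greedy_select(SIM, 2)  # picks one item from each group
```

For a monotone submodular objective under a cardinality constraint, this greedy procedure is known to reach at least a (1 - 1/e) fraction of the optimal value, which is the kind of guarantee that threshold-based heuristics lack.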


Subject(s)
Algorithms , Proteins/chemistry , Sequence Analysis, Protein/methods , Cluster Analysis , Proteomics/methods
5.
Genome Res ; 25(4): 544-57, 2015 Apr.
Article in English | MEDLINE | ID: mdl-25677182

ABSTRACT

The genomic neighborhood of a gene influences its activity, a behavior that is attributable in part to domain-scale regulation. Previous genomic studies have identified many types of regulatory domains. However, due to the difficulty of integrating genomics data sets, the relationships among these domain types are poorly understood. Semi-automated genome annotation (SAGA) algorithms facilitate human interpretation of heterogeneous collections of genomics data by simultaneously partitioning the human genome and assigning labels to the resulting genomic segments. However, existing SAGA methods cannot integrate inherently pairwise chromatin conformation data. We developed a new computational method, called graph-based regularization (GBR), for expressing a pairwise prior that encourages certain pairs of genomic loci to receive the same label in a genome annotation. We used GBR to exploit chromatin conformation information during genome annotation by encouraging positions that are close in 3D to occupy the same type of domain. Using this approach, we produced a model of chromatin domains in eight human cell types, thereby revealing the relationships among known domain types. Through this model, we identified clusters of tightly regulated genes expressed in only a small number of cell types, which we term "specific expression domains." We found that domain boundaries marked by promoters and CTCF motifs are consistent between cell types even when domain activity changes. Finally, we showed that GBR can be used to transfer information from well-studied cell types to less well-characterized cell types during genome annotation, making it possible to produce high-quality annotations of the hundreds of cell types with limited available data.


Subject(s)
Chromatin/genetics , Computational Biology/methods , Genomics/methods , Molecular Conformation , Molecular Sequence Annotation/methods , Algorithms , Amino Acid Motifs/genetics , Cell Line, Tumor , Chromatin/metabolism , Chromosome Structures , Genome, Human/genetics , HeLa Cells , Hep G2 Cells , Human Umbilical Vein Endothelial Cells , Humans , Promoter Regions, Genetic/genetics
6.
Nucleic Acids Res ; 41(2): 827-41, 2013 Jan.
Article in English | MEDLINE | ID: mdl-23221638

ABSTRACT

The ENCODE Project has generated a wealth of experimental information mapping diverse chromatin properties in several human cell lines. Although each such data track is independently informative toward the annotation of regulatory elements, their interrelations contain much richer information for the systematic annotation of regulatory elements. To uncover these interrelations and to generate an interpretable summary of the massive datasets of the ENCODE Project, we apply unsupervised learning methodologies, converting dozens of chromatin datasets into discrete annotation maps of regulatory regions and other chromatin elements across the human genome. These methods rediscover and summarize diverse aspects of chromatin architecture, elucidate the interplay between chromatin activity and RNA transcription, and reveal that a large proportion of the genome lies in a quiescent state, even across multiple cell types. The resulting annotations of non-coding regulatory elements correlate strongly with mammalian evolutionary constraint and provide an unbiased approach for evaluating metrics of evolutionary constraint in humans. Lastly, we use the regulatory annotations to revisit previously uncharacterized disease-associated loci, resulting in focused, testable hypotheses through the lens of the chromatin landscape.


Subject(s)
Chromatin/chemistry , Genome, Human , Molecular Sequence Annotation , Regulatory Elements, Transcriptional , Enhancer Elements, Genetic , Genome-Wide Association Study , Humans , Insulator Elements , Promoter Regions, Genetic , Proteins/genetics , Terminator Regions, Genetic , Transcription, Genetic
7.
Genome Biol ; 20(1): 180, 2019 08 28.
Article in English | MEDLINE | ID: mdl-31462275

ABSTRACT

Semi-automated genome annotation methods such as Segway take as input a set of genome-wide measurements, such as measurements of histone modification or DNA accessibility, and output an annotation of genomic activity in the target cell type. Here we present annotations of 164 human cell types using 1615 data sets. To produce these annotations, we automated the label interpretation step, yielding a fully automated annotation strategy. Using these annotations, we developed a measure of the importance of each genomic position called the "conservation-associated activity score." We further combined all annotations into a single, cell type-agnostic encyclopedia that catalogs all human regulatory elements.


Subject(s)
DNA/genetics , Databases, Genetic , Molecular Sequence Annotation , Algorithms , Automation , Cell Line , Humans , Machine Learning , Phenotype , Transcription, Genetic
9.
Sci Rep ; 7(1): 16943, 2017 12 05.
Article in English | MEDLINE | ID: mdl-29208983

ABSTRACT

A comprehensive characterization of tumor genetic heterogeneity is critical for understanding how cancers evolve and escape treatment. Although many algorithms have been developed for capturing tumor heterogeneity, they are designed for analyzing either a single type of genomic aberration or individual biopsies. Here we present THEMIS (Tumor Heterogeneity Extensible Modeling via an Integrative System), which allows for the joint analysis of different types of genomic aberrations from multiple biopsies taken from the same patient, using a dynamic graphical model. Simulation experiments demonstrate higher accuracy of THEMIS over its ancestor, TITAN. The heterogeneity analysis results from THEMIS are validated with single cell DNA sequencing from a clinical tumor biopsy. When THEMIS is used to analyze tumor heterogeneity among multiple biopsies from the same patient, it helps to reveal the mutation accumulation history, track cancer progression, and identify the mutations related to treatment resistance. We implement our model via an extensible modeling platform, which makes our approach open, reproducible, and easy for others to extend.


Subject(s)
Biopsy/methods , Models, Biological , Neoplasms/pathology , Triple Negative Breast Neoplasms/drug therapy , Triple Negative Breast Neoplasms/genetics , Algorithms , Bayes Theorem , Clonal Evolution , Computational Biology/methods , DNA Copy Number Variations , Female , Humans , Mutation , Neoplasms/genetics , Reproducibility of Results , Sequence Analysis, DNA , Single-Cell Analysis , Transcriptome , Triple Negative Breast Neoplasms/pathology
10.
Genome Biol ; 17(1): 229, 2016 11 15.
Article in English | MEDLINE | ID: mdl-27846892

ABSTRACT

Due to the high cost of sequencing-based genomics assays such as ChIP-seq and DNase-seq, the epigenomic characterization of a cell type is typically carried out using a small panel of assay types. Deciding a priori which assays to perform is, thus, a critical step in many studies. We present the submodular selection of assays (SSA), a method for choosing a diverse panel of genomic assays that leverages methods from submodular optimization. More generally, this application serves as a model for how submodular optimization can be applied to other discrete problems in biology.


Subject(s)
Genome , Genomics/methods , Binding Sites , Chromatin Immunoprecipitation , Databases, Nucleic Acid , Epigenomics/methods , Genomics/standards , High-Throughput Nucleotide Sequencing , Histones/metabolism , Humans , Protein Binding , Transcription Factors/metabolism