ABSTRACT
BACKGROUND: Accurate identification of Transcriptional Regulator binding locations is essential for the analysis of genomic regions, including Cis-Regulatory Elements (CREs). The customary NGS approaches, predominantly ChIP-seq, can be obscured by data anomalies and biases that are difficult to detect without supervision. RESULTS: Here, we develop a method that leverages the usual combinations among many experimental series to mark such atypical peaks. We use deep learning to perform a lossy compression of the genomic regions' representations with multiview convolutions. Using artificial data, we show that our method correctly identifies groups of correlating series and evaluates CREs according to group completeness. It is then applied to the large volume of curated ChIP-seq data in the ReMap database. We show that peaks lacking known biological correlators are singled out and less confirmed in real data. We propose normalization approaches useful in interpreting black-box models. CONCLUSION: Our approach detects peaks that are less corroborated than average. It can be extended to other similar problems, and can be interpreted to identify correlation groups. It is implemented in an open-source tool called atyPeak.
Subject(s)
Chromatin Immunoprecipitation Sequencing; Genomics; Regulatory Sequences, Nucleic Acid
ABSTRACT
Most epigenetic marks, such as Transcriptional Regulators or histone marks, are biological objects known to work together in n-wise complexes. A suitable way to infer such functional associations between them is to study the overlaps of the corresponding genomic regions. However, the problem of the statistical significance of n-wise overlaps of genomic features is seldom tackled, which prevents rigorous studies of n-wise interactions. We introduce OLOGRAM-MODL, which considers overlaps between n ≥ 2 sets of genomic regions and computes their statistical mutual enrichment by Monte Carlo fitting of a Negative Binomial distribution, yielding higher-resolution P-values. An optional machine learning method is proposed to find complexes of interest, using a new itemset mining algorithm based on dictionary learning that is robust to the noise inherent in biological assays. The overall approach is implemented through an easy-to-use command-line interface for workflow integration, and a visual tree-based representation of the results suited for explainability. The viability of the method is experimentally studied using both artificial and biological data. This approach is accessible through the command-line interface of the pygtftk toolkit, available on Bioconda and from https://github.com/dputhier/pygtftk.
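The idea of fitting a Negative Binomial to shuffled overlaps can be illustrated with a minimal sketch for the simplest case of two region sets. This is not the pygtftk implementation or its API. All function names below are hypothetical, and the sketch makes several simplifying assumptions that the abstract does not state: a single linear chromosome, uniform random repositioning of regions as the null model, the number of overlapping regions as the statistic, and a method-of-moments fit.

```python
import math
import random
from statistics import mean, variance


def count_overlaps(a, b):
    """Count regions of b overlapping at least one region of a (half-open intervals)."""
    return sum(any(s < a_end and a_start < e for a_start, a_end in a) for s, e in b)


def shuffle_regions(regions, genome_len, rng):
    """Assumed null model: move each region to a uniform random start, keeping its length."""
    out = []
    for s, e in regions:
        length = e - s
        start = rng.randrange(0, genome_len - length + 1)
        out.append((start, start + length))
    return out


def nb_upper_tail(k_obs, r, p):
    """P(X >= k_obs) for a Negative Binomial with pmf C(k+r-1, k) * p**r * (1-p)**k."""
    nb_mean = r * (1 - p) / p
    total, k = 0.0, k_obs
    while True:
        log_pmf = (math.lgamma(k + r) - math.lgamma(k + 1) - math.lgamma(r)
                   + r * math.log(p) + k * math.log1p(-p))
        term = math.exp(log_pmf)
        total += term
        # Stop once we are past the mean and further terms no longer matter.
        if k > nb_mean and term <= 1e-16 * total:
            return min(total, 1.0)
        k += 1


def overlap_enrichment(a, b, genome_len, n_shuffles=200, seed=0):
    """Return (observed overlap count, enrichment P-value from a fitted Negative Binomial)."""
    rng = random.Random(seed)
    observed = count_overlaps(a, b)
    null = [count_overlaps(a, shuffle_regions(b, genome_len, rng))
            for _ in range(n_shuffles)]
    # Method-of-moments fit. The floors are an assumption of this sketch: they keep
    # the fit defined when the null is all zeros or not overdispersed (near-Poisson).
    m = max(float(mean(null)), 1.0 / n_shuffles)
    v = max(float(variance(null)), m * 1.001)
    p = m / v
    r = m * m / (v - m)
    return observed, nb_upper_tail(observed, r, p)


if __name__ == "__main__":
    genome_len = 100_000
    a = [(i * 5000, i * 5000 + 200) for i in range(20)]
    b = [(s + 50, e + 50) for s, e in a]  # every region of b overlaps a
    print(overlap_enrichment(a, b, genome_len))
```

Because the P-value is read from the fitted distribution's tail rather than from the fraction of shuffles that reach the observed count, it is not bounded below by 1/n_shuffles, which is one way to understand the gain in resolution.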
ABSTRACT
MOTIVATION: Complex biological functions emerge from interactions between proteins in stable supra-molecular assemblies and/or through transitory contacts. Most of the time, the protein partners in these assemblies are composed of one or several domains that exhibit different biochemical functions. Thus, the study of cellular processes requires the identification of the different functional units and their integration into an interaction network; such complexes are referred to as integrated systems. To exploit the growing volume of released data efficiently, automated bioinformatics strategies are needed to identify, reconstruct and model such systems. For that purpose, we have developed a knowledge warehouse dedicated to the representation and acquisition of bacterial integrated systems involved in exchanges between the bacterial cell and its environment. RESULTS: ISYMOD is a knowledge warehouse that consistently integrates, in the same environment, the data and the methods used for their acquisition. This is achieved through the construction of (1) a domain knowledge base (DKB) devoted to storing knowledge about the systems, their functional specificities, their partners and how they are related, and (2) a methodological knowledge base (MKB) that describes the task layout used to identify and reconstruct functional integrated systems. The DKB is instantiated by solving the tasks of the MKB, while some tasks in turn need instances of the DKB in order to be solved. AROM, an object-based knowledge representation system, has been used to design the DKB, and its task manager, AROMTasks, to develop the MKB. In this study, two integrated systems, ABC transporters and two-component systems, both involved in the adaptation of a bacterial cell to its biotope, have been used to evaluate the feasibility of the approach.