ABSTRACT
MOTIVATION: Circulating cell-free DNA (cfDNA) is widely explored as a noninvasive biomarker for cancer screening and diagnosis. The ability to decode the cells of origin in cfDNA would provide biological insights into pathophysiological mechanisms, aiding in cancer characterization and directing clinical management and follow-up. RESULTS: We developed a DNA methylation signature-based deconvolution algorithm, MetDecode, for cancer tissue origin identification. We built a reference atlas exploiting de novo and published whole-genome methylation sequencing data for colorectal, breast, ovarian, and cervical cancer, and blood-cell-derived entities. MetDecode models the contributors absent from the atlas with methylation patterns learnt on the fly from the input cfDNA methylation profiles. In addition, our model accounts for the coverage of each marker region to alleviate potential sources of noise. In silico experiments showed a limit of detection down to 2.88% tumor tissue contribution in cfDNA. MetDecode produced Pearson correlation coefficients above 0.95 and outperformed other methods in simulations (P < 0.001; one-sided t-test). In plasma cfDNA profiles from cancer patients, MetDecode assigned the correct tissue of origin in 84.2% of cases. In conclusion, MetDecode can unravel alterations in the components of the cfDNA pool by accurately estimating the contribution of multiple tissues, even when supplied with an imperfect reference atlas. AVAILABILITY AND IMPLEMENTATION: MetDecode is available at https://github.com/JorisVermeeschLab/MetDecode.
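The core reference-based deconvolution step can be illustrated with a toy non-negative least-squares sketch on synthetic data. This is not the MetDecode implementation (which additionally learns atlas-absent contributors and weights marker regions by coverage); it only shows the basic idea of estimating tissue proportions from a methylation atlas:

```python
import numpy as np
from scipy.optimize import nnls

# Toy reference atlas: methylation fractions for 50 marker regions (rows)
# across 4 tissue types (columns).
rng = np.random.default_rng(0)
atlas = rng.uniform(0.0, 1.0, size=(50, 4))

# Simulate a cfDNA profile as a known mixture of the atlas tissues.
true_props = np.array([0.70, 0.20, 0.07, 0.03])
cfdna = atlas @ true_props

# Non-negative least squares recovers the mixing proportions; renormalise
# so the estimated contributions sum to one.
est, _ = nnls(atlas, cfdna)
est /= est.sum()
print(np.round(est, 3))  # close to the simulated true proportions
```

On noiseless synthetic data the recovery is essentially exact; real cfDNA adds coverage noise and unknown contributors, which is precisely what the full model addresses.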
Subject(s)
Algorithms , Biomarkers, Tumor , Cell-Free Nucleic Acids , DNA Methylation , Neoplasms , Humans , Neoplasms/genetics , Cell-Free Nucleic Acids/blood , Biomarkers, Tumor/blood
ABSTRACT
MOTIVATION: The prediction of reliable Drug-Target Interactions (DTIs) is a key task in computer-aided drug design and repurposing. Here, we present a new data fusion-based approach for DTI prediction built on top of the NXTfusion library, which generalizes the Matrix Factorization paradigm by extending it to nonlinear inference over Entity-Relation graphs. RESULTS: We benchmarked our approach on five datasets and compared our models against state-of-the-art methods. Our models outperform most of the existing methods and, simultaneously, retain the flexibility both to predict DTIs as binary classification and to regress the real-valued drug-target affinity, competing with models built explicitly for each task. Moreover, our findings suggest that the validation of DTI methods should be stricter than what has been proposed in some previous studies, focusing more on mimicking real-life DTI settings where predictions for previously unseen drugs, proteins, and drug-protein pairs are needed. These settings are exactly the context in which the benefit of integrating heterogeneous information with our Entity-Relation data fusion approach is most evident. AVAILABILITY AND IMPLEMENTATION: All software and data are available at https://github.com/eugeniomazzone/CPI-NXTFusion and https://pypi.org/project/NXTfusion/.
Subject(s)
Drug Development , Software , Proteins , Drug Interactions , Drug Design
ABSTRACT
Federated multipartner machine learning has been touted as an appealing and efficient method to increase the effective training data volume and thereby the predictivity of models, particularly when the generation of training data is resource-intensive. In the landmark MELLODDY project, indeed, each of ten pharmaceutical companies realized aggregated improvements on its own classification or regression models through federated learning. To this end, they leveraged a novel implementation extending multitask learning across partners, on a platform audited for privacy and security. The experiments involved an unprecedented cross-pharma data set of 2.6+ billion confidential experimental activity data points, documenting 21+ million physical small molecules and 40+ thousand assays in on-target and secondary pharmacodynamics and pharmacokinetics. Appropriate complementary metrics were developed to evaluate the predictive performance in the federated setting. In addition to predictive performance increases in labeled space, the results point toward an extended applicability domain in federated learning. Increases in collective training data volume, including by means of auxiliary data resulting from single concentration high-throughput and imaging assays, continued to boost predictive performance, albeit with a saturating return. Markedly higher improvements were observed for the pharmacokinetics and safety panel assay-based task subsets.
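The principle of federated multipartner learning can be sketched with a minimal federated-averaging toy: each partner fits a model on its private data and only the model weights leave each site. This is an illustration of the general idea only; MELLODDY itself uses a far more elaborate, audited secure multitask platform, not weight averaging of linear models:

```python
import numpy as np

# A shared underlying structure-activity signal, unknown to the partners.
rng = np.random.default_rng(5)
w_true = np.array([1.0, -2.0, 0.5])

def local_weights(n):
    """Fit a linear model on one partner's private data; only w is shared."""
    X = rng.normal(size=(n, 3))
    y = X @ w_true + 0.1 * rng.normal(size=n)
    w, *_ = np.linalg.lstsq(X, y, rcond=None)
    return w

# Three partners train locally on differently sized private datasets.
sizes = (200, 300, 250)
partner_models = [local_weights(n) for n in sizes]

# The aggregator sees only the weights, never the raw activity data.
w_global = np.average(partner_models, axis=0, weights=sizes)
print(np.round(w_global, 2))  # close to the underlying signal
```

The federated model benefits from the pooled effective data volume while each partner's experimental data points stay on-site, which is the property the project's privacy and security auditing is meant to guarantee at scale.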
Subject(s)
Benchmarking , Quantitative Structure-Activity Relationship , Biological Assay , Machine Learning
ABSTRACT
In many cases, the unprecedented availability of data provided by high-throughput sequencing has shifted the bottleneck from data availability to data interpretation, thus delaying the promised breakthroughs in genetics and precision medicine, as far as human genetics is concerned, and in phenotype prediction to improve plant adaptation to climate change and resistance to bioaggressors, as far as plant sciences are concerned. In this paper, we propose a novel Genome Interpretation paradigm, which aims at directly modeling the genotype-to-phenotype relationship, and we focus on A. thaliana since it is the best-studied model organism in plant genetics. Our model, called Galiana, is the first end-to-end Neural Network (NN) approach following the genomes in/phenotypes out paradigm, and it is trained to predict 288 real-valued Arabidopsis thaliana phenotypes from Whole Genome Sequencing data. We show that 75 of these phenotypes are predicted with a Pearson correlation ≥0.4, and are mostly related to flowering traits. We show that our end-to-end NN approach achieves better performance and larger phenotype coverage than models predicting single phenotypes from the GWAS-derived known associated genes. Galiana is also fully interpretable, thanks to gradient-based Saliency Map approaches. We followed this interpretation approach to identify 36 novel genes that are likely to be associated with flowering traits, finding evidence for 6 of them in the existing literature.
Subject(s)
Arabidopsis , Arabidopsis/genetics , Genome , Genome-Wide Association Study , Genotype , Neural Networks, Computer , Phenotype , Whole Genome Sequencing
ABSTRACT
MOTIVATION: Transcriptional regulation mechanisms allow cells to adapt and respond to external stimuli by altering gene expression. The possible cell transcriptional states are determined by the underlying gene regulatory network (GRN), and reliably inferring such network would be invaluable to understand biological processes and disease progression. RESULTS: In this article, we present a novel method for the inference of GRNs, called PORTIA, which is based on robust precision matrix estimation, and we show that it positively compares with state-of-the-art methods while being orders of magnitude faster. We extensively validated PORTIA using the DREAM and MERLIN+P datasets as benchmarks. In addition, we propose a novel scoring metric that builds on graph-theoretical concepts. AVAILABILITY AND IMPLEMENTATION: The code and instructions for data acquisition and full reproduction of our results are available at https://github.com/AntoinePassemiers/PORTIA-Manuscript. PORTIA is available on PyPI as a Python package (portia-grn). SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
Subject(s)
Algorithms , Gene Regulatory Networks , Gene Expression Regulation
ABSTRACT
MOTIVATION: Modern bioinformatics is facing increasingly complex problems to solve, and we are rapidly approaching an era in which the ability to seamlessly integrate heterogeneous sources of information will be crucial for scientific progress. Here, we present a novel nonlinear data fusion framework that generalizes the conventional matrix factorization paradigm, allowing inference over arbitrary entity-relation graphs, and we apply it to the prediction of protein-protein interactions (PPIs). Improving our knowledge of PPI networks at the proteome scale is indeed crucial to understand protein function, physiological and disease states, and cell life in general. RESULTS: We devised three data fusion-based models for the proteome-level prediction of PPIs, and we show that our method outperforms state-of-the-art approaches on common benchmarks. Moreover, we investigate its predictions on newly published PPIs, showing that these new data exhibit a clear shift in their underlying distributions, and we thus train and test our models on this extended dataset. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
ABSTRACT
MOTIVATION: Proteins able to undergo liquid-liquid phase separation (LLPS) in vivo and in vitro are drawing a lot of interest, due to their functional relevance for cell life. Nevertheless, the proteome-scale experimental screening of these proteins seems unfeasible, because besides being expensive and time-consuming, LLPS is heavily influenced by multiple environmental conditions such as concentration, pH and temperature, thus requiring a combinatorial number of experiments for each protein. RESULTS: To overcome this problem, we propose Droppler, a neural network model able to predict the LLPS behavior of proteins given specified experimental conditions, effectively predicting the outcome of in vitro experiments. Our model can be used to rapidly screen proteins and experimental conditions searching for LLPS, thus reducing the search space that needs to be covered experimentally. We experimentally validate Droppler's predictions on the TAR DNA-binding protein under different experimental conditions, showing the consistency of its predictions. AVAILABILITY AND IMPLEMENTATION: A python implementation of Droppler is available at https://bitbucket.org/grogdrinker/droppler. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
ABSTRACT
Notwithstanding important advances in the context of single-variant pathogenicity identification, novel breakthroughs in discerning the origins of many rare diseases require methods able to identify more complex genetic models. We present here the Variant Combinations Pathogenicity Predictor (VarCoPP), a machine-learning approach that identifies pathogenic variant combinations in gene pairs (called digenic or bilocus variant combinations). We show that the results produced by this method are highly accurate and precise, an efficacy confirmed by validating the method on recently published independent disease-causing data. Confidence labels of 95% and 99% are identified, representing the probability of a bilocus combination being a true pathogenic result, providing geneticists with rational markers to evaluate the most relevant pathogenic combinations and limit the search space and time. Finally, VarCoPP has been designed to act as an interpretable method that can provide explanations on why a bilocus combination is predicted as pathogenic and which biological information is important for that prediction. This work provides an important step toward the genetic understanding of rare diseases, paving the way to clinical knowledge and improved patient care.
Subject(s)
Genetic Predisposition to Disease/genetics , Genetic Variation/genetics , Rare Diseases/genetics , Genetic Markers/genetics , Humans
ABSTRACT
BACKGROUND: Identifying variants that drive tumor progression (driver variants) and distinguishing these from variants that are a byproduct of the uncontrolled cell growth in cancer (passenger variants) is a crucial step for understanding tumorigenesis and precision oncology. Various bioinformatics methods have attempted to solve this complex task. RESULTS: In this study, we investigate the assumptions on which these methods are based, showing that the different definitions of driver and passenger variants influence the difficulty of the prediction task. More importantly, we prove that the data sets have a construction bias which prevents the machine learning (ML) methods from actually learning variant-level functional effects, despite their excellent performance. This effect results from the fact that in these data sets, the driver variants map to a few driver genes, while the passenger variants spread across thousands of genes, and thus just learning to recognize driver genes provides almost perfect predictions. CONCLUSIONS: To mitigate this issue, we propose a novel data set that minimizes this bias by ensuring that all genes covered by the data contain both driver and passenger variants. As a result, we show that the tested predictors experience a significant drop in performance, which should not be considered as poorer modeling, but rather as correcting unwarranted optimism. Finally, we propose a weighting procedure to completely eliminate the gene effects on such predictions, thus precisely evaluating the ability of predictors to model the functional effects of single variants, and we show that indeed this task is still open.
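The construction bias described above can be reproduced in a few lines of simulation: if drivers cluster in a handful of genes while passengers spread over thousands, a classifier that sees only the gene identity (no variant-level information at all) already looks excellent. A toy sketch of that effect, with made-up gene counts:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Simulate the biased data-set construction: 500 drivers drawn from only
# 10 genes, 500 passengers spread over ~2000 other genes. The ONLY feature
# is the gene identifier -- no variant-level effect is encoded.
rng = np.random.default_rng(4)
driver_genes = rng.integers(0, 10, size=500)
passenger_genes = rng.integers(10, 2000, size=500)
X = np.concatenate([driver_genes, passenger_genes]).reshape(-1, 1)
y = np.concatenate([np.ones(500), np.zeros(500)])

# A model that knows nothing about functional effects still scores
# near-perfectly, purely by recognizing driver genes.
acc = cross_val_score(RandomForestClassifier(random_state=0), X, y, cv=5).mean()
print(round(acc, 2))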
Subject(s)
Carcinogenesis/genetics , Disease Progression , Machine Learning , Medical Oncology/instrumentation , Neoplasms/genetics , Precision Medicine/instrumentation , Neoplasms/pathology
ABSTRACT
Protein solubility is a key aspect for many biotechnological, biomedical and industrial processes, such as the production of active proteins and antibodies. In addition, understanding the molecular determinants of the solubility of proteins may be crucial to shed light on the molecular mechanisms of diseases caused by aggregation processes such as amyloidosis. Here we present SKADE, a novel Neural Network protein solubility predictor, and we show how it can provide novel insight into protein solubility mechanisms, thanks to its neural attention architecture. First, we show that SKADE compares favourably with state-of-the-art tools while using just the protein sequence as input. Then, thanks to the neural attention mechanism, we use SKADE to investigate the patterns learned during training and we analyse its decision process. We use this peculiarity to show that, while the attention profiles do not correlate with obvious sequence aspects such as biophysical properties of the amino acids, they suggest that the N- and C-termini are the most relevant regions for solubility prediction and are predictive for complex emergent properties such as aggregation-prone regions involved in beta-amyloidosis and contact density. Moreover, SKADE is able to identify mutations that increase or decrease the overall solubility of the protein, allowing it to be used to perform large-scale in silico mutagenesis of proteins in order to maximize their solubility.
Subject(s)
Computational Biology/methods , Nerve Net/physiology , Solubility , Algorithms , Amino Acid Sequence/physiology , Amino Acids , Animals , Computer Simulation , Humans , Models, Molecular , Protein Conformation , Proteins/chemistry , Proteins/metabolism , Software
ABSTRACT
MOTIVATION: Eukaryotic cells contain different membrane-delimited compartments, which are crucial for the biochemical reactions necessary to sustain cell life. Recent studies showed that cells can also trigger the formation of membraneless organelles composed of phase-separated proteins to respond to various stimuli. These condensates provide new ways to control the reactions, and phase-separation proteins (PSPs) are thus revolutionizing how cellular organization is conceived. The small number of experimentally validated proteins, and the difficulty in discovering them, remain bottlenecks in PSP research. RESULTS: Here we present PSPer, the first in silico screening tool for prion-like RNA-binding PSPs. We show that it can prioritize PSPs among proteins containing similar RNA-binding domains, intrinsically disordered regions and prions. PSPer is thus suitable to screen proteomes, identifying the most likely PSPs for further experimental investigation. Moreover, its predictions are fully interpretable in the sense that it assigns specific functional regions to the predicted proteins, providing valuable information for experimental investigation of targeted mutations on these regions. Finally, we show that it can estimate the ability of artificially designed proteins to form condensates (r=-0.87), thus providing an in silico screening tool for protein design experiments. AVAILABILITY AND IMPLEMENTATION: PSPer is available at bio2byte.com/psp. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
Subject(s)
RNA-Binding Proteins/metabolism , Organelles , Prions , Proteome
ABSTRACT
SUMMARY: Inferring a Gene Regulatory Network (GRN) from gene expression data is a computationally expensive task, exacerbated by increasing data sizes due to advances in high-throughput gene profiling technology, such as single-cell RNA-seq. To equip researchers with a toolset to infer GRNs from large expression datasets, we propose GRNBoost2 and the Arboreto framework. GRNBoost2 is an efficient algorithm for regulatory network inference using gradient boosting, based on the GENIE3 architecture. Arboreto is a computational framework that scales up GRN inference algorithms complying with this architecture. Arboreto includes both GRNBoost2 and an improved implementation of GENIE3, as a user-friendly open source Python package. AVAILABILITY AND IMPLEMENTATION: Arboreto is available under the 3-Clause BSD license at http://arboreto.readthedocs.io. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
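The GENIE3 architecture that GRNBoost2 builds on decomposes network inference into one regression per target gene: regress each gene's expression on all candidate regulators and read off feature importances as link scores. A minimal single-target sketch on synthetic data (using scikit-learn's gradient boosting rather than the Arboreto implementation):

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

# Toy data: expression of 5 candidate regulators across 300 samples;
# the target gene is actually driven by regulators 0 and 3.
rng = np.random.default_rng(2)
tfs = rng.normal(size=(300, 5))
target = 2.0 * tfs[:, 0] - 1.5 * tfs[:, 3] + 0.2 * rng.normal(size=300)

# GENIE3-style step: fit a tree ensemble for this target and use the
# feature importances as putative regulator->target link scores.
model = GradientBoostingRegressor(n_estimators=100, random_state=0)
model.fit(tfs, target)
scores = model.feature_importances_
print(np.argsort(scores)[::-1][:2])  # the two true regulators rank on top
```

Repeating this fit for every gene in the expression matrix is embarrassingly parallel, which is exactly the property the Arboreto framework exploits to scale inference to large single-cell datasets.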
Subject(s)
Algorithms , Gene Regulatory Networks , Computational Biology , Gene Expression , Software
ABSTRACT
BACKGROUND: We need high-quality data to assess the determinants of COVID-19 severity in people with MS (PwMS). Several studies have recently emerged, but there is great benefit in aligning data collection efforts at a global scale. OBJECTIVES: Our mission is to scale up COVID-19 data collection efforts and provide the MS community with data-driven insights as soon as possible. METHODS: Numerous stakeholders were brought together. Small dedicated interdisciplinary task forces were created to speed up the formulation of the study design and work plan. The first step was to agree upon a COVID-19 MS core data set. Second, we worked on providing a user-friendly and rapid pipeline to share COVID-19 data at a global scale. RESULTS: The COVID-19 MS core data set was agreed within 48 hours. To date, 23 data collection partners are involved and the first data imports have been performed successfully. Data processing and analysis is an ongoing process. CONCLUSIONS: We reached a consensus on a core data set and established data sharing processes with multiple partners to address an urgent need for information to guide clinical practice. First results show that partners are motivated to share data to attain the ultimate joint goal: to better understand the effect of COVID-19 in PwMS.
Subject(s)
Coronavirus Infections/physiopathology , Multiple Sclerosis/therapy , Pneumonia, Viral/physiopathology , Registries , Betacoronavirus , COVID-19 , Coronavirus Infections/complications , Coronavirus Infections/therapy , Data Collection , Humans , Information Dissemination , International Cooperation , Multiple Sclerosis/complications , Pandemics , Pneumonia, Viral/complications , Pneumonia, Viral/therapy , Risk Factors , SARS-CoV-2 , Treatment Outcome
ABSTRACT
In drug discovery, knowledge of the graph structure of chemical compounds is essential. Many thousands of scientific articles and patents in chemistry and pharmaceutical sciences have investigated chemical compounds, but in many cases, the details of the structure of these chemical compounds are published only as an image. A tool to analyze these images automatically and convert them into a chemical graph structure would be useful for many applications, such as drug discovery. A few such tools are available, and they are mostly derived from optical character recognition. However, our evaluation of the performance of these tools reveals that they often make mistakes in recognizing the correct bond multiplicity and stereochemical information. In addition, errors sometimes even lead to missing atoms in the resulting graph. In our work, we address these issues by developing a compound recognition method based on machine learning. More specifically, we develop a deep neural network model for optical compound recognition. The deep learning solution presented here consists of a segmentation model, followed by three classification models that predict atom locations, bonds, and charges. Furthermore, this model not only predicts the graph structure of the molecule but also provides all information necessary to relate each component of the resulting graph to the source image. This solution is scalable and can rapidly process thousands of images. Finally, we empirically compare the proposed method with the well-established tool OSRA and observe significant error reduction.
Subject(s)
Deep Learning , Drug Discovery , Machine Learning , Neural Networks, Computer
Subject(s)
Biometric Identification/ethics , Databases, Genetic/ethics , Genetic Privacy/ethics , Genetic Privacy/legislation & jurisprudence , Genome, Human/genetics , Human Rights/legislation & jurisprudence , Sequence Analysis, DNA/ethics , Biometric Identification/methods , China , DNA Fingerprinting/ethics , Federal Government , Genetic Privacy/standards , Human Rights/ethics , Human Rights/standards , Humans , Internationality , Pedigree , Population Surveillance/methods
ABSTRACT
Motivation: Evolutionary information is crucial for the annotation of proteins in bioinformatics. The number of retrieved homologs often correlates with the quality of predicted protein annotations related to structure or function. With a growing number of sequences available, fast and reliable methods for homology detection are essential, as they have a direct impact on predicted protein annotations. Results: We developed a discriminative, alignment-free algorithm for homology detection with quasi-linear complexity, theoretically enabling much faster homology searches. To reach this goal, we convert the protein sequence into numeric biophysical representations. These are shrunk to a fixed length using a novel vector quantization method based on Discrete Cosine Transform compression. We then compute, for each compressed representation, similarity scores between proteins with the Dynamic Time Warping algorithm, and we feed them into a Random Forest. The performance of the resulting method, WARP, is comparable with that of state-of-the-art methods. Availability and implementation: The method is available at http://ibsquare.be/warp. Supplementary information: Supplementary data are available at Bioinformatics online.
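The pipeline sketched above (biophysical signal, DCT compression to a fixed length, DTW similarity) can be illustrated with a small toy example. The hydrophobicity scale, sequences and coefficient count below are made-up stand-ins, not WARP's actual representations or parameters:

```python
import numpy as np
from scipy.fft import dct

# Kyte-Doolittle hydrophobicity: one way to turn a protein sequence into
# a numeric signal (the method uses several biophysical representations).
KD = {"A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5, "Q": -3.5,
      "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5, "L": 3.8, "K": -3.9,
      "M": 1.9, "F": 2.8, "P": -1.6, "S": -0.8, "T": -0.7, "W": -0.9,
      "Y": -1.3, "V": 4.2}

def compress(seq, k=8):
    """DCT-compress a variable-length signal to k coefficients."""
    signal = np.array([KD[a] for a in seq])
    return dct(signal, norm="ortho")[:k]

def dtw(a, b):
    """Classic O(len(a) * len(b)) dynamic-time-warping distance."""
    n, m = len(a), len(b)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(a[i - 1] - b[j - 1])
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[n, m]

a = compress("MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ")
b = compress("MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVA")  # near-identical homolog
c = compress("GGGGPPPPGGGGPPPPGGGG")               # unrelated sequence
print(dtw(a, b) < dtw(a, c))  # homologous pair scores closer
```

Because every sequence is reduced to the same small number of coefficients, the DTW comparison runs on fixed-length vectors regardless of protein length, which is what makes the quasi-linear search complexity plausible.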
Subject(s)
High-Throughput Nucleotide Sequencing , Proteins/chemistry , Algorithms , Amino Acid Sequence , Data Compression , Molecular Sequence Annotation , Software , Time Factors
ABSTRACT
Motivation: Most gene prioritization methods model each disease or phenotype individually, but this fails to capture patterns common to several diseases or phenotypes. To overcome this limitation, we formulate the gene prioritization task as the factorization of a sparsely filled gene-phenotype matrix, where the objective is to predict the unknown matrix entries. To deliver more accurate gene-phenotype matrix completion, we extend classical Bayesian matrix factorization to work with multiple side information sources. The availability of side information allows us to make non-trivial predictions for genes for which no previous disease association is known. Results: Our gene prioritization method innovatively integrates not only data sources describing genes, but also data sources describing Human Phenotype Ontology terms. Experimental results on our benchmarks show that our proposed model can effectively improve accuracy over the well-established gene prioritization method Endeavour. In particular, our proposed method offers promising results on diseases of the nervous system; diseases of the eye and adnexa; endocrine, nutritional and metabolic diseases; and congenital malformations, deformations and chromosomal abnormalities, when compared to Endeavour. Availability and implementation: The Bayesian data fusion method is implemented as a Python/C++ package: https://github.com/jaak-s/macau. It is also available as a Julia package: https://github.com/jaak-s/BayesianDataFusion.jl. All data and benchmarks generated or analyzed during this study can be downloaded at https://owncloud.esat.kuleuven.be/index.php/s/UGb89WfkZwMYoTn. Supplementary information: Supplementary data are available at Bioinformatics online.
Subject(s)
Computational Biology/methods , Gene Ontology , Genetic Predisposition to Disease , Information Storage and Retrieval/methods , Software , Algorithms , Bayes Theorem , Humans
ABSTRACT
Motivation: Computational gene prioritization can aid in disease gene identification. Here, we propose pBRIT (prioritization using Bayesian Ridge regression and Information Theoretic model), a novel adaptive and scalable prioritization tool, integrating Pubmed abstracts, Gene Ontology, sequence similarities, Mammalian and Human Phenotype Ontology, Pathway, Interactions, Disease Ontology, Gene Association database and Human Genome Epidemiology database into the prediction model. We explore and address the effects of sparsity and inter-feature dependencies within annotation sources, and the impact of bias towards specific annotations. Results: pBRIT models feature dependencies and sparsity by an information-theoretic (data-driven) approach and applies intermediate-integration-based data fusion. Following the hypothesis that genes underlying similar diseases will share functional and phenotype characteristics, it incorporates Bayesian Ridge regression to learn a linear mapping between functional and phenotype annotations. Genes are prioritized on phenotypic concordance with the training genes. We evaluated pBRIT against nine existing methods, and on over 2000 HPO-gene associations retrieved after construction of the pBRIT data sources. We achieve maximum AUC scores ranging from 0.92 to 0.96 against benchmark datasets and of 0.80 against the time-stamped HPO entries, indicating good performance with high sensitivity and specificity. Our model shows stable performance with regard to changes in the underlying annotation data, and is fast and scalable for implementation in routine pipelines. Availability and implementation: http://biomina.be/apps/pbrit/; https://bitbucket.org/medgenua/pbrit. Supplementary information: Supplementary data are available at Bioinformatics online.
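The Bayesian Ridge regression step can be sketched on synthetic data: learn a linear map from annotation-derived features to a phenotype-concordance score, then rank held-out candidate genes by their predicted score. The feature construction below is entirely made up; it only illustrates the regression-then-rank pattern, not pBRIT's actual intermediate-integration fusion:

```python
import numpy as np
from sklearn.linear_model import BayesianRidge

# Toy setup: 200 genes with 30 annotation-derived features each, and a
# synthetic phenotype-concordance score that is a noisy linear function
# of those features.
rng = np.random.default_rng(6)
n_genes, n_feats = 200, 30
F = rng.normal(size=(n_genes, n_feats))
w = rng.normal(size=n_feats)
pheno_score = F @ w + 0.1 * rng.normal(size=n_genes)

# Fit the Bayesian Ridge mapping on 150 "training" genes and score the
# remaining 50 candidates.
model = BayesianRidge().fit(F[:150], pheno_score[:150])
pred = model.predict(F[150:])

# Candidates with the highest predicted concordance are prioritised first.
ranking = np.argsort(pred)[::-1]
print(ranking[:5])
```

The Bayesian treatment of the ridge penalty means the regularization strength is estimated from the data rather than tuned by hand, which suits a pipeline whose annotation sources change over time.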
Subject(s)
Biological Ontologies , Computational Biology/methods , Information Storage and Retrieval/methods , Phenotype , Software , Animals , Bayes Theorem , Genomics/methods , Humans , Sequence Analysis, DNA/methods
ABSTRACT
To further our understanding of the complexity and genetic heterogeneity of rare diseases, it has become essential to shed light on how combinations of variants in different genes are responsible for a disease phenotype. With the appearance of a resource on digenic diseases, it has become possible to evaluate how digenic combinations differ in terms of the phenotypes they produce. All instances in this resource were assigned to two classes of digenic effects, annotated as the true digenic and composite classes. Whereas in the true digenic class variants in both genes are required for developing the disease, in the composite class a variant in one gene is sufficient to produce the phenotype, but an additional variant in a second gene impacts the disease phenotype or alters the age of onset. We show that a combination of variant, gene and higher-level features can differentiate between these two classes with high accuracy. Moreover, we show via the analysis of three digenic disorders that a digenic effect decision profile, extracted from the predictive model, explains why an instance was assigned to either of the two classes. Together, our results show that digenic disease data generates novel insights, providing a glimpse into the oligogenic realm.
Subject(s)
Epistasis, Genetic/physiology , Genetic Diseases, Inborn/genetics , Mutation/physiology , Computational Biology/methods , Datasets as Topic , Genetic Association Studies/methods , Genetic Diseases, Inborn/diagnosis , Genetic Predisposition to Disease , Humans , Models, Genetic , Phenotype , Prognosis , Validation Studies as TopicABSTRACT
ABSTRACT
BACKGROUND: The deployment of Genome-wide association studies (GWASs) requires genomic information of a large population to produce reliable results. This raises significant privacy concerns, making people hesitate to contribute their genetic information to such studies. RESULTS: We propose two provably secure solutions to address this challenge: (1) a somewhat homomorphic encryption (HE) approach, and (2) a secure multiparty computation (MPC) approach. Unlike previous work, our approach does not rely on adding noise to the input data, nor does it reveal any information about the patients. Our protocols aim to prevent data breaches by calculating the χ2 statistic in a privacy-preserving manner, without revealing any information other than whether the statistic is significant or not. Specifically, our protocols compute the χ2 statistic, but only return a yes/no answer, indicating significance. By not revealing the statistic value itself but only the significance, our approach thwarts attacks exploiting statistic values. We significantly increased the efficiency of our HE protocols by introducing a new masking technique to perform the secure comparison that is necessary for determining significance. CONCLUSIONS: We show that full-scale privacy-preserving GWAS is practical, as long as the statistics can be computed by low degree polynomials. Our implementations demonstrated that both approaches are efficient. The secure multiparty computation technique completes its execution in approximately 2 ms for data contributed by one million subjects.
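In the clear, the underlying computation is just a χ2 test on a genotype contingency table followed by a threshold comparison. A plain-text sketch with toy counts (the protocols described above perform this arithmetic homomorphically or via secret sharing, and reveal only the final boolean, never the statistic):

```python
from scipy.stats import chi2, chi2_contingency

# Toy case/control genotype counts (AA, Aa, aa) at a single variant.
table = [[240, 480, 280],   # cases
         [300, 500, 200]]   # controls

stat, p, dof, _ = chi2_contingency(table)

# Only this yes/no outcome is disclosed by the privacy-preserving
# protocols; the statistic value itself stays hidden, thwarting
# attacks that exploit exact statistic values.
significant = stat > chi2.ppf(1 - 5e-8, dof)  # genome-wide threshold
print(significant)
```

Because the χ2 statistic is a low-degree polynomial in the cell counts, it fits the class of computations the paper identifies as practical under homomorphic encryption; the expensive part is the final secure comparison, which the new masking technique accelerates.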