1.
Brief Bioinform ; 25(1)2023 11 22.
Article in English | MEDLINE | ID: mdl-38018908

ABSTRACT

Multi-omic analyses are necessary to understand the complex biological processes taking place at the tissue and cell level, but also to make reliable predictions about, for example, disease outcome. Several linear methods exist that create a joint embedding using paired information per sample, but recently there has been a rise in the popularity of neural architectures that embed paired -omics into the same non-linear manifold. This work describes a head-to-head comparison of linear and non-linear joint embedding methods using both bulk and single-cell multi-modal datasets. We found that non-linear methods have a clear advantage with respect to linear ones for missing modality imputation. Performance comparisons in the downstream tasks of survival analysis for bulk tumor data and cell type classification for single-cell data lead to the following insights: First, concatenating the principal components of each modality is a competitive baseline and hard to beat if all modalities are available at test time. However, if we only have one modality available at test time, training a predictive model on the joint space of that modality can lead to performance improvements with respect to just using the unimodal principal components. Second, -omic profiles imputed by neural joint embedding methods are realistic enough to be used by a classifier trained on real data with limited performance drops. Taken together, our comparisons give hints as to which joint embedding to use for which downstream task. Overall, product-of-experts performed well in most tasks and was reasonably fast, while early integration (concatenation) of modalities did quite poorly.


Subject(s)
Multiomics , Neoplasms , Humans
2.
Bioinformatics ; 37(2): 162-170, 2021 04 19.
Article in English | MEDLINE | ID: mdl-32797179

ABSTRACT

MOTIVATION: Protein function prediction is a difficult bioinformatics problem. Many recent methods use deep neural networks to learn complex sequence representations and predict function from these. Deep supervised models require a lot of labeled training data which are not available for this task. However, a very large amount of protein sequences without functional labels is available. RESULTS: We applied an existing deep sequence model that had been pretrained in an unsupervised setting on the supervised task of protein molecular function prediction. We found that this complex feature representation is effective for this task, outperforming hand-crafted features such as one-hot encoding of amino acids, k-mer counts, secondary structure and backbone angles. Also, it partly negates the need for complex prediction models, as a two-layer perceptron was enough to achieve competitive performance in the third Critical Assessment of Functional Annotation benchmark. We also show that combining this sequence representation with protein 3D structure information does not lead to performance improvement, hinting that 3D structure is also potentially learned during the unsupervised pretraining. AVAILABILITY AND IMPLEMENTATION: Implementations of all used models can be found at https://github.com/stamakro/GCN-for-Structure-and-Function. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.


Subject(s)
Proteins , Software , Amino Acid Sequence , Neural Networks, Computer , Proteins/genetics
3.
Bioinformatics ; 36(4): 1182-1190, 2020 02 15.
Article in English | MEDLINE | ID: mdl-31562759

ABSTRACT

MOTIVATION: Co-expression of two genes across different conditions is indicative of their involvement in the same biological process. However, when using RNA-Seq datasets with many experimental conditions from diverse sources, only a subset of the experimental conditions is expected to be relevant for finding genes related to a particular Gene Ontology (GO) term. Therefore, we hypothesize that when the purpose is to find similarly functioning genes, the co-expression of genes should not be determined on all samples but only on those samples informative for the GO term of interest. RESULTS: To address this, we developed Metric Learning for Co-expression (MLC), a fast algorithm that assigns a GO-term-specific weight to each expression sample. The goal is to obtain a weighted co-expression measure that is more suitable than the unweighted Pearson correlation for applying Guilt-By-Association-based function predictions. More specifically, if two genes are annotated with a given GO term, MLC tries to maximize their weighted co-expression and, in addition, if one of them is not annotated with that term, the weighted co-expression is minimized. Our experiments on publicly available Arabidopsis thaliana RNA-Seq data demonstrate that MLC outperforms standard Pearson correlation in term-centric performance. Moreover, our method is particularly good at more specific terms, which are the most interesting. Finally, by observing the sample weights for a particular GO term, one can identify which experiments are important for learning that term and potentially identify novel conditions that are relevant, as demonstrated by experiments in both A. thaliana and Pseudomonas aeruginosa. AVAILABILITY AND IMPLEMENTATION: MLC is available as a Python package at www.github.com/stamakro/MLC. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.


Subject(s)
Algorithms , RNA-Seq , Gene Ontology , Phenotype
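
The abstract of record 3 contrasts a sample-weighted co-expression measure with the unweighted Pearson correlation. The following is a minimal sketch of a weighted Pearson correlation only; it is not the MLC package, and it does not learn the weights. The function name `weighted_pearson`, the expression values, and the weight vectors are all hypothetical and chosen for illustration.

```python
import math


def weighted_pearson(x, y, w):
    """Pearson correlation of x and y using non-negative per-sample weights w."""
    if not (len(x) == len(y) == len(w)):
        raise ValueError("x, y and w must have the same length")
    total = sum(w)
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    # Weighted means of the two expression profiles.
    mx = sum(wi * xi for wi, xi in zip(w, x)) / total
    my = sum(wi * yi for wi, yi in zip(w, y)) / total
    # Weighted covariance and variances.
    cov = sum(wi * (xi - mx) * (yi - my) for wi, xi, yi in zip(w, x, y)) / total
    vx = sum(wi * (xi - mx) ** 2 for wi, xi in zip(w, x)) / total
    vy = sum(wi * (yi - my) ** 2 for wi, yi in zip(w, y)) / total
    if vx == 0 or vy == 0:
        raise ValueError("a profile has zero weighted variance")
    return cov / math.sqrt(vx * vy)


# Hypothetical expression of two genes across five samples.
gene_a = [1.0, 2.0, 3.0, 9.0, 1.0]
gene_b = [2.0, 4.0, 6.0, 1.0, 8.0]

uniform = [1.0, 1.0, 1.0, 1.0, 1.0]      # equivalent to ordinary Pearson
informative = [1.0, 1.0, 1.0, 0.0, 0.0]  # assumed weights: only samples 1-3 count

r_all = weighted_pearson(gene_a, gene_b, uniform)
r_weighted = weighted_pearson(gene_a, gene_b, informative)
```

With uniform weights the two hypothetical genes look anti-correlated, while down-weighting the last two samples exposes the co-expression in the first three. In the setting the abstract describes, such weights would be learned per GO term rather than set by hand.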
4.
Bioinformatics ; 35(7): 1116-1124, 2019 04 01.
Article in English | MEDLINE | ID: mdl-30169569

ABSTRACT

MOTIVATION: Most automatic functional annotation methods assign Gene Ontology (GO) terms to proteins based on annotations of highly similar proteins. We advocate that proteins that are less similar are still informative. Also, despite their simplicity and structure, GO terms seem to be hard for computers to learn, in particular the Biological Process ontology, which has the most terms (>29 000). We propose to use Label-Space Dimensionality Reduction (LSDR) techniques to exploit the redundancy of GO terms and transform them into a more compact latent representation that is easier to predict. RESULTS: We compare proteins using a sequence similarity profile (SSP) to a set of annotated training proteins. We introduce two new LSDR methods, one based on the structure of the GO, and one based on semantic similarity of terms. We show that these LSDR methods, as well as three existing ones, improve the Critical Assessment of Functional Annotation performance of several function prediction algorithms. Cross-validation experiments on Arabidopsis thaliana proteins pinpoint the superiority of our GO-aware LSDR over generic LSDR. Our experiments on A. thaliana proteins show that the SSP representation in combination with a kNN classifier outperforms state-of-the-art and baseline methods in terms of cross-validated F-measure. AVAILABILITY AND IMPLEMENTATION: Source code for the experiments is available at https://github.com/stamakro/SSP-LSDR. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.


Subject(s)
Computational Biology , Software , Algorithms , Amino Acid Sequence , Gene Ontology , Molecular Sequence Annotation
5.
Genes (Basel) ; 15(6)2024 Jun 07.
Article in English | MEDLINE | ID: mdl-38927686

ABSTRACT

BACKGROUND: Patients with advanced-stage epithelial ovarian cancer (EOC) receive treatment with a poly-ADP ribose-polymerase (PARP) inhibitor (PARPi) as maintenance therapy after surgery and chemotherapy. Unfortunately, many patients experience disease progression because of acquired therapy resistance. This study aims to characterize epigenetic and genomic changes in cell-free DNA (cfDNA) associated with PARPi resistance. MATERIALS AND METHODS: Blood was taken from 31 EOC patients receiving PARPi therapy before treatment and at disease progression during/after treatment. Resistance was defined as disease progression within 6 months after starting PARPi and was seen in fifteen patients, while sixteen patients responded for 6 to 42 months. Blood cfDNA was evaluated via Modified Fast Aneuploidy Screening Test-Sequencing System (mFast-SeqS) to detect aneuploidy, via Methylated DNA Sequencing (MeD-seq) to find differentially methylated regions (DMRs), and via shallow whole-genome and -exome sequencing (shWGS, exome-seq) to define tumor fractions and mutational signatures. RESULTS: Aneuploid cfDNA was undetectable pre-treatment but observed in six patients post-treatment, in five resistant and one responding patient. Post-treatment ichorCNA analyses demonstrated in shWGS and exome-seq higher median tumor fractions in resistant (7% and 9%) than in sensitive patients (7% and 5%). SigMiner analyses detected predominantly mutational signatures linked to mismatch repair and chemotherapy. DESeq2 analyses of MeD-seq data revealed three methylation signatures and more tumor-specific DMRs in resistant than in responding patients in both pre- and post-treatment samples (274 vs. 30 DMRs, 190 vs. 57 DMRs, χ²-test p < 0.001). CONCLUSION: Our genome-wide Next-Generation Sequencing (NGS) analyses in PARPi-resistant patients identified epigenetic differences in blood before treatment, whereas genomic alterations were more frequently observed after progression. The epigenetic differences at baseline are especially interesting for further exploration as putative predictive biomarkers for PARPi resistance.


Subject(s)
Carcinoma, Ovarian Epithelial , DNA Methylation , Drug Resistance, Neoplasm , Epigenesis, Genetic , Ovarian Neoplasms , Poly(ADP-ribose) Polymerase Inhibitors , Humans , Female , Drug Resistance, Neoplasm/genetics , Middle Aged , Ovarian Neoplasms/genetics , Ovarian Neoplasms/drug therapy , Ovarian Neoplasms/pathology , Poly(ADP-ribose) Polymerase Inhibitors/therapeutic use , Aged , Carcinoma, Ovarian Epithelial/genetics , Carcinoma, Ovarian Epithelial/drug therapy , Carcinoma, Ovarian Epithelial/pathology , Adult , Aneuploidy , Genomics/methods
6.
NAR Genom Bioinform ; 5(2): lqad048, 2023 Jun.
Article in English | MEDLINE | ID: mdl-37274121

ABSTRACT

Cell-free DNA (cfDNA) are DNA fragments originating from dying cells that are detectable in bodily fluids, such as the plasma. Accelerated cell death, for example caused by disease, induces an elevated concentration of cfDNA. As a result, determining the cell type origins of cfDNA molecules can provide information about an individual's health. In this work, we aim to increase the sensitivity of methylation-based cell type deconvolution by adapting an existing method, CelFiE, which uses the methylation beta values of individual CpG sites to estimate cell type proportions. Our new method, CelFEER, instead differentiates cell types by the average methylation values within individual reads. We additionally improved the originally reported performance of CelFiE by using a new approach for finding marker regions that are differentially methylated between cell types. We show that CelFEER estimates cell type proportions with a higher correlation (r = 0.94 ± 0.04) than CelFiE (r = 0.86 ± 0.09) on simulated mixtures of cell types. Moreover, we show that the cell type proportion estimated by CelFEER can differentiate between ALS patients and healthy controls, between pregnant women in their first and third trimester, and between pregnant women with and without gestational diabetes.
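
The distinction the abstract of record 6 draws between per-CpG beta values and per-read average methylation can be illustrated with a small sketch. This is not the CelFiE or CelFEER implementation; the helper names `site_beta_values` and `read_averages`, the toy reads, and the assumption that every read covers the same four CpG sites are all hypothetical.

```python
def site_beta_values(reads):
    """Fraction of methylated calls at each CpG site, pooled over all reads."""
    n_sites = len(reads[0])
    return [sum(read[i] for read in reads) / len(reads) for i in range(n_sites)]


def read_averages(reads):
    """Average methylation within each individual read."""
    return [sum(read) / len(read) for read in reads]


# Each read is a list of methylation calls (1 = methylated, 0 = unmethylated)
# over the same four CpG sites. Both read sets are made up for illustration.
two_populations = [      # fully methylated and fully unmethylated reads mixed
    [1, 1, 1, 1],
    [0, 0, 0, 0],
    [1, 1, 1, 1],
    [0, 0, 0, 0],
]
one_population = [       # every read is half methylated
    [1, 0, 1, 0],
    [0, 1, 0, 1],
    [1, 0, 1, 0],
    [0, 1, 0, 1],
]

betas_two = site_beta_values(two_populations)
betas_one = site_beta_values(one_population)
reads_two = read_averages(two_populations)
reads_one = read_averages(one_population)
```

Both toy read sets give a beta value of 0.5 at every site, so site-level summaries cannot tell them apart, whereas the read averages differ (1.0 and 0.0 in the first set, 0.5 throughout in the second). This is one intuition for why read-level information could help separate cell types; how the actual method models read averages is not stated in the abstract.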

7.
PLoS One ; 18(10): e0292126, 2023.
Article in English | MEDLINE | ID: mdl-37796856

ABSTRACT

Deep generative models, such as variational autoencoders (VAE), have gained increasing attention in computational biology due to their ability to capture complex data manifolds which subsequently can be used to achieve better performance in downstream tasks, such as cancer type prediction or subtyping of cancer. However, these models are difficult to train due to the large number of hyperparameters that need to be tuned. To get a better understanding of the importance of the different hyperparameters, we examined six different VAE models when trained on TCGA transcriptomics data and evaluated on the downstream tasks of cluster agreement with cancer subtypes and survival analysis. We studied the effect of the latent space dimensionality, learning rate, optimizer, initialization and activation function on the quality of subsequent downstream tasks on the TCGA samples. We found β-TCVAE and DIP-VAE to have a good performance, on average, despite being more sensitive to hyperparameter selection. Based on these experiments, we derived recommendations for selecting the different hyperparameter settings. To ensure generalization, we tested all hyperparameter configurations on the GTEx dataset. We found a significant correlation (ρ = 0.7) between the hyperparameter effects on clustering performance in the TCGA and GTEx datasets. This highlights the robustness and generalizability of our recommendations. In addition, we examined whether the learned latent spaces capture biologically relevant information. Hereto, we measured the correlation and mutual information of the different representations with various data characteristics such as gender, age, days to metastasis, immune infiltration, and mutation signatures. We found that for all models the latent factors, in general, do not uniquely correlate with one of the data characteristics nor capture separable information in the latent factors even for models specifically designed for disentanglement.


Subject(s)
Benchmarking , Neoplasms , Humans , Transcriptome , Neoplasms/genetics , Gene Expression Profiling , Cluster Analysis
8.
Sci Rep ; 13(1): 10424, 2023 06 27.
Article in English | MEDLINE | ID: mdl-37369746

ABSTRACT

Next generation sequencing of cell-free DNA (cfDNA) is a promising method for treatment monitoring and therapy selection in metastatic breast cancer (MBC). However, distinguishing tumor-specific variants from sequencing artefacts and germline variation with low false discovery rate is challenging when using large targeted sequencing panels covering many tumor suppressor genes. To address this, we built a machine learning model to remove false positive variant calls and augmented it with additional filters to ensure selection of tumor-derived variants. We used cfDNA of 70 MBC patients profiled with both the small targeted Oncomine breast panel (Thermofisher) and the much larger Qiaseq Human Breast Cancer Panel (Qiagen). The model was trained on the panels' common regions using Oncomine hotspot mutations as ground truth. Applied to Qiaseq data, it achieved 35% sensitivity and 36% precision, outperforming basic filtering. For 20 patients we used germline DNA to filter for somatic variants and obtained 245 variants in total, while our model found seven variants, of which six were also detected using the germline strategy. In ten tumor-free individuals, our method detected in total one (potentially germline) variant, in contrast to 521 variants detected without our model. These results indicate that our model largely detects somatic variants.


Subject(s)
Breast Neoplasms , Cell-Free Nucleic Acids , Humans , Female , Breast Neoplasms/genetics , Cell-Free Nucleic Acids/genetics , Mutation , Breast , High-Throughput Nucleotide Sequencing , Machine Learning
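
Record 8 reports sensitivity and precision of variant calls against a ground-truth set. As a reminder of how these two figures are computed, here is a minimal sketch under the usual definitions (sensitivity = true positives / all true variants; precision = true positives / all called variants). The function name and the variant tuples are hypothetical and are not taken from the study's data.

```python
def sensitivity_and_precision(called, truth):
    """Return (sensitivity, precision) of a call set against a truth set."""
    called, truth = set(called), set(truth)
    if not called or not truth:
        raise ValueError("both the call set and the truth set must be non-empty")
    true_positives = len(called & truth)
    sensitivity = true_positives / len(truth)   # share of true variants recovered
    precision = true_positives / len(called)    # share of calls that are true
    return sensitivity, precision


# Hypothetical variants encoded as (chromosome, position, ref, alt).
truth_set = {
    ("chr17", 100, "C", "T"),
    ("chr17", 250, "G", "A"),
    ("chr3", 400, "A", "G"),
    ("chr3", 975, "T", "C"),
}
call_set = {
    ("chr17", 100, "C", "T"),   # true positive
    ("chr3", 400, "A", "G"),    # true positive
    ("chr1", 42, "G", "T"),     # false positive
}

sens, prec = sensitivity_and_precision(call_set, truth_set)
```

In this toy case two of four true variants are recovered and two of three calls are correct.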
9.
Evol Bioinform Online ; 17: 11769343211062608, 2021.
Article in English | MEDLINE | ID: mdl-34880594

ABSTRACT

Computationally annotating proteins with a molecular function is a difficult problem that is made even harder due to the limited amount of available labeled protein training data. Unsupervised protein embeddings partly circumvent this limitation by learning a universal protein representation from many unlabeled sequences. Such embeddings incorporate contextual information of amino acids, thereby modeling the underlying principles of protein sequences insensitive to the context of species. We used an existing pre-trained protein embedding method and subjected its molecular function prediction performance to detailed characterization, first to advance the understanding of protein language models, and second to determine areas of improvement. Then, we applied the model in a transfer learning task by training a function predictor based on the embeddings of annotated protein sequences of one training species and making predictions on the proteins of several test species with varying evolutionary distance. We show that this approach successfully generalizes knowledge about protein function from one eukaryotic species to various other species, outperforming both an alignment-based and a supervised-learning-based baseline. This implies that such a method could be effective for molecular function prediction in inadequately annotated species from understudied taxonomic kingdoms.

10.
PLoS One ; 15(11): e0242723, 2020.
Article in English | MEDLINE | ID: mdl-33237964

ABSTRACT

Physical interaction between two proteins is strong evidence that the proteins are involved in the same biological process, making Protein-Protein Interaction (PPI) networks a valuable data resource for predicting the cellular functions of proteins. However, PPI networks are largely incomplete for non-model species. Here, we tested to what extent these incomplete networks are still useful for genome-wide function prediction. We used two network-based classifiers to predict Biological Process Gene Ontology terms from protein interaction data in four species: Saccharomyces cerevisiae, Escherichia coli, Arabidopsis thaliana and Solanum lycopersicum (tomato). The classifiers had reasonable performance in the well-studied yeast, but performed poorly in the other species. We showed that this poor performance can be considerably improved by adding edges predicted from various data sources, such as text mining, and that associations from the STRING database are more useful than interactions predicted by a neural network from sequence-based features.


Subject(s)
Arabidopsis Proteins , Arabidopsis , Escherichia coli Proteins , Escherichia coli , Molecular Sequence Annotation , Protein Interaction Maps/physiology , Saccharomyces cerevisiae Proteins , Saccharomyces cerevisiae , Solanum lycopersicum , Arabidopsis/genetics , Arabidopsis/metabolism , Arabidopsis Proteins/genetics , Arabidopsis Proteins/metabolism , Escherichia coli/genetics , Escherichia coli/metabolism , Escherichia coli Proteins/genetics , Escherichia coli Proteins/metabolism , Solanum lycopersicum/genetics , Solanum lycopersicum/metabolism , Saccharomyces cerevisiae/genetics , Saccharomyces cerevisiae/metabolism , Saccharomyces cerevisiae Proteins/genetics , Saccharomyces cerevisiae Proteins/metabolism
11.
Genes (Basel) ; 11(11)2020 10 27.
Article in English | MEDLINE | ID: mdl-33120976

ABSTRACT

The current rate at which new DNA and protein sequences are being generated is too fast to experimentally discover the functions of those sequences, emphasizing the need for accurate Automatic Function Prediction (AFP) methods. AFP has been an active and growing research field for decades and has made considerable progress in that time. However, it is certainly not solved. In this paper, we describe challenges that the AFP field still has to overcome in the future to increase its applicability. The challenges we consider are how to: (1) include condition-specific functional annotation, (2) predict functions for non-model species, (3) include new informative data sources, (4) deal with the biases of Gene Ontology (GO) annotations, and (5) maximally exploit the GO to obtain performance gains. We also provide recommendations for addressing those challenges, by adapting (1) the way we represent proteins and genes, (2) the way we represent gene functions, and (3) the algorithms that perform the prediction from gene to function. Together, we show that AFP is still a vibrant research area that can benefit from continuing advances in machine learning with which AFP in the 2020s can again take a large step forward reinforcing the power of computational biology.


Subject(s)
Computational Biology/methods , Gene Ontology , Molecular Sequence Annotation/methods , Proteins/metabolism , Algorithms , Amino Acid Sequence/genetics , Electronic Data Processing , Machine Learning , Models, Biological , Proteins/genetics
13.
IEEE J Biomed Health Inform ; 19(3): 1137-45, 2015 May.
Article in English | MEDLINE | ID: mdl-24951709

ABSTRACT

Valid characterization of carotid atherosclerosis (CA) is a crucial public health issue, which would limit the major risks held by CA for both patient safety and state economies. This paper investigated the unexplored potential of kinematic features in assisting the diagnostic decision for CA in the framework of a computer-aided diagnosis (CAD) tool. To this end, 15 CAD schemes were designed and were fed with a wide variety of kinematic features of the atherosclerotic plaque and the arterial wall adjacent to the plaque for 56 patients from two different hospitals. The CAD schemes were benchmarked in terms of their ability to discriminate between symptomatic and asymptomatic patients and the combination of the Fisher discriminant ratio, as a feature-selection strategy, and support vector machines, in the classification module, was revealed as the optimal motion-based CAD tool. The particular CAD tool was evaluated with several cross-validation strategies and yielded higher than 88% classification accuracy; the texture-based CAD performance in the same dataset was 80%. The incorporation of kinematic features of the arterial wall in CAD seems to have a particularly favorable impact on the performance of image-data-driven diagnosis for CA, which remains to be further elucidated in future prospective studies on large datasets.


Subject(s)
Carotid Arteries , Carotid Artery Diseases , Image Interpretation, Computer-Assisted/methods , Plaque, Atherosclerotic , Adult , Aged , Aged, 80 and over , Biomechanical Phenomena , Carotid Arteries/diagnostic imaging , Carotid Arteries/pathology , Carotid Arteries/physiopathology , Carotid Artery Diseases/diagnostic imaging , Carotid Artery Diseases/pathology , Databases, Factual , Humans , Middle Aged , Plaque, Atherosclerotic/diagnostic imaging , Plaque, Atherosclerotic/pathology , Ultrasonography
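
Record 13 names the Fisher discriminant ratio as the feature-selection strategy. A minimal sketch of one common two-class form, (mean_a - mean_b)^2 / (var_a + var_b), is given below; the abstract does not state which exact variant the authors used, so this form is an assumption, and the feature names and values are hypothetical.

```python
from statistics import mean, variance


def fisher_discriminant_ratio(class_a, class_b):
    """Two-class Fisher discriminant ratio of a single feature."""
    denominator = variance(class_a) + variance(class_b)  # sample variances
    if denominator == 0:
        raise ValueError("both classes have zero variance for this feature")
    return (mean(class_a) - mean(class_b)) ** 2 / denominator


def rank_features(features_a, features_b):
    """Order feature names from most to least discriminative."""
    scores = {
        name: fisher_discriminant_ratio(features_a[name], features_b[name])
        for name in features_a
    }
    return sorted(scores, key=scores.get, reverse=True)


# Hypothetical kinematic features for two patient groups.
symptomatic = {
    "radial_velocity": [4.0, 5.0, 6.0],
    "wall_strain": [1.0, 2.0, 3.0],
}
asymptomatic = {
    "radial_velocity": [1.0, 2.0, 3.0],
    "wall_strain": [1.5, 2.0, 2.5],
}

ranking = rank_features(symptomatic, asymptomatic)
top_score = fisher_discriminant_ratio(
    symptomatic["radial_velocity"], asymptomatic["radial_velocity"]
)
```

Features whose class means are far apart relative to their within-class spread score highest; the top-ranked features would then be passed to a classifier such as a support vector machine, as the abstract describes.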