Search | VHL Regional Portal

1.

MOSAIC: An Artificial Intelligence-Based Framework for Multimodal Analysis, Classification, and Personalized Prognostic Assessment in Rare Cancers.

D'Amico, Saverio; Dall'Olio, Lorenzo; Rollo, Cesare; Alonso, Patricia; Prada-Luengo, Iñigo; Dall'Olio, Daniele; Sala, Claudia; Sauta, Elisabetta; Asti, Gianluca; Lanino, Luca; Maggioni, Giulia; Campagna, Alessia; Zazzetti, Elena; Delleani, Mattia; Bicchieri, Maria Elena; Morandini, Pierandrea; Savevski, Victor; Arroyo, Borja; Parras, Juan; Zhao, Lin Pierre; Platzbecker, Uwe; Diez-Campelo, Maria; Santini, Valeria; Fenaux, Pierre; Haferlach, Torsten; Krogh, Anders; Zazo, Santiago; Fariselli, Piero; Sanavia, Tiziana; Della Porta, Matteo Giovanni; Castellani, Gastone.

JCO Clin Cancer Inform ; 8: e2400008, 2024 Jun.

Article in English | MEDLINE | ID: mdl-38875514

ABSTRACT

PURPOSE: Rare cancers constitute over 20% of human neoplasms, often affecting patients with unmet medical needs. The development of effective classification and prognostication systems is crucial to improve the decision-making process and drive innovative treatment strategies. We have created and implemented MOSAIC, an artificial intelligence (AI)-based framework designed for multimodal analysis, classification, and personalized prognostic assessment in rare cancers. Clinical validation was performed on myelodysplastic syndrome (MDS), a rare hematologic cancer with clinical and genomic heterogeneities. METHODS: We analyzed 4,427 patients with MDS divided into training and validation cohorts. Deep learning methods were applied to integrate and impute clinical/genomic features. Clustering was performed by combining Uniform Manifold Approximation and Projection for Dimension Reduction + Hierarchical Density-Based Spatial Clustering of Applications with Noise (UMAP + HDBSCAN) methods, compared with the conventional Hierarchical Dirichlet Process (HDP). Linear and AI-based nonlinear approaches were compared for survival prediction. Explainable AI (Shapley Additive Explanations approach [SHAP]) and federated learning were used to improve the interpretation and the performance of the clinical models, integrating them into distributed infrastructure. RESULTS: UMAP + HDBSCAN clustering obtained a more granular patient stratification, achieving a higher average silhouette coefficient (0.16) than HDP (0.01) and higher balanced accuracy in cluster classification by Random Forest (92.7% ± 1.3% and 85.8% ± 0.8%, respectively). AI methods for survival prediction outperformed conventional statistical techniques and the reference prognostic tool for MDS. Nonlinear Gradient Boosting Survival stood out in both the internal (Concordance-Index [C-Index], 0.77; SD, 0.01) and external validation (C-Index, 0.74; SD, 0.02). SHAP analysis revealed that similar features drove patients' subgroups and outcomes in both training and validation cohorts. Federated implementation improved the accuracy of developed models. CONCLUSION: MOSAIC provides an explainable and robust framework to optimize classification and prognostic assessment of rare cancers. AI-based approaches demonstrated superior accuracy in capturing genomic similarities and providing individual prognostic information compared with conventional statistical methods. Its federated implementation ensures broad clinical application, guaranteeing high performance and data protection.
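The abstract reports survival-model performance as a concordance index (C-Index). As a rough illustration of what that number measures, the following is a minimal sketch of Harrell's C-index for right-censored data. It is not taken from MOSAIC; the function name, the inputs, and the simplification of skipping pairs with tied follow-up times are assumptions made for this example.

```python
from itertools import combinations


def concordance_index(times, events, risk_scores):
    """Harrell's C-index: fraction of comparable patient pairs in which the
    patient with the shorter observed survival has the higher predicted risk.

    times        follow-up time per patient
    events       1 if the event (e.g. death) was observed, 0 if censored
    risk_scores  model output, higher means worse predicted prognosis
    """
    concordant = 0.0
    comparable = 0
    for i, j in combinations(range(len(times)), 2):
        # Order the pair so that `a` has the shorter follow-up time.
        a, b = (i, j) if times[i] <= times[j] else (j, i)
        # Simplification (assumption): pairs with tied times are skipped, and a
        # pair is only comparable if the earlier time is an observed event.
        if times[a] == times[b] or not events[a]:
            continue
        comparable += 1
        if risk_scores[a] > risk_scores[b]:
            concordant += 1.0
        elif risk_scores[a] == risk_scores[b]:
            concordant += 0.5  # ties in predicted risk count as half
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable


if __name__ == "__main__":
    times = [5, 10, 15, 20]
    events = [1, 1, 0, 1]
    print(concordance_index(times, events, [0.9, 0.7, 0.4, 0.2]))
```

A value of 0.5 corresponds to random ranking and 1.0 to perfect ranking, which is the scale on which the reported 0.77 and 0.74 should be read.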

Subject(s)

Artificial Intelligence , Precision Medicine , Humans , Prognosis , Precision Medicine/methods , Female , Rare Diseases/classification , Rare Diseases/genetics , Rare Diseases/diagnosis , Male , Deep Learning , Neoplasms/classification , Neoplasms/genetics , Neoplasms/diagnosis , Myelodysplastic Syndromes/diagnosis , Myelodysplastic Syndromes/classification , Myelodysplastic Syndromes/genetics , Myelodysplastic Syndromes/therapy , Algorithms , Middle Aged , Aged , Cluster Analysis

2.

N-of-one differential gene expression without control samples using a deep generative model.

Prada-Luengo, Iñigo; Schuster, Viktoria; Liang, Yuhu; Terkelsen, Thilde; Sora, Valentina; Krogh, Anders.

Genome Biol ; 24(1): 263, 2023 Nov 16.

Article in English | MEDLINE | ID: mdl-37974217

ABSTRACT

Differential analysis of bulk RNA-seq data often suffers from lack of good controls. Here, we present a generative model that replaces controls, trained solely on healthy tissues. The unsupervised model learns a low-dimensional representation and can identify the closest normal representation for a given disease sample. This enables control-free, single-sample differential expression analysis. In breast cancer, we demonstrate how our approach selects marker genes and outperforms a state-of-the-art method. Furthermore, significant genes identified by the model are enriched in driver genes across cancers. Our results show that the in silico closest normal provides a more favorable comparison than control samples.

Subject(s)

Learning , Machine Learning , RNA-Seq/methods , Gene Expression

3.

The Deep Generative Decoder: MAP estimation of representations improves modelling of single-cell RNA data.

Schuster, Viktoria; Krogh, Anders.

Bioinformatics ; 39(9), 2023 09 02.

Article in English | MEDLINE | ID: mdl-37572301

ABSTRACT

MOTIVATION: Learning low-dimensional representations of single-cell transcriptomics has become instrumental to its downstream analysis. The state of the art is currently represented by neural network models, such as variational autoencoders, which use a variational approximation of the likelihood for inference. RESULTS: We here present the Deep Generative Decoder (DGD), a simple generative model that computes model parameters and representations directly via maximum a posteriori estimation. The DGD handles complex parameterized latent distributions naturally unlike variational autoencoders, which typically use a fixed Gaussian distribution, because of the complexity of adding other types. We first show its general functionality on a commonly used benchmark set, Fashion-MNIST. Secondly, we apply the model to multiple single-cell datasets. Here, the DGD learns low-dimensional, meaningful, and well-structured latent representations with sub-clustering beyond the provided labels. The advantages of this approach are its simplicity and its capability to provide representations of much smaller dimensionality than a comparable variational autoencoder. AVAILABILITY AND IMPLEMENTATION: scDGD is available as a python package at https://github.com/Center-for-Health-Data-Science/scDGD. The remaining code is made available here: https://github.com/Center-for-Health-Data-Science/dgd.

Subject(s)

Neural Networks, Computer , RNA , Gene Expression Profiling , Probability , Normal Distribution , Single-Cell Analysis

4.

Synthetic Data Generation by Artificial Intelligence to Accelerate Research and Precision Medicine in Hematology.

D'Amico, Saverio; Dall'Olio, Daniele; Sala, Claudia; Dall'Olio, Lorenzo; Sauta, Elisabetta; Zampini, Matteo; Asti, Gianluca; Lanino, Luca; Maggioni, Giulia; Campagna, Alessia; Ubezio, Marta; Russo, Antonio; Bicchieri, Maria Elena; Riva, Elena; Tentori, Cristina A; Travaglino, Erica; Morandini, Pierandrea; Savevski, Victor; Santoro, Armando; Prada-Luengo, Iñigo; Krogh, Anders; Santini, Valeria; Kordasti, Shahram; Platzbecker, Uwe; Diez-Campelo, Maria; Fenaux, Pierre; Haferlach, Torsten; Castellani, Gastone; Della Porta, Matteo Giovanni.

JCO Clin Cancer Inform ; 7: e2300021, 2023 Jun.

Article in English | MEDLINE | ID: mdl-37390377

ABSTRACT

PURPOSE: Synthetic data are artificial data generated, without including any real patient information, by an algorithm trained to learn the characteristics of a real source data set; they have become widely used to accelerate research in life sciences. We aimed to (1) apply generative artificial intelligence to build synthetic data in different hematologic neoplasms; (2) develop a synthetic validation framework to assess data fidelity and privacy preservability; and (3) test the capability of synthetic data to accelerate clinical/translational research in hematology. METHODS: A conditional generative adversarial network architecture was implemented to generate synthetic data. Use cases were myelodysplastic syndromes (MDS) and AML: 7,133 patients were included. A fully explainable validation framework was created to assess fidelity and privacy preservability of synthetic data. RESULTS: We generated MDS/AML synthetic cohorts (including information on clinical features, genomics, treatment, and outcomes) with high fidelity and privacy performances. This technology allowed resolution of lack/incomplete information and data augmentation. We then assessed the potential value of synthetic data on accelerating research in hematology. Starting from 944 patients with MDS available since 2014, we generated a 300% augmented synthetic cohort and anticipated the development of the molecular classification and molecular scoring system obtained many years later from 2,043 and 2,957 real patients, respectively. Moreover, starting from 187 patients with MDS treated with luspatercept in a clinical trial, we generated a synthetic cohort that recapitulated all the clinical end points of the study. Finally, we developed a website to enable clinicians to generate high-quality synthetic data from an existing biobank of real patients. CONCLUSION: Synthetic data mimic real clinical-genomic features and outcomes, and anonymize patient information. The implementation of this technology makes it possible to increase the scientific use and value of real data, thus accelerating precision medicine in hematology and the conduct of clinical trials.

Subject(s)

Hematology , Leukemia, Myeloid, Acute , Humans , Precision Medicine , Artificial Intelligence , Algorithms

5.

Context dependent prediction in DNA sequence using neural networks.

Grønbæk, Christian; Liang, Yuhu; Elliott, Desmond; Krogh, Anders.

PeerJ ; 10: e13666, 2022.

Article in English | MEDLINE | ID: mdl-36157058

ABSTRACT

One way to better understand the structure in DNA is by learning to predict the sequence. Here, we trained a model to predict the missing base at any given position, given its left and right flanking contexts. Our best-performing model was a neural network that obtained an accuracy close to 54% on the human genome, which is 2 percentage points better than modelling the data using a Markov model. In likelihood-ratio tests, the neural network performed significantly better than any of the alternative models by a large margin. We report on where the accuracy was obtained, first observing that the performance appeared to be uniform over the chromosomes. The models performed best in repetitive sequences, as expected, although their performance was far from random in the more difficult coding sections, the proportions being ~70:40%. We further explored the sources of the accuracy: Fourier transforming the predictions revealed weak but clear periodic signals. In the human genome, the characteristic periods hinted at connections to nucleosome positioning. We found similar periodic signals in GC/AT content in the human genome, which to the best of our knowledge have not been reported before. On other large genomes, similarly high accuracy was found, while lower predictive accuracy was observed on smaller genomes. Only in the mouse genome did we see periodic signals in the same range as in the human genome, though weaker and of a different type. This indicates that the sources of these signals are other or more than nucleosome arrangement. Interestingly, applying a model trained on the mouse genome to the human genome resulted in a performance far below that of the human model, except in the difficult coding regions. Despite the clear outcomes of the likelihood-ratio tests, there is currently a limited superiority of the neural network methods over the Markov model. We expect, however, that there is great potential for better modelling DNA using different neural network architectures.
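The prediction task described in this abstract (guess the base at a position from its left and right flanks) can be illustrated with a simple count-based baseline. The sketch below is not the neural network or the Markov model from the paper; it only shows the shape of the task, and the function names, the flank size `k`, and the `default` fallback are hypothetical choices for this example.

```python
from collections import Counter, defaultdict


def train_context_model(sequence, k):
    """Count which base occurs between each observed pair of k-base flanks."""
    counts = defaultdict(Counter)
    for i in range(k, len(sequence) - k):
        left = sequence[i - k:i]
        right = sequence[i + 1:i + 1 + k]
        counts[(left, right)][sequence[i]] += 1
    return counts


def predict_base(counts, left, right, default="A"):
    """Return the most frequent central base for this context.

    Falls back to `default` for contexts never seen in training.
    """
    seen = counts.get((left, right))
    if not seen:
        return default
    return seen.most_common(1)[0][0]


def accuracy(counts, sequence, k):
    """Fraction of positions whose base is predicted correctly from its flanks."""
    positions = range(k, len(sequence) - k)
    if len(positions) == 0:
        raise ValueError("sequence too short for this flank size")
    hits = sum(
        predict_base(counts, sequence[i - k:i], sequence[i + 1:i + 1 + k]) == sequence[i]
        for i in positions
    )
    return hits / len(positions)


if __name__ == "__main__":
    genome = "ACGTACGTACGT"
    model = train_context_model(genome, k=1)
    print(predict_base(model, "A", "G"), accuracy(model, genome, k=1))
```

In practice the accuracy would be measured on held-out sequence rather than on the training sequence as in this toy run.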

Subject(s)

Neural Networks, Computer , Nucleosomes , Humans , Animals , Mice , Base Sequence , DNA/genetics , Genome, Human

6.

Correction to: Context dependency of nucleotide probabilities and variants in human DNA.

Liang, Yuhu; Grønbæk, Christian; Fariselli, Piero; Krogh, Anders.

BMC Genomics ; 23(1): 356, 2022 May 10.

Article in English | MEDLINE | ID: mdl-35538429

7.

Context dependency of nucleotide probabilities and variants in human DNA.

Liang, Yuhu; Grønbæk, Christian; Fariselli, Piero; Krogh, Anders.

BMC Genomics ; 23(1): 87, 2022 Jan 31.

Article in English | MEDLINE | ID: mdl-35100973

ABSTRACT

BACKGROUND: Genomic DNA has been shaped by mutational processes through evolution. The cellular machinery for error correction and repair has left its marks in the nucleotide composition along with structural and functional constraints. Therefore, the probability of observing a base in a certain position in the human genome is highly context-dependent. RESULTS: Here we develop context-dependent nucleotide models. We first investigate models of nucleotides conditioned on sequence context. We develop a bidirectional Markov model that uses an average of the probability from a Markov model applied to both strands of the sequence and thus depends on up to 14 bases to each side of the nucleotide. We show how the genome predictability varies across different types of genomic regions. Surprisingly, this model can predict a base from its context with an average of more than 50% accuracy. For somatic variants we show a tendency towards higher probability for the variant base than for the reference base. Inspired by DNA substitution models, we develop a model of mutability that estimates a mutation matrix (called the alpha matrix) on top of the nucleotide distribution. The alpha matrix can be estimated from a much smaller context than the nucleotide model, but the final model will still depend on the full context of the nucleotide model. With the bidirectional Markov model of order 14 and an alpha matrix dependent on just one base to each side, we obtain a model that compares well with a model of mutability that estimates mutation probabilities directly conditioned on three nucleotides to each side. For somatic variants in particular, our model fits better than the simpler model. Interestingly, the model is not very sensitive to the size of the context for the alpha matrix. CONCLUSIONS: Our study found strong context dependencies of nucleotides in the human genome. The best model uses a context of 14 nucleotides to each side. Based on these models, a substitution model was constructed that separates into the context model and a matrix dependent on a small context. The model fit somatic variants particularly well.
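The bidirectional Markov model is described as averaging the probability from a Markov model applied to both strands. One possible reading of that description is sketched below: a single order-k Markov model is trained on a sequence and its reverse complement, and the probability of a base is the mean of the forward-strand estimate (conditioned on the left flank) and the opposite-strand estimate (conditioned on the reverse-complemented right flank). The function names and the add-one pseudocount are assumptions for this example and are not taken from the paper.

```python
from collections import Counter, defaultdict

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    return seq.translate(COMPLEMENT)[::-1]


def train_markov(sequences, order):
    """Count next-base occurrences after every context of length `order`."""
    counts = defaultdict(Counter)
    for seq in sequences:
        for i in range(order, len(seq)):
            counts[seq[i - order:i]][seq[i]] += 1
    return counts


def markov_prob(counts, context, base, pseudocount=1.0):
    """P(base | context) with additive smoothing over the four nucleotides."""
    seen = counts.get(context, Counter())
    total = sum(seen.values()) + 4 * pseudocount
    return (seen[base] + pseudocount) / total


def bidirectional_prob(counts, seq, i, base, order):
    """Average of the forward-strand and opposite-strand Markov probabilities.

    Requires order <= i and i + order < len(seq) so both flanks are complete.
    """
    left = seq[i - order:i]
    right = seq[i + 1:i + 1 + order]
    forward = markov_prob(counts, left, base)
    # On the opposite strand, the reverse-complemented right flank precedes
    # the complemented base.
    backward = markov_prob(counts, reverse_complement(right), base.translate(COMPLEMENT))
    return 0.5 * (forward + backward)


if __name__ == "__main__":
    genome = "ACGTTGCAACGTTAGC"
    model = train_markov([genome, reverse_complement(genome)], order=2)
    for b in "ACGT":
        print(b, round(bidirectional_prob(model, genome, 5, b, order=2), 3))
```

Because each of the two conditional distributions sums to one over the four bases, so does their average. The paper's best model uses order 14; order 2 is used here only to keep the toy example small.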

Subject(s)

DNA , Nucleotides , DNA/genetics , Genome, Human , Genomics , Humans , Nucleotides/genetics , Probability

8.

A Manifold Learning Perspective on Representation Learning: Learning Decoder and Representations without an Encoder.

Schuster, Viktoria; Krogh, Anders.

Entropy (Basel) ; 23(11), 2021 Oct 25.

Article in English | MEDLINE | ID: mdl-34828101

ABSTRACT

Autoencoders are commonly used in representation learning. They consist of an encoder and a decoder, which provide a straightforward method to map n-dimensional data in input space to a lower m-dimensional representation space and back. The decoder itself defines an m-dimensional manifold in input space. Inspired by manifold learning, we showed that the decoder can be trained on its own by learning the representations of the training samples along with the decoder weights using gradient descent. A sum-of-squares loss then corresponds to optimizing the manifold to have the smallest Euclidean distance to the training samples, and similarly for other loss functions. We derived expressions for the number of samples needed to specify the encoder and decoder and showed that the decoder generally requires much fewer training samples to be well-specified compared to the encoder. We discuss the training of autoencoders in this perspective and relate it to previous work in the field that uses noisy training examples and other types of regularization. On the natural image data sets MNIST and CIFAR10, we demonstrated that the decoder is much better suited to learn a low-dimensional representation, especially when trained on small data sets. Using simulated gene regulatory data, we further showed that the decoder alone leads to better generalization and meaningful representations. Our approach of training the decoder alone facilitates representation learning even on small data sets and can lead to improved training of autoencoders. We hope that the simple analyses presented will also contribute to an improved conceptual understanding of representation learning.
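The central idea of this abstract, training a decoder without an encoder by treating each sample's representation as a free parameter, can be shown on a toy problem. The sketch below uses a linear decoder with a one-dimensional representation and plain gradient descent on a sum-of-squares loss. The decoder form, learning rate, step count, and function names are illustrative assumptions, not the setup used in the paper.

```python
import random


def reconstruction_loss(samples, w, z):
    """Sum of squared distances between samples and their decoded representations."""
    return sum(
        (w[d] * z[i] - x[d]) ** 2
        for i, x in enumerate(samples)
        for d in range(len(w))
    )


def train_decoder(samples, steps=2000, lr=0.01, seed=0):
    """Jointly learn decoder weights `w` and one scalar representation per sample.

    The decoder maps a representation z to input space as w * z, so it defines
    a line (a 1-dimensional manifold) through the origin.
    """
    rng = random.Random(seed)
    n = len(samples[0])
    w = [rng.uniform(-0.5, 0.5) for _ in range(n)]
    z = [rng.uniform(-0.5, 0.5) for _ in samples]
    for _ in range(steps):
        grad_w = [0.0] * n
        for i, x in enumerate(samples):
            residual = [w[d] * z[i] - x[d] for d in range(n)]
            grad_z = 2 * sum(residual[d] * w[d] for d in range(n))
            for d in range(n):
                grad_w[d] += 2 * residual[d] * z[i]  # uses z[i] before its update
            z[i] -= lr * grad_z
        w = [w[d] - lr * grad_w[d] for d in range(n)]
    return w, z


if __name__ == "__main__":
    # Points lying on the line x2 = 2 * x1.
    data = [(1.0, 2.0), (2.0, 4.0), (-1.0, -2.0), (0.5, 1.0)]
    w, z = train_decoder(data)
    print(w, reconstruction_loss(data, w, z))
```

To obtain the representation of a new sample with such a model, the decoder weights would be held fixed and only the new representation optimized, since there is no encoder to compute it directly.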

9.

High-throughput proteomics of breast cancer interstitial fluid: identification of tumor subtype-specific serologically relevant biomarkers.

Terkelsen, Thilde; Pernemalm, Maria; Gromov, Pavel; Børresen-Dale, Anna-Lise; Krogh, Anders; Haakensen, Vilde D; Lethiö, Janne; Papaleo, Elena; Gromova, Irina.

Mol Oncol ; 15(2): 429-461, 2021 02.

Article in English | MEDLINE | ID: mdl-33176066

ABSTRACT

Despite significant advancements in breast cancer (BC) research, clinicians lack robust serological protein markers for accurate diagnostics and tumor stratification. Tumor interstitial fluid (TIF) accumulates aberrantly externalized proteins within the local tumor space, which can potentially gain access to the circulatory system. As such, TIF may represent a valuable starting point for identifying relevant tumor-specific serological biomarkers. The aim of the study was to perform comprehensive proteomic profiling of TIF to identify proteins associated with BC tumor status and subtype. A liquid chromatography tandem mass spectrometry (LC-MS/MS) analysis of 35 TIFs of three main subtypes, luminal (19), Her2 (4), and triple-negative (TNBC) (12), resulted in the identification of > 8800 proteins. Unsupervised hierarchical clustering segregated the TIF proteome into two major clusters, luminal and TNBC/Her2 subgroups. High-grade tumors enriched with tumor infiltrating lymphocytes (TILs) were also stratified from low-grade tumors. A consensus analysis approach, including differential abundance analysis, selection operator regression, and random forest returned a minimal set of 24 proteins associated with BC subtypes, receptor status, and TIL scoring. Among them, a panel of 10 proteins, AGR3, BCAM, CELSR1, MIEN1, NAT1, PIP4K2B, SEC23B, THTPA, TMEM51, and ULBP2, was found to stratify the tumor subtype-specific TIFs. In particular, upregulation of BCAM and CELSR1 differentiates luminal subtypes, while upregulation of MIEN1 differentiates Her2 subtypes. Immunohistochemistry analysis showed a direct correlation between protein abundance in TIFs and intratumor expression levels for all 10 proteins. Sensitivity and specificity were estimated for this protein panel by using an independent, comprehensive breast tumor proteome dataset. The results of this analysis strongly support our data, with eight of the proteins potentially representing biomarkers for stratification of BC subtypes. Five of the most representative proteomics databases currently available were also used to estimate the potential for these selected proteins to serve as putative serological markers.

Subject(s)

Biomarkers, Tumor/metabolism , Breast Neoplasms/metabolism , Extracellular Fluid/metabolism , Neoplasm Proteins/metabolism , Proteomics , Chromatography, Liquid , Female , Humans , Lymphocytes, Tumor-Infiltrating/metabolism , Tandem Mass Spectrometry

10.

Taxonomic classification method for metagenomics based on core protein families with Core-Kaiju.

Tovo, Anna; Menzel, Peter; Krogh, Anders; Cosentino Lagomarsino, Marco; Suweis, Samir.

Nucleic Acids Res ; 48(16): e93, 2020 09 18.

Article in English | MEDLINE | ID: mdl-32633756

ABSTRACT

Characterizing species diversity and composition of bacteria hosted by biota is revolutionizing our understanding of the role of symbiotic interactions in ecosystems. Determining microbiome diversity implies the assignment of individual reads to taxa by comparison to reference databases. Although computational methods aimed at identifying microbial taxa are available, it is well known that inferences using different methods can vary widely depending on various biases. In this study, we first apply and compare different bioinformatics methods based on 16S ribosomal RNA gene and shotgun sequencing to three mock communities of bacteria, of which the compositions are known. We show that none of these methods can infer both the true number of taxa and their abundances. We thus propose a novel approach, named Core-Kaiju, which combines the power of shotgun metagenomics data with a more focused marker gene classification method similar to 16S, but based on emergent statistics of core protein domain families. We then test the proposed method on various mock communities and show that Core-Kaiju reliably predicts both the number of taxa and their abundances. Finally, we apply our method to human gut samples, showing how Core-Kaiju may give a more accurate ecological characterization and a fresh view on real microbiomes.

Subject(s)

Bacteria/classification , Gastrointestinal Microbiome/genetics , Metagenome , Metagenomics/methods , Phylogeny , RNA, Ribosomal, 16S/genetics , Bacteria/genetics , Computational Biology , DNA, Bacterial/genetics , Databases, Protein , Genetic Markers , Humans , Sequence Analysis, DNA

11.

Secreted breast tumor interstitial fluid microRNAs and their target genes are associated with triple-negative breast cancer, tumor grade, and immune infiltration.

Terkelsen, Thilde; Russo, Francesco; Gromov, Pavel; Haakensen, Vilde Drageset; Brunak, Søren; Gromova, Irina; Krogh, Anders; Papaleo, Elena.

Breast Cancer Res ; 22(1): 73, 2020 06 30.

Article in English | MEDLINE | ID: mdl-32605588

ABSTRACT

BACKGROUND: Studies on tumor-secreted microRNAs point to a functional role of these in cellular communication and reprogramming of the tumor microenvironment. Uptake of tumor-secreted microRNAs by neighboring cells may result in the silencing of mRNA targets and, in turn, modulation of the transcriptome. Studying miRNAs externalized from tumors could improve cancer patient diagnosis and disease monitoring and help to pinpoint which miRNA-gene interactions are central for tumor properties such as invasiveness and metastasis. METHODS: Using a bioinformatics approach, we analyzed the profiles of secreted tumor and normal interstitial fluid (IF) microRNAs from women with breast cancer (BC). We carried out differential abundance analysis (DAA) to obtain miRNAs that were enriched or depleted in IFs from patients with different clinical traits. Subsequently, miRNA family enrichment analysis was performed to assess whether any families were over-represented in the specific sets. We identified dysregulated genes in tumor tissues from the same cohort of patients and constructed weighted gene co-expression networks to extract sets of co-expressed genes and co-abundant miRNAs. Lastly, we integrated miRNAs and mRNAs to obtain interaction networks and supported our findings using prediction tools and cancer gene databases. RESULTS: Network analysis showed co-expressed genes and miRNA regulators associated with tumor lymphocyte infiltration. All of the genes were involved in immune system processes, and many had previously been associated with cancer immunity. A subset of these, BTLA, CXCL13, IL7R, LAMP3, and LTB, was linked to the presence of tertiary lymphoid structures and high endothelial venules within tumors. Co-abundant tumor interstitial fluid miRNAs within this network, including miR-146a and miR-494, were annotated as negative regulators of immune-stimulatory responses. One co-expression network encompassed differences between BC subtypes. Genes differentially co-expressed between luminal B and triple-negative breast cancer (TNBC) were connected with sphingolipid metabolism and predicted to be co-regulated by miR-23a. Co-expressed genes and TIF miRNAs associated with tumor grade were BTRC, CHST1, miR-10a/b, miR-107, miR-301a, and miR-454. CONCLUSION: Integration of IF miRNAs and mRNAs unveiled networks associated with patient clinicopathological traits and underlined molecular mechanisms specific to BC sub-groups. Our results highlight the benefits of an integrative approach to biomarker discovery, placing secreted miRNAs within a biological context.

Subject(s)

Lymphocytes, Tumor-Infiltrating/immunology , MicroRNAs/genetics , Triple Negative Breast Neoplasms/genetics , Biomarkers, Tumor/genetics , Biomarkers, Tumor/immunology , Extracellular Fluid/metabolism , Female , Follow-Up Studies , Gene Expression Profiling , Gene Regulatory Networks , Humans , Lymphocytes, Tumor-Infiltrating/metabolism , MicroRNAs/metabolism , Neoplasm Grading , Receptor, ErbB-2/metabolism , Receptors, Estrogen/metabolism , Receptors, Progesterone/metabolism , Triple Negative Breast Neoplasms/immunology , Triple Negative Breast Neoplasms/pathology , Tumor Microenvironment/genetics , Tumor Microenvironment/immunology

12.

CAncer bioMarker Prediction Pipeline (CAMPP)-A standardized framework for the analysis of quantitative biological data.

Terkelsen, Thilde; Krogh, Anders; Papaleo, Elena.

PLoS Comput Biol ; 16(3): e1007665, 2020 03.

Article in English | MEDLINE | ID: mdl-32176694

ABSTRACT

With the improvement of -omics and next-generation sequencing (NGS) methodologies, along with the lowered cost of generating these types of data, the analysis of high-throughput biological data has become standard both for forming and testing biomedical hypotheses. Our knowledge of how to normalize datasets to remove latent undesirable variances has grown extensively, making for standardized data that are easily compared between studies. Here we present the CAncer bioMarker Prediction Pipeline (CAMPP), an open-source R-based wrapper (https://github.com/ELELAB/CAncer-bioMarker-Prediction-Pipeline-CAMPP) intended to aid bioinformatic software-users with data analyses. CAMPP is called from a terminal command line and is supported by a user-friendly manual. The pipeline may be run on a local computer and requires little or no knowledge of programming. To avoid issues relating to R-package updates, an renv.lock file is provided to ensure R-package stability. Data-management includes missing value imputation, data normalization, and distributional checks. CAMPP performs (I) k-means clustering, (II) differential expression/abundance analysis, (III) elastic-net regression, (IV) correlation and co-expression network analyses, (V) survival analysis, and (VI) protein-protein/miRNA-gene interaction networks. The pipeline returns tabular files and graphical representations of the results. We hope that CAMPP will assist in streamlining bioinformatic analysis of quantitative biological data, whilst ensuring an appropriate bio-statistical framework.

Subject(s)

Biomarkers, Tumor/analysis , Computational Biology/methods , Neoplasms , Software , Cluster Analysis , Databases, Factual , Humans , Neoplasms/chemistry , Neoplasms/genetics , Neoplasms/mortality , User-Computer Interface

13.

Sensitive detection of circular DNAs at single-nucleotide resolution using guided realignment of partially aligned reads.

Prada-Luengo, Iñigo; Krogh, Anders; Maretty, Lasse; Regenberg, Birgitte.

BMC Bioinformatics ; 20(1): 663, 2019 Dec 12.

Article in English | MEDLINE | ID: mdl-31830908

ABSTRACT

BACKGROUND: Circular DNA has recently been identified across different species, including human normal and cancerous tissue, but short-read mappers are unable to align many of the reads crossing circle junctions, which limits their detection from short-read sequencing data. RESULTS: Here, we propose a new method, Circle-Map, that guides the realignment of partially aligned reads using information from discordantly mapped reads to map the short unaligned portions using a probabilistic model. We compared Circle-Map to similar up-to-date methods for circular DNA and RNA detection, and we demonstrate how the approach implemented in Circle-Map dramatically increases sensitivity for detection of circular DNA on both simulated and real data while retaining high precision. CONCLUSION: Circle-Map is an easy-to-use command line tool that implements the required pipeline to accurately detect circular DNA from circle-enriched next-generation sequencing experiments. Circle-Map is implemented in Python 3.6 and is freely available at https://github.com/iprada/Circle-Map.

Subject(s)

DNA, Circular/genetics , Nucleotides/genetics , Sequence Alignment/methods , Databases, Genetic , Humans , Software

14.

Genomic Analyses from Non-invasive Prenatal Testing Reveal Genetic Associations, Patterns of Viral Infections, and Chinese Population History.

Liu, Siyang; Huang, Shujia; Chen, Fang; Zhao, Lijian; Yuan, Yuying; Francis, Stephen Starko; Fang, Lin; Li, Zilong; Lin, Long; Liu, Rong; Zhang, Yong; Xu, Huixin; Li, Shengkang; Zhou, Yuwen; Davies, Robert W; Liu, Qiang; Walters, Robin G; Lin, Kuang; Ju, Jia; Korneliussen, Thorfinn; Yang, Melinda A; Fu, Qiaomei; Wang, Jun; Zhou, Lijun; Krogh, Anders; Zhang, Hongyun; Wang, Wei; Chen, Zhengming; Cai, Zhiming; Yin, Ye; Yang, Huanming; Mao, Mao; Shendure, Jay; Wang, Jian; Albrechtsen, Anders; Jin, Xin; Nielsen, Rasmus; Xu, Xun.

Cell ; 175(2): 347-359.e14, 2018 10 04.

Article in English | MEDLINE | ID: mdl-30290141

ABSTRACT

We analyze whole-genome sequencing data from 141,431 Chinese women generated for non-invasive prenatal testing (NIPT). We use these data to characterize the population genetic structure and to investigate genetic associations with maternal and infectious traits. We show that the present day distribution of alleles is a function of both ancient migration and very recent population movements. We reveal novel phenotype-genotype associations, including several replicated associations with height and BMI, an association between maternal age and EMB, and between twin pregnancy and NRG1. Finally, we identify a unique pattern of circulating viral DNA in plasma with high prevalence of hepatitis B and other clinically relevant maternal infections. A GWAS for viral infections identifies an exceptionally strong association between integrated herpesvirus 6 and MOV10L1, which affects piwi-interacting RNA (piRNA) processing and PIWI protein function. These findings demonstrate the great value and potential of accumulating NIPT data for worldwide medical and genetic analyses.

Subject(s)

Asian People/genetics , Prenatal Diagnosis/methods , Adult , Alleles , China , DNA/genetics , Ethnicity/genetics , Female , Gene Frequency/genetics , Genetic Testing , Genetic Variation/genetics , Genetics, Population/methods , Genome-Wide Association Study/methods , Genomics/methods , Human Migration , Humans , Pregnancy , Sequence Analysis, DNA

15.

Accurate genotyping across variant classes and lengths using variant graphs.

Sibbesen, Jonas Andreas; Maretty, Lasse; Krogh, Anders.

Nat Genet ; 50(7): 1054-1059, 2018 07.

Article in English | MEDLINE | ID: mdl-29915429

ABSTRACT

Genotype estimates from short-read sequencing data are typically based on the alignment of reads to a linear reference, but reads originating from more complex variants (for example, structural variants) often align poorly, resulting in biased genotype estimates. This bias can be mitigated by first collecting a set of candidate variants across discovery methods, individuals and databases, and then realigning the reads to the variants and reference simultaneously. However, this realignment problem has proved computationally difficult. Here, we present a new method (BayesTyper) that uses exact alignment of read k-mers to a graph representation of the reference and variants to efficiently perform unbiased, probabilistic genotyping across the variation spectrum. We demonstrate that BayesTyper generally provides superior variant sensitivity and genotyping accuracy relative to existing methods when used to integrate variants across discovery approaches and individuals. Finally, we demonstrate that including a 'variation-prior' database containing already known variants significantly improves sensitivity.
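The abstract describes genotyping by exact matching of read k-mers against the reference and candidate variants. The toy sketch below illustrates that general idea for a single biallelic variant: it collects k-mers unique to the reference or the alternative haplotype, counts exact matches in reads, and picks the genotype with the highest likelihood under a simple binomial model. This is not the BayesTyper model; the haplotype inputs, the error rate, and the three-genotype likelihood are assumptions made for illustration.

```python
import math


def kmers(seq, k):
    """Set of all k-mers in a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def allele_kmer_counts(reads, ref_hap, alt_hap, k):
    """Count read k-mers that exactly match allele-specific k-mers."""
    ref_only = kmers(ref_hap, k) - kmers(alt_hap, k)
    alt_only = kmers(alt_hap, k) - kmers(ref_hap, k)
    ref_count = alt_count = 0
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            if kmer in ref_only:
                ref_count += 1
            elif kmer in alt_only:
                alt_count += 1
    return ref_count, alt_count


def call_genotype(ref_count, alt_count, error=0.01):
    """Most likely diploid genotype given allele-specific k-mer counts.

    Each genotype implies an expected fraction of alt-supporting k-mers;
    `error` (an assumed value) keeps the homozygous likelihoods finite.
    With no informative k-mers, all likelihoods tie and "0/0" is returned.
    """
    expected_alt_fraction = {"0/0": error, "0/1": 0.5, "1/1": 1.0 - error}

    def log_likelihood(p):
        return alt_count * math.log(p) + ref_count * math.log(1.0 - p)

    return max(expected_alt_fraction, key=lambda g: log_likelihood(expected_alt_fraction[g]))


if __name__ == "__main__":
    ref_hap = "ACGTACGTTGCA"
    alt_hap = "ACGTACCTTGCA"  # single-base substitution relative to ref_hap
    reads = ["GTACGTTG", "GTACCTTG"]
    counts = allele_kmer_counts(reads, ref_hap, alt_hap, k=4)
    print(counts, call_genotype(*counts))
```

A real variant graph would represent many variants and haplotype paths at once; the two fixed haplotype strings here stand in for the two paths through a single variant site.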

Subject(s)

Genetic Variation/genetics , Genome, Human/genetics , Genotype , High-Throughput Nucleotide Sequencing/methods , Humans , Sequence Analysis, DNA/methods

16.

Sugar Metabolism of the First Thermophilic Planctomycete Thermogutta terrifontis: Comparative Genomic and Transcriptomic Approaches.

Elcheninov, Alexander G; Menzel, Peter; Gudbergsdottir, Soley R; Slesarev, Alexei I; Kadnikov, Vitaly V; Krogh, Anders; Bonch-Osmolovskaya, Elizaveta A; Peng, Xu; Kublanov, Ilya V.

Front Microbiol ; 8: 2140, 2017.

Article in English | MEDLINE | ID: mdl-29163426

ABSTRACT

Xanthan gum, a complex polysaccharide comprising glucose, mannose and glucuronic acid residues, is involved in numerous biotechnological applications in cosmetics, agriculture, pharmaceuticals, food and petroleum industries. Additionally, its oligosaccharides were shown to possess antimicrobial, antioxidant, and few other properties. Yet, despite its extensive usage, little is known about xanthan gum degradation pathways and mechanisms. Thermogutta terrifontis, isolated from a sample of microbial mat developed in a terrestrial hot spring of Kunashir island (Far-East of Russia), was described as the first thermophilic representative of the Planctomycetes phylum. It grows well on xanthan gum either at aerobic or anaerobic conditions. Genomic analysis unraveled the pathways of oligo- and polysaccharides utilization, as well as the mechanisms of aerobic and anaerobic respiration. The combination of genomic and transcriptomic approaches suggested a novel xanthan gum degradation pathway which involves novel glycosidase(s) of DUF1080 family, hydrolyzing xanthan gum backbone beta-glucosidic linkages and beta-mannosidases instead of xanthan lyases, catalyzing cleavage of terminal beta-mannosidic linkages. Surprisingly, the genes coding DUF1080 proteins were abundant in T. terrifontis and in many other Planctomycetes genomes, which, together with our observation that xanthan gum being a selective substrate for many planctomycetes, suggest crucial role of DUF1080 in xanthan gum degradation. Our findings shed light on the metabolism of the first thermophilic planctomycete, capable to degrade a number of polysaccharides, either aerobically or anaerobically, including the biotechnologically important bacterial polysaccharide xanthan gum.

17.

Erratum.

Fox, Keith A A; Gersh, Bernard J; Traore, Sory; John Camm, A; Kayani, Gloria; Krogh, Anders; Shweta, Shweta; Kakkar, Ajay K.

Eur Heart J Qual Care Clin Outcomes ; 3(4): 328, 2017 10 01.

Article in English | MEDLINE | ID: mdl-29044400

18.

Evolving quality standards for large-scale registries: the GARFIELD-AF experience.

Fox, Keith A A; Gersh, Bernard J; Traore, Sory; John Camm, A; Kayani, Gloria; Krogh, Anders; Shweta, Shweta; Kakkar, Ajay K.

Eur Heart J Qual Care Clin Outcomes ; 3(2): 114-122, 2017 04 01.

Article in English | MEDLINE | ID: mdl-28927171

ABSTRACT

Aims: Registries have the potential to capture treatment practices and outcomes in populations beyond the constraints of clinical trial settings. The value of data obtained depend critically upon robust quality standards (including source data verification [SDV] and training); features that are commonly absent from registries. This article outlines the quality standards developed for Global Anticoagulant Registry in the FIELD-Atrial Fibrillation (GARFIELD-AF). Methods and Results: GARFIELD-AF comprises ~57 000 patients prospectively recruited over 6.5 years in 35 countries in five successive cohorts. The registry employs a combination of remote and onsite monitoring to ascertain completeness and accuracy of records and by design, SDV is performed on 20% of cases (i.e. ~11 400 patients). Four performance measures for ranking sites according to data quality and other performance indicators were evaluated (including data quality for 13 quantifiable variables, late data locking, number of missing critical variables, and history of poor data quality from the previous monitoring phase). These criteria facilitated the identification of sites with potentially suboptimal data quality for onsite monitoring. During early phases of the registry, critical variables for data checking were also identified. SDV using these variables (partial SDV in 902 patients) showed similar concordance to SDV of all fields (110 patients): 94.4% vs. 93.1%, respectively. This standard formed the baseline against which ongoing quality improvements were assessed, facilitating corrective action on data quality issues. In consequence, concordance was improved in the next monitoring phase (95.6%; n = 1172). Conclusion: The quality standards in GARFIELD-AF have the potential to inform a future 'reference' for registries.

Subject(s)

Anticoagulants/therapeutic use , Atrial Fibrillation/drug therapy , Data Accuracy , Registries/standards , Stroke/prevention & control , Humans , Prospective Studies , Risk Factors

19.

Sequencing and de novo assembly of 150 genomes from Denmark as a population reference.

Maretty, Lasse; Jensen, Jacob Malte; Petersen, Bent; Sibbesen, Jonas Andreas; Liu, Siyang; Villesen, Palle; Skov, Laurits; Belling, Kirstine; Theil Have, Christian; Izarzugaza, Jose M G; Grosjean, Marie; Bork-Jensen, Jette; Grove, Jakob; Als, Thomas D; Huang, Shujia; Chang, Yuqi; Xu, Ruiqi; Ye, Weijian; Rao, Junhua; Guo, Xiaosen; Sun, Jihua; Cao, Hongzhi; Ye, Chen; van Beusekom, Johan; Espeseth, Thomas; Flindt, Esben; Friborg, Rune M; Halager, Anders E; Le Hellard, Stephanie; Hultman, Christina M; Lescai, Francesco; Li, Shengting; Lund, Ole; Løngren, Peter; Mailund, Thomas; Matey-Hernandez, Maria Luisa; Mors, Ole; Pedersen, Christian N S; Sicheritz-Pontén, Thomas; Sullivan, Patrick; Syed, Ali; Westergaard, David; Yadav, Rachita; Li, Ning; Xu, Xun; Hansen, Torben; Krogh, Anders; Bolund, Lars; Sørensen, Thorkild I A; Pedersen, Oluf.

Nature ; 548(7665): 87-91, 2017 08 03.

Article in English | MEDLINE | ID: mdl-28746312

ABSTRACT

Hundreds of thousands of human genomes are now being sequenced to characterize genetic variation and use this information to augment association mapping studies of complex disorders and other phenotypic traits. Genetic variation is identified mainly by mapping short reads to the reference genome or by performing local assembly. However, these approaches are biased against discovery of structural variants and variation in the more complex parts of the genome. Hence, large-scale de novo assembly is needed. Here we show that it is possible to construct excellent de novo assemblies from high-coverage sequencing with mate-pair libraries extending up to 20 kilobases. We report de novo assemblies of 150 individuals (50 trios) from the GenomeDenmark project. The quality of these assemblies is similar to those obtained using the more expensive long-read technology. We use the assemblies to identify a rich set of structural variants including many novel insertions and demonstrate how this variant catalogue enables further deciphering of known association mapping signals. We leverage the assemblies to provide 100 completely resolved major histocompatibility complex haplotypes and to resolve major parts of the Y chromosome. Our study provides a regional reference genome that we expect will improve the power of future association mapping studies and hence pave the way for precision medicine initiatives, which now are being launched in many countries including Denmark.

Subject(s)

Genetic Variation/genetics , Genetics, Population/standards , Genome, Human/genetics , Genomics/standards , Sequence Analysis, DNA/standards , Adult , Alleles , Child , Chromosomes, Human, Y/genetics , Denmark , Female , Haplotypes/genetics , Humans , Major Histocompatibility Complex/genetics , Male , Maternal Age , Mutation Rate , Paternal Age , Point Mutation/genetics , Reference Standards

20.

Highly accessible AU-rich regions in 3' untranslated regions are hotspots for binding of regulatory factors.

Plass, Mireya; Rasmussen, Simon H; Krogh, Anders.

PLoS Comput Biol ; 13(4): e1005460, 2017 04.

Article in English | MEDLINE | ID: mdl-28410363

ABSTRACT

Post-transcriptional regulation is regarded as one of the major processes involved in the regulation of gene expression. It is mainly performed by RNA binding proteins and microRNAs, which target RNAs and typically affect their stability. Recent efforts from the scientific community have aimed at understanding post-transcriptional regulation at a global scale by using high-throughput sequencing techniques such as cross-linking and immunoprecipitation (CLIP), which facilitates identification of binding sites of these regulatory factors. However, the diversity in the experimental procedures and bioinformatics analyses has hindered the integration of multiple datasets and thus limited the development of an integrated view of post-transcriptional regulation. In this work, we have performed a comprehensive analysis of 107 CLIP datasets from 49 different RBPs in HEK293 cells to shed light on the complex interactions that govern post-transcriptional regulation. By developing a more stringent CLIP analysis pipeline we have discovered the existence of conserved regulatory AU-rich regions in the 3'UTRs where miRNAs and RBPs that regulate several processes such as polyadenylation or mRNA stability bind. Analogous to promoters, many factors have binding sites overlapping or in close proximity in these hotspots and hence the regulation of the mRNA may depend on their relative concentrations. This hypothesis is supported by RBP knockdown experiments that alter the relative concentration of RBPs in the cell. Upon AGO2 knockdown (KD), transcripts containing "free" target sites show increased expression levels compared to those containing target sites in hotspots, which suggests that target sites within hotspots are less available for miRNAs to bind. Interestingly, these hotspots appear enriched in genes with regulatory functions such as DNA binding and RNA binding. Taken together, our results suggest that hotspots are functional regulatory elements that define an extra layer of regulation of post-transcriptional regulatory networks.

Subject(s)

3' Untranslated Regions/genetics , Binding Sites/genetics , MicroRNAs/genetics , RNA-Binding Proteins/genetics , Computational Biology , HEK293 Cells , Humans , Immunoprecipitation , MicroRNAs/metabolism , Polyadenylation/genetics , RNA-Binding Proteins/metabolism
