RESUMO
Cancer results from an evolutionary process that typically yields multiple clones with varying sets of mutations within the same tumor. Accurately modeling this process is key to understanding and predicting cancer evolution. Here, we introduce clone to mutation (CloMu), a flexible and low-parameter tree generative model of cancer evolution. CloMu uses a two-layer neural network trained via reinforcement learning to determine the probability of new mutations based on the existing mutations on a clone. CloMu supports several prediction tasks, including the determination of evolutionary trajectories, tree selection, causality and interchangeability between mutations, and mutation fitness. Importantly, previous methods support only some of these tasks, and many suffer from overfitting on data sets with a large number of mutations. Using simulations, we show that CloMu either matches or outperforms current methods on a wide variety of prediction tasks. In particular, for simulated data with interchangeable mutations, current methods are unable to uncover causal relationships as effectively as CloMu. On breast cancer and leukemia cohorts, we show that CloMu determines similarities and causal relationships between mutations as well as the fitness of mutations. We validate CloMu's inferred mutation fitness values for the leukemia cohort by comparing them to clonal proportion data not used during training, showing high concordance. In summary, CloMu's low-parameter model facilitates a wide range of prediction tasks regarding cancer evolution on increasingly available cohort-level data sets.
Assuntos
Leucemia , Neoplasias , Humanos , Neoplasias/genética , Mutação , Evolução Clonal/genética , Redes Neurais de ComputaçãoRESUMO
Emerging ultra-low coverage single-cell DNA sequencing (scDNA-seq) technologies have enabled high resolution evolutionary studies of copy number aberrations (CNAs) within tumors. While these sequencing technologies are well suited for identifying CNAs due to the uniformity of sequencing coverage, the sparsity of coverage poses challenges for the study of single-nucleotide variants (SNVs). In order to maximize the utility of increasingly available ultra-low coverage scDNA-seq data and obtain a comprehensive understanding of tumor evolution, it is important to also analyze the evolution of SNVs from the same set of tumor cells. We present Phertilizer, a method to infer a clonal tree from ultra-low coverage scDNA-seq data of a tumor. Based on a probabilistic model, our method recursively partitions the data by identifying key evolutionary events in the history of the tumor. We demonstrate the performance of Phertilizer on simulated data as well as on two real datasets, finding that Phertilizer effectively utilizes the copy-number signal inherent in the data to more accurately uncover clonal structure and genotypes compared to previous methods.
Assuntos
Neoplasias , Árvores , Humanos , Variações do Número de Cópias de DNA/genética , Neoplasias/genética , Análise de Sequência de DNA , Sequenciamento de Nucleotídeos em Larga Escala/métodos , Análise de Célula ÚnicaRESUMO
Transcription regulatory sequences (TRSs), which occur upstream of structural and accessory genes as well as the 5' end of a coronavirus genome, play a critical role in discontinuous transcription in coronaviruses. We introduce two problems collectively aimed at identifying these regulatory sequences as well as their associated genes. First, we formulate the TRS Identification problem of identifying TRS sites in a coronavirus genome sequence with prescribed gene locations. We introduce CORSID-A, an algorithm that solves this problem to optimality in polynomial time. We demonstrate that CORSID-A outperforms existing motif-based methods in identifying TRS sites in coronaviruses. Second, we demonstrate for the first time how TRS sites can be leveraged to identify gene locations in the coronavirus genome. To that end, we formulate the TRS and Gene Identification problem of simultaneously identifying TRS sites and gene locations in unannotated coronavirus genomes. We introduce CORSID to solve this problem, which includes a web-based visualization tool to explore the space of near-optimal solutions. We show that CORSID outperforms state-of-the-art gene finding methods in coronavirus genomes. Furthermore, we demonstrate that CORSID enables de novo identification of TRS sites and genes in previously unannotated coronavirus genomes. CORSID is the first method to perform accurate and simultaneous identification of TRS sites and genes in coronavirus genomes without the use of any prior information.
Assuntos
Infecções por Coronavirus , Coronavirus , Coronavirus/genética , Infecções por Coronavirus/genética , Humanos , RNA Mensageiro/genética , RNA Viral/genética , Transcrição GênicaRESUMO
Copy-number aberrations (CNAs) are genetic alterations that amplify or delete the number of copies of large genomic segments. Although they are ubiquitous in cancer and, thus, a critical area of current cancer research, CNA identification from DNA sequencing data is challenging because it requires partitioning of the genome into complex segments with the same copy-number states that may not be contiguous. Existing segmentation algorithms address these challenges either by leveraging the local information among neighboring genomic regions, or by globally grouping genomic regions that are affected by similar CNAs across the entire genome. However, both approaches have limitations: overclustering in the case of local segmentation, or the omission of clusters corresponding to focal CNAs in the case of global segmentation. Importantly, inaccurate segmentation will lead to inaccurate identification of CNAs. For this reason, most pan-cancer research studies rely on manual procedures of quality control and anomaly correction. To improve copy-number segmentation, we introduce CNAViz, a web-based tool that enables the user to simultaneously perform local and global segmentation, thus overcoming the limitations of each approach. Using simulated data, we demonstrate that by several metrics, CNAViz allows the user to obtain more accurate segmentation relative to existing local and global segmentation methods. Moreover, we analyze six bulk DNA sequencing samples from three breast cancer patients. By validating with parallel single-cell DNA sequencing data from the same samples, we show that by using CNAViz, our user was able to obtain more accurate segmentation and improved accuracy in downstream copy-number calling.
Assuntos
Neoplasias da Mama , Neoplasias , Humanos , Feminino , Variações do Número de Cópias de DNA/genética , Neoplasias/genética , Algoritmos , Análise de Sequência de DNA , DNA de Neoplasias , Neoplasias da Mama/genéticaRESUMO
An Online tool for Fragment-based Molecule Parametrization (OFraMP) is described. OFraMP is a web application for assigning atomic interaction parameters to large molecules by matching sub-fragments within the target molecule to equivalent sub-fragments within the Automated Topology Builder (ATB, atb.uq.edu.au) database. OFraMP identifies and compares alternative molecular fragments from the ATB database, which contains over 890,000 pre-parameterized molecules, using a novel hierarchical matching procedure. Atoms are considered within the context of an extended local environment (buffer region) with the degree of similarity between an atom in the target molecule and that in the proposed match controlled by varying the size of the buffer region. Adjacent matching atoms are combined into progressively larger matched sub-structures. The user then selects the most appropriate match. OFraMP also allows users to manually alter interaction parameters and automates the submission of missing substructures to the ATB in order to generate parameters for atoms in environments not represented in the existing database. The utility of OFraMP is illustrated using the anti-cancer agent paclitaxel and a dendrimer used in organic semiconductor devices. OFraMP applied to paclitaxel (ATB ID 35922).
Assuntos
Software , Bases de Dados FactuaisRESUMO
MOTIVATION: While single-cell DNA sequencing (scDNA-seq) has enabled the study of intratumor heterogeneity at an unprecedented resolution, current technologies are error-prone and often result in doublets where two or more cells are mistaken for a single cell. Not only do doublets confound downstream analyses, but the increase in doublet rate is also a major bottleneck preventing higher throughput with current single-cell technologies. Although doublet detection and removal are standard practice in scRNA-seq data analysis, options for scDNA-seq data are limited. Current methods attempt to detect doublets while also performing complex downstream analyses tasks, leading to decreased efficiency and/or performance. RESULTS: We present doubletD, the first standalone method for detecting doublets in scDNA-seq data. Underlying our method is a simple maximum likelihood approach with a closed-form solution. We demonstrate the performance of doubletD on simulated data as well as real datasets, outperforming current methods for downstream analysis of scDNA-seq data that jointly infer doublets as well as standalone approaches for doublet detection in scRNA-seq data. Incorporating doubletD in scDNA-seq analysis pipelines will reduce complexity and lead to more accurate results. AVAILABILITY AND IMPLEMENTATION: https://github.com/elkebir-group/doubletD. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
Assuntos
Análise de Célula Única , Software , Perfilação da Expressão Gênica , Funções Verossimilhança , Análise de Sequência de DNA , Análise de Sequência de RNARESUMO
Monitoring the prevalence of SARS-CoV-2 variants is necessary to make informed public health decisions during the COVID-19 pandemic. PCR assays have received global attention, facilitating a rapid understanding of variant dynamics because they are more accessible and scalable than genome sequencing. However, as PCR assays target only a few mutations, their accuracy could be reduced when these mutations are not exclusive to the target variants. Here we introduce PRIMES, an algorithm that evaluates the sensitivity and specificity of SARS-CoV-2 variant-specific PCR assays across different geographical regions by incorporating sequences deposited in the GISAID database. Using PRIMES, we determined that the accuracy of several PCR assays decreased when applied beyond the geographic scope of the study in which the assays were developed. Subsequently, we used this tool to design Alpha and Delta variant-specific PCR assays for samples from Illinois, USA. In silico analysis using PRIMES determined the sensitivity/specificity to be 0.99/0.99 for the Alpha variant-specific PCR assay and 0.98/1.00 for the Delta variant-specific PCR assay in Illinois, respectively. We applied these two variant-specific PCR assays to six local sewage samples and determined the dominant SARS-CoV-2 variant of either the wild type, the Alpha variant, or the Delta variant. Using next-generation sequencing (NGS) of the spike (S) gene amplicons of the Delta variant-dominant samples, we found six mutations exclusive to the Delta variant (S:T19R, S:Δ156/157, S:L452R, S:T478K, S:P681R, and S:D950N). The consistency between the variant-specific PCR assays and the NGS results supports the applicability of PRIMES. IMPORTANCE Monitoring the introduction and prevalence of variants of concern (VOCs) and variants of interest (VOIs) in a community can help the local authorities make informed public health decisions. PCR assays can be designed to keep track of SARS-CoV-2 variants by measuring unique mutation markers that are exclusive to the target variants. However, the mutation markers may not be exclusive to the target variants because of regional and temporal differences in variant dynamics. We introduce PRIMES, an algorithm that enables the design of reliable PCR assays for variant detection. Because PCR is more accessible, scalable, and robust for sewage samples than sequencing technology, our findings will contribute to improving global SARS-CoV-2 variant surveillance.
Assuntos
COVID-19 , SARS-CoV-2 , COVID-19/diagnóstico , COVID-19/epidemiologia , Humanos , Mutação , Pandemias , Reação em Cadeia da Polimerase , SARS-CoV-2/genética , EsgotosRESUMO
MOTIVATION: Cancer is caused by the accumulation of somatic mutations that lead to the formation of distinct populations of cells, called clones. The resulting clonal architecture is the main cause of relapse and resistance to treatment. With decreasing costs in DNA sequencing technology, rich cancer genomics datasets with many spatial sequencing samples are becoming increasingly available, enabling the inference of high-resolution tumor clones and prevalences across different spatial coordinates. While temporal and phylogenetic aspects of tumor evolution, such as clonal evolution over time and clonal response to treatment, are commonly visualized in various clonal evolution diagrams, visual analytics methods that reveal the spatial clonal architecture are missing. RESULTS: This article introduces ClonArch, a web-based tool to interactively visualize the phylogenetic tree and spatial distribution of clones in a single tumor mass. ClonArch uses the marching squares algorithm to draw closed boundaries representing the presence of clones in a real or simulated tumor. ClonArch enables researchers to examine the spatial clonal architecture of a subset of relevant mutations at different prevalence thresholds and across multiple phylogenetic trees. In addition to simulated tumors with varying number of biopsies, we demonstrate the use of ClonArch on a hepatocellular carcinoma tumor with â¼280 sequencing biopsies. ClonArch provides an automated way to interactively examine the spatial clonal architecture of a tumor, facilitating clinical and biological interpretations of the spatial aspects of intra-tumor heterogeneity. AVAILABILITY AND IMPLEMENTATION: https://github.com/elkebir-group/ClonArch.
Assuntos
Neoplasias , Software , Evolução Clonal/genética , Humanos , Mutação , Neoplasias/genética , Filogenia , Análise de Sequência de DNARESUMO
MOTIVATION: The combination of genomic and epidemiological data holds the potential to enable accurate pathogen transmission history inference. However, the inference of outbreak transmission histories remains challenging due to various factors such as within-host pathogen diversity and multi-strain infections. Current computational methods ignore within-host diversity and/or multi-strain infections, often failing to accurately infer the transmission history. Thus, there is a need for efficient computational methods for transmission tree inference that accommodate the complexities of real data. RESULTS: We formulate the direct transmission inference (DTI) problem for inferring transmission trees that support multi-strain infections given a timed phylogeny and additional epidemiological data. We establish hardness for the decision and counting version of the DTI problem. We introduce Transmission Tree Uniform Sampler (TiTUS), a method that uses SATISFIABILITY to almost uniformly sample from the space of transmission trees. We introduce criteria that prioritize parsimonious transmission trees that we subsequently summarize using a novel consensus tree approach. We demonstrate TiTUS's ability to accurately reconstruct transmission trees on simulated data as well as a documented HIV transmission chain. AVAILABILITY AND IMPLEMENTATION: https://github.com/elkebir-group/TiTUS. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
Assuntos
Algoritmos , Genômica , Surtos de Doenças , FilogeniaRESUMO
MOTIVATION: While each cancer is the result of an isolated evolutionary process, there are repeated patterns in tumorigenesis defined by recurrent driver mutations and their temporal ordering. Such repeated evolutionary trajectories hold the potential to improve stratification of cancer patients into subtypes with distinct survival and therapy response profiles. However, current cancer phylogeny methods infer large solution spaces of plausible evolutionary histories from the same sequencing data, obfuscating repeated evolutionary patterns. RESULTS: To simultaneously resolve ambiguities in sequencing data and identify cancer subtypes, we propose to leverage common patterns of evolution found in patient cohorts. We first formulate the Multiple Choice Consensus Tree problem, which seeks to select a tumor tree for each patient and assign patients into clusters in such a way that maximizes consistency within each cluster of patient trees. We prove that this problem is NP-hard and develop a heuristic algorithm, Revealing Evolutionary Consensus Across Patients (RECAP), to solve this problem in practice. Finally, on simulated data, we show RECAP outperforms existing methods that do not account for patient subtypes. We then use RECAP to resolve ambiguities in patient trees and find repeated evolutionary trajectories in lung and breast cancer cohorts. AVAILABILITY AND IMPLEMENTATION: https://github.com/elkebir-group/RECAP. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
Assuntos
Algoritmos , Neoplasias da Mama , Carcinogênese , Consenso , Humanos , FilogeniaRESUMO
The combination of bulk and single-cell DNA sequencing data of the same tumor enables the inference of high-fidelity phylogenies that form the input to many important downstream analyses in cancer genomics. While many studies simultaneously perform bulk and single-cell sequencing, some studies have analyzed initial bulk data to identify which mutations to target in a follow-up single-cell sequencing experiment, thereby decreasing cost. Bulk data provide an additional untapped source of valuable information, composed of candidate phylogenies and associated clonal prevalence. Here, we introduce PhyDOSE, a method that uses this information to strategically optimize the design of follow-up single cell experiments. Underpinning our method is the observation that only a small number of clones uniquely distinguish one candidate tree from all other trees. We incorporate distinguishing features into a probabilistic model that infers the number of cells to sequence so as to confidently reconstruct the phylogeny of the tumor. We validate PhyDOSE using simulations and a retrospective analysis of a leukemia patient, concluding that PhyDOSE's computed number of cells resolves tree ambiguity even in the presence of typical single-cell sequencing errors. We also conduct a retrospective analysis on an acute myeloid leukemia cohort, demonstrating the potential to achieve similar results with a significant reduction in the number of cells sequenced. In a prospective analysis, we demonstrate the advantage of selecting cells to sequence across multiple biopsies and that only a small number of cells suffice to disambiguate the solution space of trees in a recent lung cancer cohort. In summary, PhyDOSE proposes cost-efficient single-cell sequencing experiments that yield high-fidelity phylogenies, which will improve downstream analyses aimed at deepening our understanding of cancer biology.
Assuntos
Biologia Computacional/métodos , Neoplasias/genética , Análise de Célula Única/métodos , Algoritmos , Evolução Molecular , Genoma/genética , Sequenciamento de Nucleotídeos em Larga Escala , Humanos , Neoplasias/classificação , Filogenia , Estudos Retrospectivos , Análise de Sequência de DNARESUMO
MOTIVATION: Cancer phylogenies are key to studying tumorigenesis and have clinical implications. Due to the heterogeneous nature of cancer and limitations in current sequencing technology, current cancer phylogeny inference methods identify a large solution space of plausible phylogenies. To facilitate further downstream analyses, methods that accurately summarize such a set T of cancer phylogenies are imperative. However, current summary methods are limited to a single consensus tree or graph and may miss important topological features that are present in different subsets of candidate trees. RESULTS: We introduce the Multiple Consensus Tree (MCT) problem to simultaneously cluster T and infer a consensus tree for each cluster. We show that MCT is NP-hard, and present an exact algorithm based on mixed integer linear programming (MILP). In addition, we introduce a heuristic algorithm that efficiently identifies high-quality consensus trees, recovering all optimal solutions identified by the MILP in simulated data at a fraction of the time. We demonstrate the applicability of our methods on both simulated and real data, showing that our approach selects the number of clusters depending on the complexity of the solution space T. AVAILABILITY AND IMPLEMENTATION: https://github.com/elkebir-group/MCT. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
Assuntos
Algoritmos , Neoplasias , Filogenia , Consenso , Humanos , Análise de SequênciaRESUMO
Motivation: Cancer is characterized by intra-tumor heterogeneity, the presence of distinct cell populations with distinct complements of somatic mutations, which include single-nucleotide variants (SNVs) and copy-number aberrations (CNAs). Single-cell sequencing technology enables one to study these cell populations at single-cell resolution. Phylogeny estimation algorithms that employ appropriate evolutionary models are key to understanding the evolutionary mechanisms behind intra-tumor heterogeneity. Results: We introduce Single-cell Phylogeny Reconstruction (SPhyR), a method for tumor phylogeny estimation from single-cell sequencing data. In light of frequent loss of SNVs due to CNAs in cancer, SPhyR employs the k-Dollo evolutionary model, where a mutation can only be gained once but lost k times. Underlying SPhyR is a novel combinatorial characterization of solutions as constrained integer matrix completions, based on a connection to the cladistic multi-state perfect phylogeny problem. SPhyR outperforms existing methods on simulated data and on a metastatic colorectal cancer. Availability and implementation: SPhyR is available on https://github.com/elkebir-group/SPhyR. Supplementary information: Supplementary data are available at Bioinformatics online.
Assuntos
Neoplasias/genética , Filogenia , Algoritmos , Humanos , Mutação , Análise de Sequência , Análise de Célula ÚnicaRESUMO
MOTIVATION: The human microbiome plays a key role in health and disease. Thanks to comparative metatranscriptomics, the cellular functions that are deregulated by the microbiome in disease can now be computationally explored. Unlike gene-centric approaches, pathway-based methods provide a systemic view of such functions; however, they typically consider each pathway in isolation and in its entirety. They can therefore overlook the key differences that (i) span multiple pathways, (ii) contain bidirectionally deregulated components, (iii) are confined to a pathway region. To capture these properties, computational methods that reach beyond the scope of predefined pathways are needed. RESULTS: By integrating an existing module discovery algorithm into comparative metatranscriptomic analysis, we developed metaModules, a novel computational framework for automated identification of the key functional differences between health- and disease-associated communities. Using this framework, we recovered significantly deregulated subnetworks that were indeed recognized to be involved in two well-studied, microbiome-mediated oral diseases, such as butanoate production in periodontal disease and metabolism of sugar alcohols in dental caries. More importantly, our results indicate that our method can be used for hypothesis generation based on automated discovery of novel, disease-related functional subnetworks, which would otherwise require extensive and laborious manual assessment. AVAILABILITY AND IMPLEMENTATION: metaModules is available at https://bitbucket.org/alimay/metamodules/ CONTACT: a.may@vu.nl or s.abeln@vu.nl SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
Assuntos
Microbiota , Algoritmos , Cárie Dentária , HumanosRESUMO
MOTIVATION: DNA sequencing of multiple samples from the same tumor provides data to analyze the process of clonal evolution in the population of cells that give rise to a tumor. RESULTS: We formalize the problem of reconstructing the clonal evolution of a tumor using single-nucleotide mutations as the variant allele frequency (VAF) factorization problem. We derive a combinatorial characterization of the solutions to this problem and show that the problem is NP-complete. We derive an integer linear programming solution to the VAF factorization problem in the case of error-free data and extend this solution to real data with a probabilistic model for errors. The resulting AncesTree algorithm is better able to identify ancestral relationships between individual mutations than existing approaches, particularly in ultra-deep sequencing data when high read counts for mutations yield high confidence VAFs. AVAILABILITY AND IMPLEMENTATION: An implementation of AncesTree is available at: http://compbio.cs.brown.edu/software.
Assuntos
Algoritmos , Evolução Clonal/genética , Sequenciamento de Nucleotídeos em Larga Escala/métodos , Mutação/genética , Neoplasias/classificação , Neoplasias/genética , Análise de Sequência de DNA/métodos , Frequência do Gene , Humanos , Modelos EstatísticosRESUMO
MOTIVATION: Integrative network analysis methods provide robust interpretations of differential high-throughput molecular profile measurements. They are often used in a biomedical context-to generate novel hypotheses about the underlying cellular processes or to derive biomarkers for classification and subtyping. The underlying molecular profiles are frequently measured and validated on animal or cellular models. Therefore the results are not immediately transferable to human. In particular, this is also the case in a study of the recently discovered interleukin-17 producing helper T cells (Th17), which are fundamental for anti-microbial immunity but also known to contribute to autoimmune diseases. RESULTS: We propose a mathematical model for finding active subnetwork modules that are conserved between two species. These are sets of genes, one for each species, which (i) induce a connected subnetwork in a species-specific interaction network, (ii) show overall differential behavior and (iii) contain a large number of orthologous genes. We propose a flexible notion of conservation, which turns out to be crucial for the quality of the resulting modules in terms of biological interpretability. We propose an algorithm that finds provably optimal or near-optimal conserved active modules in our model. We apply our algorithm to understand the mechanisms underlying Th17 T cell differentiation in both mouse and human. As a main biological result, we find that the key regulation of Th17 differentiation is conserved between human and mouse. AVAILABILITY AND IMPLEMENTATION: xHeinz, an implementation of our algorithm, as well as all input data and results, are available at http://software.cwi.nl/xheinz and as a Galaxy service at http://services.cbib.u-bordeaux2.fr/galaxy in CBiB Tools. CONTACT: gunnar.klau@cwi.nl SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
Assuntos
Algoritmos , Mineração de Dados , Modelos Biológicos , Mapas de Interação de Proteínas , Animais , Perfilação da Expressão Gênica , Humanos , Recém-Nascido , Camundongos , Especificidade da Espécie , Células Th17/citologiaRESUMO
BACKGROUND: Biological networks have a growing importance for the interpretation of high-throughput "omics" data. Integrative network analysis makes use of statistical and combinatorial methods to extract smaller subnetwork modules, and performs enrichment analysis to annotate the modules with ontology terms or other available knowledge. This process results in an annotated module, which retains the original network structure and includes enrichment information as a set system. A major bottleneck is a lack of tools that allow exploring both network structure of extracted modules and its annotations. RESULTS: This paper presents a visual analysis approach that targets small modules with many set-based annotations, and which displays the annotations as contours on top of a node-link diagram. We introduce an extension of self-organizing maps to lay out nodes, links, and contours in a unified way. An implementation of this approach is freely available as the Cytoscape app eXamine CONCLUSIONS: eXamine accurately conveys small and annotated modules consisting of several dozens of proteins and annotations. We demonstrate that eXamine facilitates the interpretation of integrative network analysis results in a guided case study. This study has resulted in a novel biological insight regarding the virally-encoded G-protein coupled receptor US28.
Assuntos
Proteínas/análise , Algoritmos , Análise por Conglomerados , Modelos Biológicos , Proteínas/metabolismo , SoftwareRESUMO
Due to uncertainty in tumor phylogeny inference from sequencing data, many methods infer multiple, equally plausible phylogenies for the same cancer. To summarize the solution space T of tumor phylogenies, consensus tree methods seek a single best representative tree S under a specified pairwise tree distance function. One such distance function is the ancestor-descendant (AD) distance [Formula: see text] , which equals the size of the symmetric difference of the transitive closures of the edge sets [Formula: see text] and [Formula: see text] . Here, we show that finding a consensus tree S for tumor phylogenies T that minimizes the total AD distance [Formula: see text] is NP-hard.
Assuntos
Algoritmos , Neoplasias , Humanos , Consenso , Filogenia , IncertezaRESUMO
The design of an RNA sequence v that encodes an input target protein sequence w is a crucial aspect of messenger RNA (mRNA) vaccine development. There are an exponential number of possible RNA sequences for a single target protein due to codon degeneracy. These potential RNA sequences can assume various secondary structure conformations, each with distinct minimum free energy (MFE), impacting thermodynamic stability and mRNA half-life. Furthermore, the presence of species-specific codon usage bias, quantified by the codon adaptation index (CAI), plays a vital role in translation efficiency. While earlier studies focused on optimizing either MFE or CAI, recent research has underscored the advantages of simultaneously optimizing both objectives. However, optimizing one objective comes at the expense of the other. In this work, we present the Pareto Optimal RNA Design problem, aiming to identify the set of Pareto optimal solutions for which no alternative solutions exist that exhibit better MFE and CAI values. Our algorithm DEsign RNA (DERNA) uses the weighted sum method to enumerate the Pareto front by optimizing convex combinations of both objectives. We use dynamic programming to solve each convex combination in O(|w|3) time and O(|w|2) space. Compared with a CDSfold, previous approach that only optimizes MFE, we show on a benchmark data set that DERNA obtains solutions with identical MFE but superior CAI. Moreover, we show that DERNA matches the performance in terms of solution quality of LinearDesign, a recent approach that similarly seeks to balance MFE and CAI. We conclude by demonstrating our method's potential for mRNA vaccine design for the SARS-CoV-2 spike protein.
Assuntos
Algoritmos , RNA , Glicoproteína da Espícula de Coronavírus , Humanos , RNA/química , RNA Mensageiro , CódonRESUMO
In addition to undergoing evolution, members of biological populations may also migrate between locations. Examples include the spread of tumor cells from the primary tumor to distant metastases or the spread of pathogens from one host to another. One may represent migration histories by assigning a location label to each vertex of a given phylogenetic tree such that an edge connecting vertices with distinct locations represents a migration. Some biological populations undergo comigration, a phenomenon where multiple taxa from distinct lineages simultaneously comigrate from one location to another. In this work, we show that a previous problem statement for inferring migration histories that are parsimonious in terms of migrations and comigrations may lead to temporally inconsistent solutions. To remedy this deficiency, we introduce precise definitions of temporal consistency of comigrations in a phylogenetic tree, leading to three successive problems. First, we formulate the temporally consistent comigration problem to check if a set of comigrations is temporally consistent and provide a linear time algorithm for solving this problem. Second, we formulate the parsimonious consistent comigrations (PCC) problem, which aims to find comigrations given a location labeling of a phylogenetic tree. We show that PCC is NP-hard. Third, we formulate the parsimonious consistent comigration history (PCCH) problem, which infers the migration history given a phylogenetic tree and locations of its extant vertices only. We show that PCCH is NP-hard as well. On the positive side, we propose integer linear programming models to solve the PCC and PCCH problems. We demonstrate our algorithms on simulated and real data.