ABSTRACT
Cell lineage tree reconstruction methods are developed for various tasks, such as investigating development, differentiation, and cancer progression. Single-cell sequencing technologies enable more thorough analysis at higher resolution. We present Scuphr, a distance-based cell lineage tree reconstruction method using bulk and single-cell DNA sequencing data from healthy tissues. Scuphr accounts for common challenges of single-cell DNA sequencing, such as allelic dropouts and amplification errors. It computes the distance between cell pairs and reconstructs the lineage tree using the neighbor-joining algorithm. With its embarrassingly parallel design, Scuphr can run faster than the state-of-the-art methods while obtaining better accuracy. The method's robustness is investigated using various synthetic datasets and a biological dataset of 18 cells.
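The neighbor-joining step mentioned above can be illustrated with a minimal sketch. This is a textbook version of the algorithm, not Scuphr's implementation; the function name, the `frozenset`-keyed distance dictionary, and the `internal_<n>` labels are assumptions made for illustration only.

```python
from itertools import combinations


def neighbor_joining(names, dist):
    """Build an unrooted tree from pairwise distances (textbook neighbor joining).

    names: list of leaf labels.
    dist:  dict mapping frozenset({a, b}) -> distance between a and b.
    Returns a list of (node, neighbor, branch_length) edges.
    """
    nodes = list(names)
    d = dict(dist)  # work on a copy so the caller's matrix is untouched
    edges = []
    next_id = 0
    while len(nodes) > 2:
        n = len(nodes)
        # r[i]: sum of distances from node i to every other active node
        r = {i: sum(d[frozenset((i, j))] for j in nodes if j != i) for i in nodes}
        # Join the pair minimizing Q(i, j) = (n - 2) * d(i, j) - r[i] - r[j]
        a, b = min(
            combinations(nodes, 2),
            key=lambda p: (n - 2) * d[frozenset(p)] - r[p[0]] - r[p[1]],
        )
        dab = d[frozenset((a, b))]
        u = f"internal_{next_id}"  # hypothetical label for the new internal node
        next_id += 1
        # Branch lengths from the joined pair to the new node
        len_a = 0.5 * dab + (r[a] - r[b]) / (2 * (n - 2))
        len_b = dab - len_a
        edges += [(a, u, len_a), (b, u, len_b)]
        # Distances from the new node to every remaining node
        for k in nodes:
            if k not in (a, b):
                d[frozenset((u, k))] = 0.5 * (
                    d[frozenset((a, k))] + d[frozenset((b, k))] - dab
                )
        nodes = [k for k in nodes if k not in (a, b)] + [u]
    # Connect the last two remaining nodes
    a, b = nodes
    edges.append((a, b, d[frozenset((a, b))]))
    return edges
```

On an additive distance matrix the procedure recovers the generating tree exactly; with noisy single-cell distances it returns the tree favored by the Q criterion at each join.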
Subjects
Algorithms , Cell Lineage , Computational Biology , Single-Cell Analysis , Cell Lineage/genetics , Computational Biology/methods , Models, Statistical , Sequence Analysis, DNA/methods , Single-Cell Analysis/methods , Software
ABSTRACT
MOTIVATION: Copy number variations (CNVs) are common genetic alterations in tumour cells. The delineation of CNVs holds promise for enhancing our comprehension of cancer progression. Moreover, accurate inference of CNVs from single-cell sequencing data is essential for unravelling intratumoral heterogeneity. However, existing inference methods face limitations in resolution and sensitivity. RESULTS: To address these challenges, we present CopyVAE, a deep learning framework based on a variational autoencoder architecture. Through experiments, we demonstrated that CopyVAE can accurately and reliably detect CNVs from data obtained using single-cell RNA sequencing. CopyVAE surpasses existing methods in terms of sensitivity and specificity. We also discussed CopyVAE's potential to advance our understanding of genetic alterations and their impact on disease advancement. AVAILABILITY AND IMPLEMENTATION: CopyVAE is implemented and freely available under MIT license at https://github.com/kurtsemih/copyVAE.
Subjects
DNA Copy Number Variations , Single-Cell Analysis , Single-Cell Analysis/methods , Humans , Deep Learning , Software , Transcriptome/genetics , Sequence Analysis, RNA/methods , Neoplasms/genetics
ABSTRACT
The spatial distribution of lymphocyte clones within tissues is critical to their development, selection, and expansion. We have developed spatial transcriptomics of variable, diversity, and joining (VDJ) sequences (Spatial VDJ), a method that maps B cell and T cell receptor sequences in human tissue sections. Spatial VDJ captures lymphocyte clones that match canonical B and T cell distributions and amplifies clonal sequences confirmed by orthogonal methods. We found spatial congruency between paired receptor chains, developed a computational framework to predict receptor pairs, and linked the expansion of distinct B cell clones to different tumor-associated gene expression programs. Spatial VDJ delineates B cell clonal diversity and lineage trajectories within their anatomical niche. Thus, Spatial VDJ captures lymphocyte spatial clonal architecture across tissues, providing a platform to harness clonal sequences for therapy.
Subjects
B-Lymphocytes , Pre-B Cell Receptors , Receptors, Antigen, T-Cell , T-Lymphocytes , Humans , B-Lymphocytes/metabolism , Clone Cells/metabolism , Gene Expression Profiling/methods , Pre-B Cell Receptors/genetics , Receptors, Antigen, T-Cell/genetics , T-Lymphocytes/metabolism
ABSTRACT
Functional characterization of cancer clones can shed light on the evolutionary mechanisms driving cancer proliferation and relapse. Single-cell RNA sequencing data provide grounds for understanding the functional state of cancer as a whole; however, much research remains to identify and reconstruct clonal relationships toward characterizing the changes in functions of individual clones. We present PhylEx, which integrates bulk genomics data with co-occurrences of mutations from single-cell RNA sequencing data to reconstruct high-fidelity clonal trees. We evaluate PhylEx on synthetic and well-characterized high-grade serous ovarian cancer cell line datasets. PhylEx outperforms the state-of-the-art methods both in clonal tree reconstruction and in identifying clones. We analyze high-grade serous ovarian cancer and breast cancer data to show that PhylEx exploits clonal expression profiles beyond what is possible with expression-based clustering methods and clears the way for accurate inference of clonal trees and robust phylo-phenotypic analysis of cancer.
Subjects
Ovarian Neoplasms , Trees , Female , Humans , Trees/genetics , Transcriptome , Clonal Evolution , Neoplasm Recurrence, Local , Ovarian Neoplasms/genetics , Clone Cells , Single-Cell Analysis/methods
ABSTRACT
Breast cancer (BC) is a complex disease comprising multiple distinct subtypes with different genetic features and pathological characteristics. Although a large number of antineoplastic compounds have been approved for clinical use, patient-to-patient variability in drug response is frequently observed, highlighting the need for efficient treatment prediction for individualized therapy. Several patient-derived models have been established lately for the prediction of drug response. However, each of these models has its limitations that impede their clinical application. Here, we report that the whole-tumor cell culture (WTC) ex vivo model could be stably established from all breast tumors with a high success rate (98 out of 116), and it could reassemble the parental tumors with the endogenous microenvironment. We observed strong clinical associations and predictive values from the investigation of a broad range of BC therapies with WTCs derived from a patient cohort. The accuracy was further supported by the correlation between WTC-based test results and patients' clinical responses in a separate validation study, where the neoadjuvant treatment regimens of 15 BC patients were mimicked. Collectively, the WTC model allows us to accomplish personalized drug testing within 10 d, even for small-sized tumors, highlighting its potential for individualized BC therapy. Furthermore, coupled with genomic and transcriptomic analyses, WTC-based testing can also help to stratify specific patient groups for assignment into appropriate clinical trials, as well as validate potential biomarkers during drug development.
Subjects
Antineoplastic Agents , Breast Neoplasms , Humans , Female , Breast Neoplasms/drug therapy , Breast Neoplasms/genetics , Breast Neoplasms/pathology , Antineoplastic Agents/pharmacology , Antineoplastic Agents/therapeutic use , Gene Expression Profiling , Biomarkers , Cell Culture Techniques , Tumor Microenvironment
ABSTRACT
Identifying the interrelations among cancer driver genes and the patterns in which the driver genes get mutated is critical for understanding cancer. In this paper, we study cross-sectional data from cohorts of tumors to identify the cancer-type (or subtype) specific process in which the cancer driver genes accumulate critical mutations. We model this mutation accumulation process using a tree, where each node includes a driver gene or a set of driver genes. A mutation in each node enables its children to have a chance of mutating. This model simultaneously explains the mutual exclusivity patterns observed in mutations in specific cancer genes (by its nodes) and the temporal order of events (by its edges). We introduce a computationally efficient dynamic programming procedure for calculating the likelihood of our noisy datasets and use it to build our Markov Chain Monte Carlo (MCMC) inference algorithm, ToMExO. Together with a set of engineered MCMC moves, our fast likelihood calculations enable us to work with datasets with hundreds of genes and thousands of tumors, which cannot be dealt with using available cancer progression analysis methods. We demonstrate our method's performance on several synthetic datasets covering various scenarios for cancer progression dynamics. Then, a comparison against two state-of-the-art methods on a moderate-size biological dataset shows the merits of our algorithm in identifying significant and valid patterns. Finally, we present our analyses of several large biological datasets, including colorectal cancer, glioblastoma, and pancreatic cancer. In all the analyses, we validate the results using a set of method-independent metrics testing the causality and significance of the relations identified by ToMExO or competing methods.
Subjects
Glioblastoma , Neoplasms , Child , Humans , Cross-Sectional Studies , Neoplasms/genetics , Neoplasms/pathology , Neoplastic Processes , Algorithms , Monte Carlo Method , Mutation , Glioblastoma/genetics
ABSTRACT
Identification of mutations in the genes that give cancer a selective advantage is an important step towards research and clinical objectives. As such, there has been a growing interest in developing methods for identification of driver genes and their temporal order within a single patient (intra-tumor) as well as across a cohort of patients (inter-tumor). In this paper, we develop a probabilistic model for tumor progression, in which the driver genes are clustered into several ordered driver pathways. We develop an efficient inference algorithm that scales more favorably with the number of genes and samples than a previously introduced ILP-based method. Adopting a probabilistic approach also allows principled approaches to model selection and uncertainty quantification. Using a large set of experiments on synthetic datasets, we demonstrate our superior performance compared to the ILP-based method. We also analyze two biological datasets of colorectal and glioblastoma cancers. We emphasize that while the ILP-based method puts many seemingly passenger genes in the driver pathways, our algorithm keeps its focus on true driver genes and outputs more accurate models for cancer progression.
Subjects
Genes, Neoplasm/genetics , Models, Statistical , Neoplasms/genetics , Neoplasms/pathology , Algorithms , Computational Biology , Databases, Genetic , Disease Progression , Humans , Mutation/genetics
ABSTRACT
Intra-tumor heterogeneity is one of the biggest challenges in cancer treatment today. Here we investigate tissue-wide gene expression heterogeneity throughout a multifocal prostate cancer using the spatial transcriptomics (ST) technology. Utilizing a novel approach for deconvolution, we analyze the transcriptomes of nearly 6750 tissue regions and extract distinct expression profiles for the different tissue components, such as stroma, normal and PIN glands, immune cells and cancer. We distinguish healthy and diseased areas and thereby provide insight into gene expression changes during the progression of prostate cancer. Compared to pathologist annotations, we delineate the extent of cancer foci more accurately, interestingly without a link to histological changes. We identify gene expression gradients in stroma adjacent to tumor regions that allow for re-stratification of the tumor microenvironment. The establishment of these profiles is the first step towards an unbiased view of prostate cancer and can serve as a dictionary for future studies.
Subjects
Adenocarcinoma/genetics , Gene Expression Regulation, Neoplastic , Prostatic Neoplasms/genetics , Transcriptome/genetics , Adenocarcinoma/pathology , Adenocarcinoma/surgery , Computational Biology , Disease Progression , Gene Expression Profiling , Humans , Male , Prostate/cytology , Prostate/pathology , Prostate/surgery , Prostatectomy , Prostatic Neoplasms/pathology , Prostatic Neoplasms/surgery , RNA, Messenger/genetics , Stromal Cells/pathology , Tumor Microenvironment/genetics
ABSTRACT
Metastatic breast cancers are still incurable. Characterizing the evolutionary landscape of these cancers, including the role of metastatic axillary lymph nodes (ALNs) in seeding distant organ metastasis, can provide a rational basis for effective treatments. Here, we describe the genomic analyses of the primary tumors and metastatic lesions from 99 samples obtained from 20 patients with breast cancer. Our evolutionary analyses revealed diverse spreading and seeding patterns that govern tumor progression. Although linear evolution to successive metastatic sites was common, parallel evolution from the primary tumor to multiple distant sites was also evident. Metastatic spreading was frequently coupled with polyclonal seeding, in which multiple metastatic subclones originated from the primary tumor and/or other distant metastases. Synchronous ALN metastasis, a well-established prognosticator of breast cancer, was not involved in seeding the distant metastasis, suggesting a hematogenous route for cancer dissemination. Clonal evolution coincided frequently with emerging driver alterations and evolving mutational processes, notably an increase in apolipoprotein B mRNA-editing enzyme, catalytic polypeptide-like-associated (APOBEC-associated) mutagenesis. Our data provide genomic evidence on the role of ALN metastasis in seeding distant organ metastasis and elucidate the evolving mutational landscape during cancer progression.
Subjects
Breast Neoplasms/genetics , Evolution, Molecular , Mutation , Breast Neoplasms/mortality , Breast Neoplasms/pathology , Female , Humans , Lymph Nodes/metabolism , Lymph Nodes/pathology , Lymphatic Metastasis , Neoplasm Metastasis
ABSTRACT
Cancer arises when pathways that control cell functions such as proliferation and migration are dysregulated to such an extent that cells start to divide uncontrollably and eventually spread throughout the body, ultimately endangering the survival of an affected individual. It is well established that somatic mutations are important in cancer initiation and progression as well as in the creation of tumor diversity. Modifications of the transcriptome are now also emerging as a significant force during the transition from normal cell to malignant tumor. Editing of adenosine (A) to inosine (I) in double-stranded RNA, catalyzed by adenosine deaminases acting on RNA (ADARs), is one dynamic modification that in a combinatorial manner can give rise to a very diverse transcriptome. Since the cell interprets inosine as guanosine (G), editing can result in non-synonymous codon changes in transcripts as well as yield alternative splicing, but can also affect targeting and disrupt maturation of microRNA. ADAR editing is essential for survival in mammals, but its dysregulation can lead to cancer. ADAR1 is, for instance, overexpressed in lung cancer, liver cancer, esophageal cancer and chronic myelogenous leukemia, which with few exceptions promotes cancer progression. In contrast, ADAR2 is lowly expressed in, e.g., glioblastoma, where the lower levels of ADAR2 editing lead to malignant phenotypes. Altogether, RNA editing by the ADAR enzymes is a powerful regulatory mechanism during tumorigenesis. Depending on the cell type, cancer progression seems to mainly be induced by ADAR1 upregulation or ADAR2 downregulation, although in a few cases ADAR1 is instead downregulated. In this review, we discuss how aberrant editing of specific substrates contributes to malignancy.
Subjects
Adenosine Deaminase/metabolism , Neoplasms/genetics , RNA Editing , RNA, Double-Stranded/genetics , RNA-Binding Proteins/metabolism , Animals , Disease Progression , Gene Expression Regulation, Neoplastic , Humans , Neoplasms/metabolism , Neoplasms/pathology , RNA Isoforms/genetics , RNA Isoforms/metabolism , RNA, Double-Stranded/metabolism
ABSTRACT
It has long been known that repetitive elements, particularly Alu sequences in humans, are edited by the adenosine deaminases acting on RNA (ADAR) family. The functional interpretation of these events has been even more difficult than that of editing events in coding sequences, but today there is an emerging understanding of their downstream effects. A surprisingly large fraction of the human transcriptome contains inverted Alu repeats, often forming long double-stranded structures in RNA transcripts, typically occurring in introns and UTRs of protein-coding genes. Alu repeats are also common in other primates, and similar inverted repeats can frequently be found in non-primates, although the latter are less prone to duplex formation. In humans, as many as 700,000 Alu elements have been identified as substrates for RNA editing, of which many are edited at several sites. In fact, recent advancements in transcriptome sequencing techniques and bioinformatics have revealed that the human editome comprises at least a hundred million adenosine to inosine (A-to-I) editing sites in Alu sequences. Although substantial additional efforts are required in order to map the editome, present knowledge already provides an excellent starting point for studying cis-regulation of editing. In this review, we will focus on editing of long stem-loop structures in the human transcriptome and how it can affect gene expression.
Subjects
Alu Elements/genetics , Gene Expression Regulation , RNA Editing , RNA, Untranslated/genetics , Transcriptome/genetics , Animals , Humans , Introns/genetics , Models, Genetic , Primates
ABSTRACT
Cancer can result from the accumulation of different types of genetic mutations, such as copy number aberrations. The data from tumors are cross-sectional and do not contain the temporal order of the genetic events. Finding the order in which the genetic events have occurred, and the progression pathways, is of vital importance in understanding the disease. In order to model cancer progression, we propose Progression Networks, a special case of Bayesian networks that is tailored to model disease progression. Progression networks have similarities with Conjunctive Bayesian Networks (CBNs) [1], a variation of Bayesian networks also proposed for modeling disease progression. We also describe a learning algorithm for learning Bayesian networks in general and progression networks in particular. We reduce the hard problem of learning the Bayesian and progression networks to Mixed Integer Linear Programming (MILP). MILP is a Non-deterministic Polynomial-time complete (NP-complete) problem for which very good heuristics exist. We tested our algorithm on synthetic and real cytogenetic data from renal cell carcinoma. We also compared our learned progression networks with the networks proposed in earlier publications. The software is available on the website https://bitbucket.org/farahani/diprog.
Subjects
Carcinogenesis/genetics , Carcinoma, Renal Cell/genetics , Kidney Neoplasms/genetics , Models, Biological , Software , Bayes Theorem , Carcinoma, Renal Cell/pathology , Chromosome Aberrations , Comparative Genomic Hybridization , Disease Progression , Humans , Kidney Neoplasms/pathology , Oncogenes , Programming, Linear
ABSTRACT
Adenosine-to-inosine (A-to-I) RNA editing targets double-stranded RNA stem-loop structures in the mammalian brain. It has previously been shown that miRNAs are substrates for A-to-I editing. For the first time, we show that for several definitions of edited miRNA, the level of editing increases with development, thereby indicating a regulatory role for editing during brain maturation. We use high-throughput RNA sequencing to determine editing levels in mature miRNA, from the mouse transcriptome, and compare these with the levels of editing in pri-miRNA. We show that increased editing during development gradually changes the proportions of the two miR-376a isoforms, which previously have been shown to have different targets. Several other miRNAs that also are edited in the seed sequence show an increased level of editing through development. By comparing editing of pri-miRNA with editing and expression of the corresponding mature miRNA, we also show an editing-induced developmental regulation of miRNA expression. Taken together, our results imply that RNA editing influences the miRNA repertoire during brain maturation.
Subjects
Adenosine/metabolism , Brain/metabolism , Gene Expression Regulation, Developmental , Inosine/metabolism , MicroRNAs/metabolism , RNA Editing , Adenosine/genetics , Animals , Base Sequence , Brain/embryology , Brain/growth & development , Computational Biology , Dendrites/genetics , Dendrites/metabolism , Embryo, Mammalian/metabolism , Embryonic Development/genetics , High-Throughput Nucleotide Sequencing , Inosine/genetics , Mice , MicroRNAs/genetics , RNA Isoforms/genetics , RNA Isoforms/metabolism , RNA-Binding Proteins/genetics , RNA-Binding Proteins/metabolism , Transcriptome
ABSTRACT
Macrophages play a critical role in innate immunity, and the expression of early response genes orchestrates much of the initial response of the immune system. Macrophages undergo extensive transcriptional reprogramming in response to inflammatory stimuli such as lipopolysaccharide (LPS). To identify gene transcription regulation patterns involved in early innate immune responses, we used two genome-wide approaches: gene expression profiling and chromatin immunoprecipitation-sequencing (ChIP-seq) analysis. We examined the effect of 2 h of LPS stimulation on early gene expression and its relation to chromatin remodeling (H3 acetylation; H3Ac) and promoter binding of Sp1 and RNA polymerase II phosphorylated at serine 5 (S5P RNAPII), which is a marker for transcriptional initiation. Our results indicate novel and alternative gene regulatory mechanisms for certain proinflammatory genes. We identified two groups of up-regulated inflammatory genes with respect to chromatin modification and promoter features. One group, including highly up-regulated genes such as tumor necrosis factor (TNF), was characterized by H3Ac, high CpG content and lack of TATA boxes. The second group, containing inflammatory mediators (interleukins and CCL chemokines), was up-regulated upon LPS stimulation despite lacking H3Ac in their annotated promoters, which were low in CpG content but did contain TATA boxes. Genome-wide analysis showed that few H3Ac peaks were unique to either +/-LPS condition. However, within these, an unpacking/expansion of already existing H3Ac peaks was observed upon LPS stimulation. In contrast, a significant proportion of S5P RNAPII peaks (approx. 40%) was unique to either condition. Furthermore, the data indicated a large proportion of previously unannotated TSSs, particularly in LPS-stimulated macrophages, where only 28% of unique S5P RNAPII peaks overlap annotated promoters.
The regulation of the inflammatory response appears to occur in a very specific manner at the chromatin level for specific genes and this study highlights the level of fine-tuning that occurs in the immune response.
Subjects
Chromatin/chemistry , Cytokines/metabolism , Gene Expression Profiling , Macrophages/metabolism , Cell Differentiation , Chromatin Immunoprecipitation , CpG Islands , Genome-Wide Association Study , Histones/chemistry , Humans , Immune System , Immunity, Innate , Inflammation/genetics , Macrophages/cytology , Models, Biological , Monocytes/cytology , Multigene Family , Oligonucleotide Array Sequence Analysis , Promoter Regions, Genetic , Protein Binding , RNA, Messenger/metabolism , Serine/chemistry
ABSTRACT
The insulin-like growth factor 1 receptor (IGF-1R) plays crucial roles in developmental and cancer biology. Most of its biological effects have been ascribed to its tyrosine kinase activity, which propagates signaling through the phosphatidylinositol 3-kinase and mitogen-activated protein kinase pathways. Here, we report that IGF-1 promotes the modification of IGF-1R by small ubiquitin-like modifier protein-1 (SUMO-1) and its translocation to the nucleus. Nuclear IGF-1R associated with enhancer-like elements and increased transcription in reporter assays. The SUMOylation sites of IGF-1R were identified as three evolutionarily conserved lysine residues (Lys1025, Lys1100, and Lys1120) in the beta subunit of the receptor. Mutation of these SUMO-1 sites abolished the ability of IGF-1R to translocate to the nucleus and activate transcription but did not alter its kinase-dependent signaling. Thus, we demonstrate a SUMOylation-mediated mechanism of IGF-1R signaling that has potential implications for gene regulation.
Subjects
Active Transport, Cell Nucleus , Gene Expression Regulation, Neoplastic , Receptor, IGF Type 1/metabolism , Signal Transduction , Cell Nucleus/metabolism , Enhancer Elements, Genetic , Gene Expression Regulation , Humans , Lysine/chemistry , MAP Kinase Signaling System , Melanoma/metabolism , Models, Biological , Mutation , Protein Conformation , Skin Neoplasms/metabolism
ABSTRACT
BACKGROUND: Several bioinformatic approaches have previously been used to find novel sites of ADAR-mediated A-to-I RNA editing in humans. These studies have discovered thousands of genes that are hyper-edited in their non-coding intronic regions, especially in Alu retrotransposable elements, but very few substrates that are site-selectively edited in coding regions. Known RNA-edited substrates suggest, however, that site-selective A-to-I editing is particularly important for normal brain development in mammals. RESULTS: We have compiled a screen that enables the identification of new sites of site-selective editing, primarily in coding sequences. To avoid hyper-edited repeat regions, we applied our screen to the Alu-free mouse genome. Focusing on the mouse also facilitated better experimental verification. To identify candidate sites of RNA editing, we first performed an explorative screen based on RNA structure and genomic sequence conservation. We further evaluated the results of the explorative screen by determining which transcripts were enriched for A-G mismatches between the genomic template and the expressed sequence, since the editing product, inosine (I), is read as guanosine (G) by the translational machinery. For expressed sequences, we only considered coding regions to focus entirely on re-coding events. Lastly, we refined the results from the explorative screen using a novel scoring scheme based on characteristics of known A-to-I edited sites. The extent of editing in the final candidate genes was verified using total RNA from mouse brain and 454 sequencing. CONCLUSIONS: Using this method, we identified and confirmed efficient editing at one site in the Gabra3 gene. Editing was also verified at several other novel sites within candidates predicted to be edited. Five of these sites are situated in genes coding for the neuron-specific RNA-binding proteins HuB and HuD.
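The A-G mismatch criterion described in the RESULTS can be sketched in a few lines. This is only an illustration of the idea, not the authors' screen: the function names are hypothetical, the sequences are assumed to be already aligned to equal length, and real data would also need quality filtering and SNP exclusion.

```python
def a_to_g_mismatches(genomic, expressed):
    """Return 0-based positions where the genome has A and the transcript reads G."""
    if len(genomic) != len(expressed):
        raise ValueError("sequences must be aligned to equal length")
    pairs = zip(genomic.upper(), expressed.upper())
    return [i for i, (g, e) in enumerate(pairs) if g == "A" and e == "G"]


def a_to_g_fraction(genomic, expressed):
    """Fraction of all mismatches that are A-to-G; None if there are no mismatches."""
    a_to_g = a_to_g_mismatches(genomic, expressed)  # also validates the lengths
    total = sum(g != e for g, e in zip(genomic.upper(), expressed.upper()))
    if total == 0:
        return None
    return len(a_to_g) / total
```

A transcript whose mismatches are dominated by A-to-G changes is a candidate for editing, whereas mismatches spread evenly over all substitution types point to sequencing error or polymorphism.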
Subjects
Adenosine/genetics , Genome , Inosine/genetics , Neurons/metabolism , RNA Editing , RNA-Binding Proteins/chemistry , Adenosine/metabolism , Alu Elements/genetics , Animals , Base Sequence , Computational Biology/methods , Mice , Molecular Sequence Data , Phylogeny , RNA/chemistry , RNA/metabolism , RNA-Binding Proteins/metabolism , Sequence Analysis, RNA
ABSTRACT
Chromosomal aberrations in solid tumors appear in complex patterns. It is important to understand how these patterns develop, the dynamics of the process, the temporal or even causal order between aberrations, and the involved pathways. Here we present network models for chromosomal aberrations and algorithms for training models based on observed data. Our models are generative probabilistic models that can be used to study dynamical aspects of chromosomal evolution in cancer cells. They are well suited for a graphical representation that conveys the pathways found in a dataset. By allowing only pairwise dependencies and partitioning aberrations into modules, in which all aberrations are restricted to have the same dependencies, we reduce the number of parameters so that dataset sizes relevant to cancer applications can be handled. We apply our framework to a dataset of colorectal cancer tumor karyotypes. The obtained model explains the data significantly better than a model where independence between the aberrations is assumed. In fact, the obtained model performs very well with respect to several measures of goodness of fit and is, with respect to repetition of the training, more or less unique.
Subjects
Algorithms , Cell Transformation, Neoplastic , Computational Biology/methods , Models, Biological , Models, Statistical , Neoplasms/etiology , Computational Biology/statistics & numerical data
ABSTRACT
Maximum likelihood (ML) (Neyman, 1971) is an increasingly popular optimality criterion for selecting evolutionary trees. Finding optimal ML trees appears to be a very hard computational task; in particular, algorithms and heuristics for ML take longer to run than algorithms and heuristics for maximum parsimony (MP). However, while MP has been known to be NP-complete for over 20 years, no such hardness result has been obtained so far for ML. In this work we make a first step in this direction by proving that ancestral maximum likelihood (AML) is NP-complete. The input to this problem is a set of aligned sequences of equal length, and the goal is to find a tree and an assignment of ancestral sequences for all of that tree's internal vertices such that the likelihood of generating both the ancestral and contemporary sequences is maximized. Our NP-hardness proof follows that for MP given by Day, Johnson and Sankoff (1986) in that we use the same reduction from Vertex Cover; however, the proof of correctness for this reduction relative to AML is different and substantially more involved.
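As a hedged illustration of the AML objective, assume binary sequences of length k and a symmetric two-state substitution model in which each site changes independently along edge e with probability p_e. The abstract does not state the model, so the model, the labeling function λ, and the Hamming distance d_e below are assumptions made for this sketch; a constant root-prior factor and any upper bound on p_e are ignored.

```latex
\[
  L(T, \lambda, p) \;=\; \prod_{e=(u,v)\in E(T)} p_e^{\,d_e}\,(1-p_e)^{\,k-d_e},
  \qquad d_e = d_H\bigl(\lambda(u), \lambda(v)\bigr).
\]
For a fixed tree $T$ and labeling $\lambda$, each factor is maximized at $p_e = d_e / k$, which gives
\[
  \max_{p}\, \log L(T, \lambda, p) \;=\; -\,k \sum_{e \in E(T)} H\!\left(\frac{d_e}{k}\right),
  \qquad H(x) = -x \log x - (1-x)\log(1-x).
\]
```

Under these assumptions, AML amounts to choosing the tree and ancestral sequences that minimize the summed binary entropy of the normalized edge distances, whereas MP minimizes the sum of the d_e themselves. The two objectives differ, which is consistent with the abstract's remark that the correctness proof for AML is more involved than the one for MP.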