1.
Microb Genom ; 10(3)2024 Mar.
Article in English | MEDLINE | ID: mdl-38529944

ABSTRACT

Minimum Inhibitory Concentrations (MICs) are the gold standard for quantitatively measuring antibiotic resistance. However, lab-based MIC determination can be time-consuming and suffers from low reproducibility, and interpretation as sensitive or resistant relies on guidelines which change over time. Genome sequencing and machine learning promise to allow in silico MIC prediction as an alternative approach which overcomes some of these difficulties, although the predicted MIC still needs to be interpreted. Nevertheless, precisely how we should handle MIC data when dealing with predictive models remains unclear, since they are measured semi-quantitatively, with varying resolution, and are typically also left- and right-censored within varying ranges. We therefore investigated genome-based prediction of MICs in the pathogen Klebsiella pneumoniae using 4367 genomes with both simulated semi-quantitative traits and real MICs. As we were focused on clinical interpretation, we used interpretable rather than black-box machine learning models, namely, Elastic Net, Random Forests, and linear mixed models. Simulated traits were generated accounting for oligogenic, polygenic, and homoplastic genetic effects with different levels of heritability. Then we assessed how model prediction accuracy was affected when MICs were framed as regression and classification. Our results showed that treating the MICs differently depending on the number of concentration levels of antibiotic available was the most promising learning strategy. Specifically, to optimise both prediction accuracy and inference of the correct causal variants, we recommend considering the MICs as continuous and framing the learning problem as a regression when the number of observed antibiotic concentration levels is large, whereas with a smaller number of concentration levels they should be treated as a categorical variable and the learning problem should be framed as a classification.
Our findings also underline how predictive models can be improved when prior biological knowledge is taken into account, due to the varying genetic architecture of each antibiotic resistance trait. Finally, we emphasise that expanding the population database is pivotal for the future clinical implementation of these models to support routine machine-learning-based diagnostics.
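
The recommendation above (regression when many concentration levels are observed, classification when few) can be sketched as a small helper. This is an illustrative sketch only: the function names, the cut-off of 6 levels, and the example MIC panels are assumptions and do not come from the abstract.

```python
import math


def log2_mic_levels(mics):
    """Return the sorted distinct log2-transformed MIC values."""
    return sorted({round(math.log2(m), 6) for m in mics})


def choose_framing(mics, min_levels_for_regression=6):
    """Pick a learning framing from the number of observed MIC levels.

    The default cut-off of 6 levels is an illustrative assumption,
    not a value reported in the abstract.
    """
    n_levels = len(log2_mic_levels(mics))
    if n_levels >= min_levels_for_regression:
        return "regression"
    return "classification"


# Hypothetical MICs (mg/L) measured on two-fold dilution series
wide_panel = [0.25, 0.5, 1, 2, 4, 8, 16, 32, 1, 4]   # 8 distinct levels
narrow_panel = [2, 4, 8, 2, 8, 4]                    # 3 distinct levels

print(choose_framing(wide_panel))
print(choose_framing(narrow_panel))
```

In practice the cut-off would need to be tuned per antibiotic, and censored values (for example "<=0.25") would need separate handling before the log2 transform.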


Subjects
Anti-Bacterial Agents, Klebsiella pneumoniae, Klebsiella pneumoniae/genetics, Reproducibility of Results, Anti-Bacterial Agents/pharmacology, Machine Learning, Microbial Sensitivity Tests
2.
Bioinformatics ; 39(11)2023 11 01.
Article in English | MEDLINE | ID: mdl-37871178

ABSTRACT

SUMMARY: Fastlin is a bioinformatics tool designed for rapid Mycobacterium tuberculosis complex (MTBC) lineage typing. It utilizes an ultra-fast alignment-free approach to detect previously identified barcode single nucleotide polymorphisms associated with specific MTBC lineages. In a comprehensive benchmarking against existing tools, fastlin demonstrated high accuracy and significantly faster running times. AVAILABILITY AND IMPLEMENTATION: fastlin is freely available at https://github.com/rderelle/fastlin and can easily be installed using Conda.


Subjects
Mycobacterium tuberculosis, Mycobacterium tuberculosis/genetics, Computational Biology, Single Nucleotide Polymorphism, Software
3.
J Comput Biol ; 30(6): 678-694, 2023 06.
Article in English | MEDLINE | ID: mdl-37327036

ABSTRACT

Computing the Elementary Flux Modes (EFMs) and Minimal Cut Sets (MCSs) of a metabolic network is a fundamental problem in metabolic network analysis. A key insight is that they can be understood as a dual pair of monotone Boolean functions (MBFs). Using this insight, this computation reduces to the question of generating from an oracle a dual pair of MBFs. If one of the two sets (functions) is known, then the other can be computed through a process known as dualization. Fredman and Khachiyan provided two algorithms, which they simply called A and B, that can serve as an engine for oracle-based generation or dualization of MBFs. We look at efficiencies available in implementing their algorithm B, which we will refer to as FK-B. Like their algorithm A, FK-B certifies whether two given MBFs in the form of Conjunctive Normal Form and Disjunctive Normal Form are dual or not, and in case of not being dual it returns a conflicting assignment (CA), that is, an assignment that makes one of the given Boolean functions True and the other one False. The FK-B algorithm is a recursive algorithm that searches through the tree of assignments to find a CA. If it does not find any CA, it means that the given Boolean functions are dual. In this article, we propose six techniques applicable to the FK-B and hence to the dualization process. Although these techniques do not reduce the time complexity, they considerably reduce the running time in practice. We evaluate the proposed improvements by applying them to compute the MCSs from the EFMs in the 19 small- and medium-sized models from the BioModels database along with 4 models of biomass synthesis in Escherichia coli that were used in an earlier computational survey by Haus et al. (2008).
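
To make the notion of a conflicting assignment concrete, the sketch below searches for one by brute force over all assignments. It is not FK-B (which prunes a recursive search tree); it is an exponential-time illustration, and the set-based encoding of clauses and terms and all function names are assumptions made for this example.

```python
from itertools import product


def eval_cnf(clauses, true_vars):
    """Monotone CNF: every clause must contain at least one true variable."""
    return all(clause & true_vars for clause in clauses)


def eval_dnf(terms, true_vars):
    """Monotone DNF: at least one term must have all its variables true."""
    return any(term <= true_vars for term in terms)


def find_conflicting_assignment(clauses, terms, n_vars):
    """Return an assignment on which the CNF and the DNF disagree, or None.

    Brute force over all 2**n_vars assignments; only usable for tiny inputs.
    """
    for assignment in product([False, True], repeat=n_vars):
        true_vars = {i for i, value in enumerate(assignment) if value}
        if eval_cnf(clauses, true_vars) != eval_dnf(terms, true_vars):
            return assignment
    return None


# (x0 OR x1) AND (x0 OR x2) agrees with x0 OR (x1 AND x2) everywhere
cnf = [{0, 1}, {0, 2}]
complete_dnf = [{0}, {1, 2}]
incomplete_dnf = [{0}]  # the term {x1, x2} is missing

print(find_conflicting_assignment(cnf, complete_dnf, 3))
print(find_conflicting_assignment(cnf, incomplete_dnf, 3))
```

For the incomplete DNF, the returned assignment sets x1 and x2 to True and x0 to False, which is exactly the missing term; this is how a conflicting assignment can drive the incremental generation of the other set.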


Subjects
Algorithms, Metabolic Networks and Pathways, Escherichia coli/metabolism, Biological Models
4.
PLoS Comput Biol ; 19(6): e1011129, 2023 Jun.
Article in English | MEDLINE | ID: mdl-37347768

ABSTRACT

The increasing availability of high-throughput sequencing (frequently termed next-generation sequencing (NGS)) data has created opportunities to gain deeper insights into the mechanisms of a number of diseases and is already impacting many areas of medicine and public health. The area of infectious diseases stands somewhat apart from other human diseases insofar as the relevant genomic data comes from the microbes rather than their human hosts. A particular concern about the threat of antimicrobial resistance (AMR) has driven the collection and reporting of large-scale datasets containing information from microbial genomes together with antimicrobial susceptibility test (AST) results. Unfortunately, the lack of clear standards or guiding principles for the reporting of such data is hampering the field's advancement. We therefore present our recommendations for the publication and sharing of genotype and phenotype data on AMR, in the form of 10 simple rules. The adoption of these recommendations will enhance AMR data interoperability and help enable its large-scale analyses using computational biology tools, including mathematical modelling and machine learning. We hope that these rules can shed light on often overlooked but nonetheless very necessary aspects of AMR data sharing and enhance the field's ability to address the problems of understanding AMR mechanisms, tracking their emergence and spread in populations, and predicting microbial susceptibility to antimicrobials for diagnostic purposes.


Subjects
Anti-Bacterial Agents, Anti-Infective Agents, Humans, Anti-Bacterial Agents/pharmacology, Bacterial Drug Resistance/genetics, Bacteria/genetics, Microbial Genome, Genotype, Phenotype
5.
Article in English | MEDLINE | ID: mdl-37200133

ABSTRACT

An important problem in genome comparison is the genome sorting problem, that is, the problem of finding a sequence of basic operations that transforms one genome into another whose length (possibly weighted) equals the distance between them. These sequences are called optimal sorting scenarios. However, there is usually a large number of such scenarios, and a naïve algorithm is very likely to be biased towards a specific type of scenario, impairing its usefulness in real-world applications. One way to go beyond the traditional sorting algorithms is to explore all possible solutions, looking at all the optimal sorting scenarios instead of just an arbitrary one. Another related approach is to analyze all the intermediate genomes, that is, all the genomes that can occur in an optimal sorting scenario. In this paper, we show how to enumerate the optimal sorting scenarios and the intermediate genomes between any two given genomes, under the rank distance.

6.
Bioinformatics ; 39(3)2023 03 01.
Article in English | MEDLINE | ID: mdl-36790056

ABSTRACT

MOTIVATION: The rank distance model represents genome rearrangements in multi-chromosomal genomes as matrix operations, which allows the reconstruction of parsimonious histories of evolution by rearrangements. We seek to generalize this model by allowing for genomes with different gene content, to accommodate a broader range of biological contexts. We approach this generalization by using a matrix representation of genomes. This leads to simple distance formulas and sorting algorithms for genomes with different gene contents, but without duplications. RESULTS: We generalize the rank distance to genomes with different gene content in two different ways. The first approach adds insertions, deletions and the substitution of a single extremity to the basic operations. We show how to efficiently compute this distance. To avoid genomes with incomplete markers, our alternative distance, the rank-indel distance, only uses insertions and deletions of entire chromosomes. We construct phylogenetic trees with our distances and the DCJ-Indel distance for simulated data and real prokaryotic genomes, and compare them against reference trees. For simulated data, our distances outperform the DCJ-Indel distance using the Quartet metric as baseline. This suggests that rank distances are more robust for comparing distantly related species. For real prokaryotic genomes, all rearrangement-based distances yield phylogenetic trees that are topologically distant from the reference (65% similarity with Quartet metric), but are able to cluster related species within their respective clades and distinguish the Shigella strains as the farthest relative of the Escherichia coli strains, a feature not seen in the reference tree. AVAILABILITY AND IMPLEMENTATION: Code and instructions are available at https://github.com/meidanis-lab/rank-indel. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.


Subjects
Genomics, Genetic Models, Phylogeny, Genome, INDEL Mutation, Algorithms
7.
Comput Struct Biotechnol J ; 20: 4688-4703, 2022.
Article in English | MEDLINE | ID: mdl-36147681

ABSTRACT

Antibiotic-resistant pathogens are a major public health threat. A deeper understanding of how an antibiotic's mechanism of action influences the emergence of resistance would aid in the design of new drugs and help to preserve the effectiveness of existing ones. To this end, we developed a model that links bacterial population dynamics with antibiotic-target binding kinetics. Our approach allows us to derive mechanistic insights on drug activity from population-scale experimental data and to quantify the interplay between drug mechanism and resistance selection. We find that both bacteriostatic and bactericidal agents can be equally effective at suppressing the selection of resistant mutants, but that key determinants of resistance selection are the relationships between the number of drug-inactivated targets within a cell and the rates of cellular growth and death. We also show that heterogeneous drug-target binding within a population enables resistant bacteria to evolve fitness-improving secondary mutations even when drug doses remain above the resistant strain's minimum inhibitory concentration. Our work suggests that antibiotic doses beyond this "secondary mutation selection window" could safeguard against the emergence of high-fitness resistant strains during treatment.

8.
Philos Trans R Soc Lond B Biol Sci ; 377(1861): 20210231, 2022 10 10.
Article in English | MEDLINE | ID: mdl-35989604

ABSTRACT

The field of genomic epidemiology is rapidly growing as many jurisdictions begin to deploy whole-genome sequencing (WGS) in their national or regional pathogen surveillance programmes. WGS data offer a rich view of the shared ancestry of a set of taxa, typically visualized with phylogenetic trees illustrating the clusters or subtypes present in a group of taxa, their relatedness and the extent of diversification within and between them. When methicillin-resistant Staphylococcus aureus (MRSA) arose and disseminated widely, phylogenetic trees of MRSA-containing types of S. aureus had a distinctive 'comet' shape, with a 'comet head' of recently adapted drug-resistant isolates in the context of a 'comet tail' that was predominantly drug-sensitive. Placing an S. aureus isolate in the context of such a 'comet' helped public health laboratories interpret local data within the broader setting of S. aureus evolution. In this work, we ask what other tree shapes, analogous to the MRSA comet, are present in bacterial WGS datasets. We extract trees from large bacterial genomic datasets, visualize them as images and cluster the images. We find nine major groups of tree images, including the 'comets', star-like phylogenies, 'barbell' phylogenies and other shapes, and comment on the evolutionary and epidemiological stories these shapes might illustrate. This article is part of a discussion meeting issue 'Genomic population structures of microbial pathogens'.


Subjects
Methicillin-Resistant Staphylococcus aureus, Staphylococcal Infections, Anti-Bacterial Agents, Cluster Analysis, Bacterial Genome, Humans, Methicillin-Resistant Staphylococcus aureus/genetics, Phylogeny, Staphylococcal Infections/epidemiology, Staphylococcal Infections/genetics, Staphylococcal Infections/microbiology, Staphylococcus aureus/genetics
9.
Lancet Microbe ; 3(4): e265-e273, 2022 04.
Article in English | MEDLINE | ID: mdl-35373160

ABSTRACT

Background: Molecular diagnostics are considered the most promising route to achieving rapid, universal drug susceptibility testing for Mycobacterium tuberculosis complex (MTBC). We aimed to generate a WHO-endorsed catalogue of mutations to serve as a global standard for interpreting molecular information for drug resistance prediction. Methods: A candidate gene approach was used to identify mutations as associated with resistance, or consistent with susceptibility, for 13 WHO-endorsed anti-tuberculosis drugs. 38,215 MTBC isolates with paired whole-genome sequencing and phenotypic drug susceptibility testing data were amassed from 45 countries. For each mutation, a contingency table of binary phenotypes and presence or absence of the mutation was used to compute the positive predictive value, and Fisher's exact tests generated odds ratios and Benjamini-Hochberg corrected p-values. Mutations were graded as Associated with Resistance if present in at least 5 isolates, if the odds ratio was >1 with a statistically significant corrected p-value, and if the lower bound of the 95% confidence interval on the positive predictive value for phenotypic resistance was >25%. A series of expert rules were applied for final confidence grading of each mutation. Findings: 15,667 associations were computed for 13,211 unique mutations linked to one or more drugs. 1,149/15,667 (7·3%) mutations were classified as associated with phenotypic resistance and 107/15,667 (0·7%) were deemed consistent with susceptibility. For rifampicin, isoniazid, ethambutol, fluoroquinolones, and streptomycin, the mutations' pooled sensitivity was >80%. Specificity was over 95% for all drugs except ethionamide (91·4%), moxifloxacin (91·6%) and ethambutol (93·3%). Only two resistance mutations were classified for bedaquiline, delamanid, clofazimine, and linezolid as prevalence of phenotypic resistance was low for these drugs.
Interpretation: This first WHO endorsed catalogue of molecular targets for MTBC drug susceptibility testing provides a global standard for resistance interpretation. Its existence should encourage the implementation of molecular diagnostics by National Tuberculosis Programmes. Funding: UNITAID, Wellcome, MRC, BMGF.
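
The per-mutation statistics named in the Methods (odds ratio, positive predictive value, Fisher's exact test, Benjamini-Hochberg correction) can be sketched in plain Python as follows. This is a hedged illustration, not the catalogue's pipeline: the one-sided test, the Wilson interval for the PPV, the 0.05 significance level, and all function names and counts are assumptions, and the expert rules are omitted.

```python
from math import comb, sqrt

# Table layout: a = mutant & resistant, b = mutant & susceptible,
#               c = wild-type & resistant, d = wild-type & susceptible


def odds_ratio(a, b, c, d):
    return (a * d) / (b * c) if b * c else float("inf")


def fisher_exact_greater(a, b, c, d):
    """One-sided Fisher p-value: P(X >= a) under the hypergeometric null."""
    mutants, resistant, n = a + b, a + c, a + b + c + d
    upper = min(mutants, resistant)
    tail = sum(comb(resistant, k) * comb(n - resistant, mutants - k)
               for k in range(a, upper + 1))
    return tail / comb(n, mutants)


def benjamini_hochberg(pvalues):
    """Return BH-adjusted p-values in the original order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted, running_min = [0.0] * m, 1.0
    for rank in range(m, 0, -1):          # walk from largest to smallest p
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


def wilson_lower_bound(successes, total, z=1.96):
    """Lower end of the Wilson 95% interval (assumed; the abstract names no method)."""
    p = successes / total
    centre = p + z * z / (2 * total)
    margin = z * sqrt(p * (1 - p) / total + z * z / (4 * total * total))
    return (centre - margin) / (1 + z * z / total)


def meets_resistance_criteria(a, b, c, d, corrected_p, alpha=0.05):
    """Apply the three quantitative criteria quoted in the abstract."""
    return (a + b >= 5
            and odds_ratio(a, b, c, d) > 1
            and corrected_p < alpha
            and wilson_lower_bound(a, a + b) > 0.25)


# Hypothetical counts for one mutation-drug pair
p_raw = fisher_exact_greater(40, 2, 60, 898)
print(odds_ratio(40, 2, 60, 898), p_raw)
print(meets_resistance_criteria(40, 2, 60, 898, corrected_p=p_raw))
```

In a real analysis the raw p-values of all mutation-drug pairs would be passed through `benjamini_hochberg` together before grading.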


Subjects
Ethambutol, Mycobacterium tuberculosis, Antitubercular Agents/pharmacology, Drug Resistance, Microbial Sensitivity Tests, Mutation, Mycobacterium tuberculosis/genetics, World Health Organization
10.
BMC Bioinformatics ; 23(1): 42, 2022 Jan 15.
Article in English | MEDLINE | ID: mdl-35033007

ABSTRACT

BACKGROUND: There has been a simultaneous increase in demand and accessibility across genomics, transcriptomics, proteomics and metabolomics data, known as omics data. This has encouraged widespread application of omics data in life sciences, from personalized medicine to the discovery of underlying pathophysiology of diseases. Causal analysis of omics data may provide important insight into the underlying biological mechanisms. Existing causal analysis methods yield promising results when identifying potential general causes of an observed outcome based on omics data. However, they may fail to discover the causes specific to a particular stratum of individuals and missing from others. METHODS: To fill this gap, we introduce the problem of stratified causal discovery and propose a method, Aristotle, for solving it. Aristotle addresses the two challenges intrinsic to omics data: high dimensionality and hidden stratification. It employs existing biological knowledge and a state-of-the-art patient stratification method to tackle the above challenges and applies a quasi-experimental design method to each stratum to find stratum-specific potential causes. RESULTS: Evaluation based on synthetic data shows better performance for Aristotle in discovering true causes under different conditions compared to existing causal discovery methods. Experiments on a real dataset on Anthracycline Cardiotoxicity indicate that Aristotle's predictions are consistent with the existing literature. Moreover, Aristotle makes additional predictions that suggest further investigations.


Subjects
Genomics, Proteomics, Humans, Metabolomics, Precision Medicine, Transcriptome
11.
PLoS One ; 16(12): e0259877, 2021.
Article in English | MEDLINE | ID: mdl-34941890

ABSTRACT

The shape of phylogenetic trees can be used to gain evolutionary insights. A tree's shape specifies the connectivity of a tree, while its branch lengths reflect either the time or genetic distance between branching events; well-known measures of tree shape include the Colless and Sackin imbalance, which describe the asymmetry of a tree. In other contexts, network science has become an important paradigm for describing structural features of networks and using them to understand complex systems, ranging from protein interactions to social systems. Network science is thus a potential source of many novel ways to characterize tree shape, as trees are also networks. Here, we tailor tools from network science, including diameter, average path length, and betweenness, closeness, and eigenvector centrality, to summarize phylogenetic tree shapes. We thereby propose tree shape summaries that are complementary to both asymmetry and the frequencies of small configurations. These new statistics can be computed in linear time and scale well to describe the shapes of large trees. We apply these statistics, alongside some conventional tree statistics, to phylogenetic trees from three very different viruses (HIV, dengue fever and measles), from the same virus in different epidemiological scenarios (influenza A and HIV) and from simulation models known to produce trees with different shapes. Using mutual information and supervised learning algorithms, we find that the statistics adapted from network science perform as well as or better than conventional statistics. We describe their distributions and prove some basic results about their extreme values in a tree. We conclude that network science-based tree shape summaries are a promising addition to the toolkit of tree shape features. All our shape summaries, as well as functions to select the most discriminating ones for two sets of trees, are freely available as an R package at http://github.com/Leonardini/treeCentrality.
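
As a small illustration of the kind of summaries discussed above, the sketch below computes two conventional imbalance measures (Sackin and Colless) and one network-science quantity (diameter, counted in edges) in a single linear-time traversal. It is independent of the treeCentrality R package; the nested-tuple tree encoding and the function name are assumptions made for this example.

```python
def summarize(tree):
    """Return (n_leaves, height, sackin, colless, diameter) for a rooted binary tree.

    A leaf is any non-tuple label; an internal node is a (left, right) tuple.
    Branch lengths are ignored, so the diameter is counted in edges.
    """
    if not isinstance(tree, tuple):
        return 1, 0, 0, 0, 0
    left, right = tree
    nl, hl, sl, cl, dl = summarize(left)
    nr, hr, sr, cr, dr = summarize(right)
    n = nl + nr
    height = 1 + max(hl, hr)
    sackin = sl + sr + n                   # each leaf below is one edge deeper
    colless = cl + cr + abs(nl - nr)       # imbalance contributed by this node
    diameter = max(dl, dr, hl + hr + 2)    # longest path may pass through this node
    return n, height, sackin, colless, diameter


caterpillar = ((("a", "b"), "c"), "d")
balanced = (("a", "b"), ("c", "d"))

print(summarize(caterpillar))
print(summarize(balanced))
```

The two four-leaf trees share the same diameter but differ in Sackin and Colless imbalance, which is one reason several complementary statistics are useful when comparing tree shapes.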


Subjects
Computational Biology/methods, Decision Trees, Virus Diseases/virology, Viruses/classification, Algorithms, Statistical Data Interpretation, Dengue/epidemiology, Dengue/virology, Dengue Virus/classification, HIV Infections/epidemiology, HIV Infections/virology, HIV-1/classification, Humans, Measles/epidemiology, Measles/virology, Measles Virus/classification, Phylogeny, Software, Virus Diseases/epidemiology
12.
Nat Commun ; 12(1): 5820, 2021 10 05.
Article in English | MEDLINE | ID: mdl-34611158

ABSTRACT

European governments use non-pharmaceutical interventions (NPIs) to control resurging waves of COVID-19. However, they only have outdated estimates for how effective individual NPIs were in the first wave. We estimate the effectiveness of 17 NPIs in Europe's second wave from subnational case and death data by introducing a flexible hierarchical Bayesian transmission model and collecting the largest dataset of NPI implementation dates across Europe. Business closures, educational institution closures, and gathering bans reduced transmission, but reduced it less than they did in the first wave. This difference is likely due to organisational safety measures and individual protective behaviours-such as distancing-which made various areas of public life safer and thereby reduced the effect of closing them. Specifically, we find smaller effects for closing educational institutions, suggesting that stringent safety measures made schools safer compared to the first wave. Second-wave estimates outperform previous estimates at predicting transmission in Europe's third wave.


Subjects
COVID-19/epidemiology, Government, Basic Reproduction Number, COVID-19/virology, Europe/epidemiology, Humans, Theoretical Models, SARS-CoV-2/physiology, Time Factors
13.
Algorithms Mol Biol ; 16(1): 17, 2021 Aug 10.
Article in English | MEDLINE | ID: mdl-34376217

ABSTRACT

MOTIVATION: Prediction of drug resistance and identification of its mechanisms in bacteria such as Mycobacterium tuberculosis, the etiological agent of tuberculosis, is a challenging problem. Solving this problem requires a transparent, accurate, and flexible predictive model. The methods currently used for this purpose rarely satisfy all of these criteria. On the one hand, approaches based on testing strains against a catalogue of previously identified mutations often yield poor predictive performance; on the other hand, machine learning techniques typically have higher predictive accuracy, but often lack interpretability and may learn patterns that produce accurate predictions for the wrong reasons. Current interpretable methods may either exhibit a lower accuracy or lack the flexibility needed to generalize them to previously unseen data. CONTRIBUTION: In this paper we propose a novel technique, inspired by group testing and Boolean compressed sensing, which yields highly accurate predictions, interpretable results, and is flexible enough to be optimized for various evaluation metrics at the same time. RESULTS: We test the predictive accuracy of our approach on five first-line and seven second-line antibiotics used for treating tuberculosis. We find that its accuracy is higher than or comparable to that of commonly used machine learning models, and that it is able to identify variants in genes with previously reported association to drug resistance. Our method is intrinsically interpretable, and can be customized for different evaluation metrics. Our implementation is available at github.com/hoomanzabeti/INGOT_DR and can be installed via the Python Package Index (PyPI) under ingotdr. This package is also compatible with most of the tools in the Scikit-learn machine learning library.

15.
Science ; 371(6531)2021 02 19.
Article in English | MEDLINE | ID: mdl-33323424

ABSTRACT

Governments are attempting to control the COVID-19 pandemic with nonpharmaceutical interventions (NPIs). However, the effectiveness of different NPIs at reducing transmission is poorly understood. We gathered chronological data on the implementation of NPIs for several European and non-European countries between January and the end of May 2020. We estimated the effectiveness of these NPIs, which range from limiting gathering sizes and closing businesses or educational institutions to stay-at-home orders. To do so, we used a Bayesian hierarchical model that links NPI implementation dates to national case and death counts and supported the results with extensive empirical validation. Closing all educational institutions, limiting gatherings to 10 people or less, and closing face-to-face businesses each reduced transmission considerably. The additional effect of stay-at-home orders was comparatively small.


Subjects
COVID-19/prevention & control, Communicable Disease Control, Government, Asia/epidemiology, Bayes Theorem, COVID-19/transmission, Commerce, Europe/epidemiology, Health Policy, Humans, Theoretical Models, Pandemics/prevention & control, Physical Distancing, Schools, Universities
16.
Comput Biol Chem ; 87: 107284, 2020 May 19.
Article in English | MEDLINE | ID: mdl-32599459

ABSTRACT

With the exponential growth of genome databases, the importance of phylogenetics has increased dramatically over the past years. Studying phylogenetic trees enables us not only to understand how genes, genomes, and species evolve, but also helps us predict how they might change in the future. One of the crucial aspects of phylogenetics is the comparison of two or more phylogenetic trees. There are different metrics for computing the dissimilarity between a pair of trees. The Robinson-Foulds (RF) distance is one of the widely used metrics on the space of labeled trees. The distribution of the RF distance from a given tree has been studied before, but the fastest known algorithm for computing this distribution is a slow, albeit polynomial-time, $O(l^5)$ algorithm. In this paper, we modify the dynamic programming algorithm for computing the distribution of this distance for a given tree by leveraging the number-theoretic transform (NTT), and improve the running time from $O(l^5)$ to $O(l^3 \log l)$, where $l$ is the number of tips of the tree. In addition to its practical usefulness, our method represents a theoretical novelty, as it is, to our knowledge, one of the rare applications of the number-theoretic transform for solving a computational biology problem.
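
For readers unfamiliar with the RF distance itself, the sketch below computes it for a pair of rooted trees as the size of the symmetric difference of their clade sets. This is the rooted, clade-based variant and a plain pairwise computation; it does not compute the distribution discussed in the abstract and does not use the NTT. The nested-tuple encoding and function names are assumptions made for this example.

```python
def clades(tree, acc=None):
    """Return (leaf set of this subtree, set of clades of all internal nodes)."""
    if acc is None:
        acc = set()
    if not isinstance(tree, tuple):
        return frozenset([tree]), acc
    leaves = frozenset()
    for child in tree:
        child_leaves, _ = clades(child, acc)   # acc is shared and filled in place
        leaves |= child_leaves
    acc.add(leaves)
    return leaves, acc


def rf_distance(tree1, tree2):
    """Number of clades present in exactly one of the two rooted trees."""
    _, clades1 = clades(tree1)
    _, clades2 = clades(tree2)
    return len(clades1 ^ clades2)


t1 = (("a", "b"), ("c", "d"))
t2 = (("a", "c"), ("b", "d"))
t3 = ((("a", "b"), "c"), "d")

print(rf_distance(t1, t1))
print(rf_distance(t1, t2))
print(rf_distance(t1, t3))
```

The root clade containing all tips appears in both trees and therefore cancels; for unrooted trees the same idea applies with bipartitions in place of clades.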

18.
BMC Bioinformatics ; 20(Suppl 20): 637, 2019 Dec 17.
Article in English | MEDLINE | ID: mdl-31842753

ABSTRACT

BACKGROUND: Bacterial pathogens exhibit an impressive amount of genomic diversity. This diversity can be informative of evolutionary adaptations, host-pathogen interactions, and disease transmission patterns. However, capturing this diversity directly from biological samples is challenging. RESULTS: We introduce a framework for understanding the within-host diversity of a pathogen using multi-locus sequence types (MLST) from whole-genome sequencing (WGS) data. Our approach consists of two stages. First we process each sample individually by assigning it, for each locus in the MLST scheme, a set of alleles and a proportion for each allele. Next, we associate to each sample a set of strain types using the alleles and the strain proportions obtained in the first step. We achieve this by using the smallest possible number of previously unobserved strains across all samples, while using those unobserved strains which are as close to the observed ones as possible, at the same time respecting the allele proportions as closely as possible. We solve both problems using mixed integer linear programming (MILP). Our method performs accurately on simulated data and generates results on a real data set of Borrelia burgdorferi genomes suggesting a high level of diversity for this pathogen. CONCLUSIONS: Our approach can apply to any bacterial pathogen with an MLST scheme, even though we developed it with Borrelia burgdorferi, the etiological agent of Lyme disease, in mind. Our work paves the way for robust strain typing in the presence of within-host heterogeneity, overcoming an essential challenge currently not addressed by any existing methodology for pathogen genomics.


Subjects
Genetic Variation, Host-Pathogen Interactions/genetics, Multilocus Sequence Typing, Alleles, Borrelia burgdorferi/genetics, Computer Simulation, Genetic Databases, Genetic Loci, Biological Models
19.
Algorithms Mol Biol ; 14: 16, 2019.
Article in English | MEDLINE | ID: mdl-31832081

ABSTRACT

BACKGROUND: The area of genome rearrangements has given rise to a number of interesting biological, mathematical and algorithmic problems. Among these, one of the most intractable ones has been that of finding the median of three genomes, a special case of the ancestral reconstruction problem. In this work we re-examine our recently proposed way of measuring genome rearrangement distance, namely, the rank distance between the matrix representations of the corresponding genomes, and show that the median of three genomes can be computed exactly in polynomial time $O(n^\omega)$, where $\omega \le 3$, with respect to this distance, when the median is allowed to be an arbitrary orthogonal matrix. RESULTS: We define the five fundamental subspaces depending on three input genomes, and use their properties to show that a particular action on each of these subspaces produces a median. In the process we introduce the notion of M-stable subspaces. We also show that the median found by our algorithm is always orthogonal, symmetric, and conserves any adjacencies or telomeres present in at least 2 out of 3 input genomes. CONCLUSIONS: We test our method on both simulated and real data. We find that the majority of the realistic inputs result in genomic outputs, and for those that do not, our two heuristics perform well in terms of reconstructing a genomic matrix attaining a score close to the lower bound, while running in a reasonable amount of time. We conclude that the rank distance is not only theoretically intriguing, but also practically useful for median-finding, and potentially ancestral genome reconstruction.
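
The rank distance between two genomes can be illustrated with a short sketch that builds a matrix for each genome and takes the rank of their difference. The encoding used here (a symmetric 0/1 matrix over gene extremities, with a 1 at positions (i, j) and (j, i) for each adjacency and a 1 on the diagonal for each telomere) is an assumption for illustration, as are the function names and the toy genomes; the median algorithm of the abstract is not reproduced.

```python
from fractions import Fraction


def genome_matrix(n_extremities, adjacencies):
    """Build the assumed matrix encoding: adjacencies off-diagonal, telomeres on it."""
    matrix = [[0] * n_extremities for _ in range(n_extremities)]
    paired = set()
    for i, j in adjacencies:
        matrix[i][j] = matrix[j][i] = 1
        paired.update((i, j))
    for i in range(n_extremities):
        if i not in paired:
            matrix[i][i] = 1          # unpaired extremity is a telomere
    return matrix


def matrix_rank(matrix):
    """Exact rank by Gaussian elimination over the rationals."""
    rows = [[Fraction(x) for x in row] for row in matrix]
    rank = 0
    n_cols = len(rows[0]) if rows else 0
    for col in range(n_cols):
        pivot = next((r for r in range(rank, len(rows)) if rows[r][col] != 0), None)
        if pivot is None:
            continue
        rows[rank], rows[pivot] = rows[pivot], rows[rank]
        for r in range(rank + 1, len(rows)):
            if rows[r][col] != 0:
                factor = rows[r][col] / rows[rank][col]
                rows[r] = [x - factor * y for x, y in zip(rows[r], rows[rank])]
        rank += 1
    return rank


def rank_distance(a, b):
    """d(A, B) = rank(A - B)."""
    return matrix_rank([[x - y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)])


# Two genes -> four extremities: 0, 1 (gene 1) and 2, 3 (gene 2)
one_chromosome = genome_matrix(4, [(1, 2)])   # genes joined head-to-tail
two_chromosomes = genome_matrix(4, [])        # the same genes, separated
rejoined = genome_matrix(4, [(0, 2)])         # joined at different extremities

print(rank_distance(one_chromosome, two_chromosomes))
print(rank_distance(one_chromosome, rejoined))
```

Under this encoding a single cut changes the rank of the difference by one, while replacing one adjacency with another costs two.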

20.
PLoS One ; 14(11): e0224197, 2019.
Article in English | MEDLINE | ID: mdl-31751352

ABSTRACT

Phylogenetic trees are frequently used in biology to study the relationships between a number of species or organisms. The shape of a phylogenetic tree contains useful information about patterns of speciation and extinction, so powerful tools are needed to investigate the shape of a phylogenetic tree. Tree shape statistics are a common approach to quantifying the shape of a phylogenetic tree by encoding it with a single number. In this article, we propose a new resolution function to evaluate the power of different tree shape statistics to distinguish between dissimilar trees. We show that the new resolution function requires less time and space in comparison with the previously proposed resolution function for tree shape statistics. We also introduce a new class of tree shape statistics, which are linear combinations of two existing statistics that are optimal with respect to a resolution function, and show evidence that the statistics in this class converge to a limiting linear combination as the size of the tree increases. Our implementation is freely available at https://github.com/WGS-TB/TreeShapeStats.


Subjects
Computational Biology/methods, Genetic Models, Phylogeny, Statistical Data Interpretation