1.
Bioinformatics ; 39(11)2023 11 01.
Article in English | MEDLINE | ID: mdl-37871178

ABSTRACT

SUMMARY: Fastlin is a bioinformatics tool designed for rapid Mycobacterium tuberculosis complex (MTBC) lineage typing. It utilizes an ultra-fast alignment-free approach to detect previously identified barcode single nucleotide polymorphisms associated with specific MTBC lineages. In a comprehensive benchmarking against existing tools, fastlin demonstrated high accuracy and significantly faster running times. AVAILABILITY AND IMPLEMENTATION: fastlin is freely available at https://github.com/rderelle/fastlin and can easily be installed using Conda.
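
The alignment-free idea described above can be illustrated with a minimal sketch. This is not fastlin's implementation: the lineage names, the flanking sequences, the k-mer length and the function names are all hypothetical, chosen only to show how barcode SNPs can be detected by exact k-mer lookup in raw reads, without aligning them to a reference.

```python
from collections import Counter

COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def reverse_complement(seq):
    """Return the reverse complement of a DNA string over A, C, G, T."""
    return "".join(COMPLEMENT[base] for base in reversed(seq))


def build_barcode_index(barcodes, k):
    """Map each barcode k-mer, and its reverse complement, to a lineage.

    barcodes: iterable of (lineage, left_flank, alt_base, right_flank).
    All values here are hypothetical; a real tool ships curated barcodes.
    """
    index = {}
    for lineage, left, alt, right in barcodes:
        kmer = left + alt + right
        if len(kmer) != k:
            raise ValueError(f"barcode for {lineage} is not {k} bases long")
        index[kmer] = lineage
        index[reverse_complement(kmer)] = lineage
    return index


def count_lineage_kmers(reads, index, k):
    """Count, per lineage, the read k-mers that exactly match a barcode."""
    counts = Counter()
    for read in reads:
        for start in range(len(read) - k + 1):
            lineage = index.get(read[start:start + k])
            if lineage is not None:
                counts[lineage] += 1
    return counts


# Toy data (hypothetical sequences, k = 11).
barcodes = [
    ("L2", "GATTA", "C", "AGGCT"),
    ("L4", "CCGTA", "T", "GGACA"),
]
index = build_barcode_index(barcodes, k=11)
reads = [
    "TTGATTACAGGCTAA",   # contains the L2 barcode k-mer (forward strand)
    "AGCCTGTAATCGG",     # contains the L2 barcode k-mer (reverse strand)
    "ACGTACGTACGTACG",   # contains no barcode k-mer
]
counts = count_lineage_kmers(reads, index, k=11)
print(counts)  # Counter({'L2': 2})
```

A real typing tool would also need thresholds on k-mer counts to separate true lineage signal from sequencing errors; that step is left out of this sketch.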


Subjects
Mycobacterium tuberculosis, Mycobacterium tuberculosis/genetics, Computational Biology, Single Nucleotide Polymorphism, Software
2.
Bioinformatics ; 39(3)2023 03 01.
Article in English | MEDLINE | ID: mdl-36790056

ABSTRACT

MOTIVATION: The rank distance model represents genome rearrangements in multi-chromosomal genomes as matrix operations, which allows the reconstruction of parsimonious histories of evolution by rearrangements. We seek to generalize this model by allowing for genomes with different gene content, to accommodate a broader range of biological contexts. We approach this generalization by using a matrix representation of genomes. This leads to simple distance formulas and sorting algorithms for genomes with different gene contents, but without duplications. RESULTS: We generalize the rank distance to genomes with different gene content in two different ways. The first approach adds insertions, deletions and the substitution of a single extremity to the basic operations. We show how to efficiently compute this distance. To avoid genomes with incomplete markers, our alternative distance, the rank-indel distance, only uses insertions and deletions of entire chromosomes. We construct phylogenetic trees with our distances and the DCJ-Indel distance for simulated data and real prokaryotic genomes, and compare them against reference trees. For simulated data, our distances outperform the DCJ-Indel distance using the Quartet metric as baseline. This suggests that rank distances are more robust for comparing distantly related species. For real prokaryotic genomes, all rearrangement-based distances yield phylogenetic trees that are topologically distant from the reference (65% similarity with Quartet metric), but are able to cluster related species within their respective clades and distinguish the Shigella strains as the farthest relative of the Escherichia coli strains, a feature not seen in the reference tree. AVAILABILITY AND IMPLEMENTATION: Code and instructions are available at https://github.com/meidanis-lab/rank-indel. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.


Subjects
Genomics, Genetic Models, Phylogeny, Genome, INDEL Mutation, Algorithms
3.
PLoS Comput Biol ; 19(6): e1011129, 2023 Jun.
Article in English | MEDLINE | ID: mdl-37347768

ABSTRACT

The increasing availability of high-throughput sequencing (frequently termed next-generation sequencing (NGS)) data has created opportunities to gain deeper insights into the mechanisms of a number of diseases and is already impacting many areas of medicine and public health. The area of infectious diseases stands somewhat apart from other human diseases insofar as the relevant genomic data comes from the microbes rather than their human hosts. A particular concern about the threat of antimicrobial resistance (AMR) has driven the collection and reporting of large-scale datasets containing information from microbial genomes together with antimicrobial susceptibility test (AST) results. Unfortunately, the lack of clear standards or guiding principles for the reporting of such data is hampering the field's advancement. We therefore present our recommendations for the publication and sharing of genotype and phenotype data on AMR, in the form of 10 simple rules. The adoption of these recommendations will enhance AMR data interoperability and help enable its large-scale analyses using computational biology tools, including mathematical modelling and machine learning. We hope that these rules can shed light on often overlooked but nonetheless very necessary aspects of AMR data sharing and enhance the field's ability to address the problems of understanding AMR mechanisms, tracking their emergence and spread in populations, and predicting microbial susceptibility to antimicrobials for diagnostic purposes.


Subjects
Anti-Bacterial Agents, Anti-Infective Agents, Humans, Anti-Bacterial Agents/pharmacology, Bacterial Drug Resistance/genetics, Bacteria/genetics, Microbial Genome, Genotype, Phenotype
4.
BMC Bioinformatics ; 23(1): 42, 2022 Jan 15.
Article in English | MEDLINE | ID: mdl-35033007

ABSTRACT

BACKGROUND: There has been a simultaneous increase in demand and accessibility across genomics, transcriptomics, proteomics and metabolomics data, known as omics data. This has encouraged widespread application of omics data in life sciences, from personalized medicine to the discovery of underlying pathophysiology of diseases. Causal analysis of omics data may provide important insight into the underlying biological mechanisms. Existing causal analysis methods yield promising results when identifying potential general causes of an observed outcome based on omics data. However, they may fail to discover the causes specific to a particular stratum of individuals and missing from others. METHODS: To fill this gap, we introduce the problem of stratified causal discovery and propose a method, Aristotle, for solving it. Aristotle addresses the two challenges intrinsic to omics data: high dimensionality and hidden stratification. It employs existing biological knowledge and a state-of-the-art patient stratification method to tackle the above challenges and applies a quasi-experimental design method to each stratum to find stratum-specific potential causes. RESULTS: Evaluation based on synthetic data shows better performance for Aristotle in discovering true causes under different conditions compared to existing causal discovery methods. Experiments on a real dataset on Anthracycline Cardiotoxicity indicate that Aristotle's predictions are consistent with the existing literature. Moreover, Aristotle makes additional predictions that suggest further investigations.


Subjects
Genomics, Proteomics, Humans, Metabolomics, Precision Medicine, Transcriptome
5.
Bioinformatics ; 35(14): i615-i623, 2019 07 15.
Article in English | MEDLINE | ID: mdl-31510702

ABSTRACT

MOTIVATION: Constraint-based modeling of metabolic networks helps researchers gain insight into the metabolic processes of many organisms, both prokaryotic and eukaryotic. Minimal cut sets (MCSs) are minimal sets of reactions whose inhibition blocks a target reaction in a metabolic network. Most approaches for finding the MCSs in constraint-based models require, either as an intermediate step or as a byproduct of the calculation, the computation of the set of elementary flux modes (EFMs), a convex basis for the valid flux vectors in the network. Recently, Ballerstein et al. proposed a method for computing the MCSs of a network without first computing its EFMs, by creating a dual network whose EFMs are a superset of the MCSs of the original network. However, their dual network is always larger than the original network and depends on the target reaction. Here we propose the construction of a different dual network, which is typically smaller than the original network and is independent of the target reaction, for the same purpose. We prove the correctness of our approach, minimal coordinated support (MCS2), and describe how it can be modified to compute the few smallest MCSs for a given target reaction. RESULTS: We compare MCS2 to the method of Ballerstein et al. and two other existing methods. We show that MCS2 succeeds in calculating the full set of MCSs in many models where other approaches cannot finish within a reasonable amount of time. Thus, in addition to its theoretical novelty, our approach provides a practical advantage over existing methods. AVAILABILITY AND IMPLEMENTATION: MCS2 is freely available at https://github.com/RezaMash/MCS under the GNU 3.0 license. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
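
To make the relationship between EFMs and MCSs concrete, here is a minimal brute-force sketch of the EFM-based route, that is, the route the abstract says MCS2 avoids. It is not the MCS2 algorithm. It assumes each EFM is given simply as the set of reactions it uses, it excludes the target reaction itself from the candidate cuts, and the reaction names are hypothetical.

```python
from itertools import combinations


def minimal_cut_sets(efms, target):
    """Enumerate minimal cut sets for `target` by brute force.

    efms: list of sets of reaction names (the support of each EFM).
    A cut set must hit every EFM that uses the target reaction; it is
    minimal if no proper subset is also a cut set. The target itself
    is not allowed in a cut set (an assumption of this sketch).
    """
    target_modes = [frozenset(mode) - {target} for mode in efms if target in mode]
    reactions = sorted(set().union(*target_modes)) if target_modes else []
    found = []
    # Candidates are tried by increasing size, so any candidate containing
    # an already found cut set is a non-minimal superset and is skipped.
    for size in range(1, len(reactions) + 1):
        for candidate in combinations(reactions, size):
            cut = frozenset(candidate)
            if any(previous <= cut for previous in found):
                continue
            if all(cut & mode for mode in target_modes):
                found.append(cut)
    return found


# Toy network: two EFMs carry flux through target "t", one does not.
efms = [{"r1", "r2", "t"}, {"r1", "r3", "t"}, {"r4", "r5"}]
mcs = minimal_cut_sets(efms, "t")
print([sorted(cut) for cut in mcs])  # [['r1'], ['r2', 'r3']]
```

The enumeration is exponential in the number of reactions, which is one reason methods that avoid listing all EFMs first are of practical interest.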


Subjects
Algorithms, Metabolic Networks and Pathways, Biological Models
6.
Bioinformatics ; 35(14): i379-i388, 2019 07 15.
Article in English | MEDLINE | ID: mdl-31510674

ABSTRACT

MOTIVATION: Despite the remarkable advances in sequencing and computational techniques, noise in the data and complexity of the underlying biological mechanisms render deconvolution of the phylogenetic relationships between cancer mutations difficult. Moreover, the majority of the existing datasets consist of bulk sequencing data from a single tumor sample per individual. Accurate inference of the phylogenetic order of mutations is particularly challenging in these cases and the existing methods are faced with several theoretical limitations. To overcome these limitations, new methods are required for integrating and harnessing the full potential of the existing data. RESULTS: We introduce a method called Hintra for intra-tumor heterogeneity detection. Hintra integrates sequencing data for a cohort of tumors and infers tumor phylogeny for each individual based on the evolutionary information shared between different tumors. Through an iterative process, Hintra learns the repeating evolutionary patterns and uses this information for resolving the phylogenetic ambiguities of individual tumors. The results of synthetic experiments show an improved performance compared to two state-of-the-art methods. The experimental results with a recent Breast Cancer dataset are consistent with the existing knowledge and provide potentially interesting findings. AVAILABILITY AND IMPLEMENTATION: The source code for Hintra is available at https://github.com/sahandk/HINTRA.


Subjects
Neoplasms, Software, Humans, Mutation, Phylogeny, Sequence Analysis
7.
BMC Bioinformatics ; 20(Suppl 20): 637, 2019 Dec 17.
Article in English | MEDLINE | ID: mdl-31842753

ABSTRACT

BACKGROUND: Bacterial pathogens exhibit an impressive amount of genomic diversity. This diversity can be informative of evolutionary adaptations, host-pathogen interactions, and disease transmission patterns. However, capturing this diversity directly from biological samples is challenging. RESULTS: We introduce a framework for understanding the within-host diversity of a pathogen using multi-locus sequence types (MLST) from whole-genome sequencing (WGS) data. Our approach consists of two stages. First we process each sample individually by assigning it, for each locus in the MLST scheme, a set of alleles and a proportion for each allele. Next, we associate to each sample a set of strain types using the alleles and the strain proportions obtained in the first step. We achieve this by using the smallest possible number of previously unobserved strains across all samples, while using those unobserved strains which are as close to the observed ones as possible, at the same time respecting the allele proportions as closely as possible. We solve both problems using mixed integer linear programming (MILP). Our method performs accurately on simulated data and generates results on a real data set of Borrelia burgdorferi genomes suggesting a high level of diversity for this pathogen. CONCLUSIONS: Our approach can apply to any bacterial pathogen with an MLST scheme, even though we developed it with Borrelia burgdorferi, the etiological agent of Lyme disease, in mind. Our work paves the way for robust strain typing in the presence of within-host heterogeneity, overcoming an essential challenge currently not addressed by any existing methodology for pathogen genomics.


Subjects
Genetic Variation, Host-Pathogen Interactions/genetics, Multilocus Sequence Typing, Alleles, Borrelia burgdorferi/genetics, Computer Simulation, Genetic Databases, Genetic Loci, Biological Models
8.
BMC Bioinformatics ; 19(Suppl 6): 142, 2018 05 08.
Article in English | MEDLINE | ID: mdl-29745865

ABSTRACT

BACKGROUND: Recently, Pereira Zanetti, Biller and Meidanis have proposed a new definition of a rearrangement distance between genomes. In this formulation, each genome is represented as a matrix, and the distance d is the rank distance between these matrices. Although defined in terms of matrices, the rank distance is equal to the minimum total weight of a series of weighted operations that leads from one genome to the other, including inversions, translocations, transpositions, and others. The computational complexity of the median-of-three problem according to this distance is currently unknown. The genome matrices are a special kind of permutation matrices, which we study in this paper. In their paper, the authors provide an [Formula: see text] algorithm for determining three candidate medians, prove the tight approximation ratio [Formula: see text], and provide a sufficient condition for their candidates to be true medians. They also conduct some experiments that suggest that their method is accurate on simulated and real data. RESULTS: In this paper, we extend their results and provide the following: (i) three invariants characterizing the problem of finding the median of 3 matrices; (ii) a sufficient condition for uniqueness of medians that can be checked in O(n); (iii) a faster, [Formula: see text] algorithm for determining the median under this condition; (iv) a new heuristic algorithm for this problem based on compressed sensing; and (v) a [Formula: see text] algorithm that exactly solves the problem when the inputs are orthogonal matrices, a class that includes both permutations and genomes as special cases. CONCLUSIONS: Our work provides the first proof that, with respect to the rank distance, the problem of finding the median of 3 genomes, as well as the median of 3 permutations, is exactly solvable in polynomial time, a result which should be contrasted with its NP-hardness for the DCJ (double cut-and-join) distance and most other families of genome rearrangement operations. This result, backed by our experimental tests, indicates that the rank distance is a viable alternative to the DCJ distance widely used in genome comparisons.
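
The definition of the rank distance as the rank of the difference of two genome matrices can be sketched in a few lines. The encoding below is an assumption made for illustration: gene extremities are numbered 0..n-1, an adjacency between extremities i and j sets entries (i, j) and (j, i) to 1, and an unpaired extremity (a telomere) sets its diagonal entry to 1. The rank is computed by exact Gaussian elimination over rationals. This illustrates the distance only, not the median algorithms of the paper.

```python
from fractions import Fraction


def genome_matrix(n, adjacencies):
    """Build an n x n symmetric 0/1 matrix from a list of adjacencies.

    Extremities that appear in no adjacency are treated as telomeres
    and receive a 1 on the diagonal (encoding assumed for this sketch).
    """
    matrix = [[0] * n for _ in range(n)]
    paired = set()
    for i, j in adjacencies:
        matrix[i][j] = matrix[j][i] = 1
        paired.update((i, j))
    for i in range(n):
        if i not in paired:
            matrix[i][i] = 1
    return matrix


def rank(matrix):
    """Rank of a matrix, by exact Gaussian elimination over rationals."""
    rows = [[Fraction(x) for x in row] for row in matrix]
    n_cols = len(rows[0]) if rows else 0
    r = 0
    for col in range(n_cols):
        pivot = next((i for i in range(r, len(rows)) if rows[i][col] != 0), None)
        if pivot is None:
            continue
        rows[r], rows[pivot] = rows[pivot], rows[r]
        for i in range(r + 1, len(rows)):
            if rows[i][col] != 0:
                factor = rows[i][col] / rows[r][col]
                rows[i] = [a - factor * b for a, b in zip(rows[i], rows[r])]
        r += 1
    return r


def rank_distance(a, b):
    """d(A, B) = rank(A - B)."""
    diff = [[x - y for x, y in zip(row_a, row_b)] for row_a, row_b in zip(a, b)]
    return rank(diff)


# Two toy genomes on four extremities (two genes, one linear chromosome).
genome_a = genome_matrix(4, [(1, 2)])  # telomeres 0 and 3
genome_b = genome_matrix(4, [(1, 3)])  # telomeres 0 and 2
print(rank_distance(genome_a, genome_b))  # 2
print(rank_distance(genome_a, genome_a))  # 0
```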


Subjects
Genetic Models, Algorithms, Computer Simulation, Genetic Databases, Gene Rearrangement, Genome, Genomics/methods, Mutation/genetics
9.
Emerg Infect Dis ; 23(11): 1887-1890, 2017 11.
Article in English | MEDLINE | ID: mdl-29048297

ABSTRACT

Because within-host Mycobacterium tuberculosis diversity complicates diagnosis and treatment of tuberculosis (TB), we measured diversity prevalence and associated factors among 3,098 pulmonary TB patients in Lima, Peru. The 161 patients with polyclonal infection were more likely than the 115 with clonal or the 2,822 with simple infections to have multidrug-resistant TB.


Subjects
Mycobacterium tuberculosis/genetics, Multidrug-Resistant Tuberculosis, Pulmonary Tuberculosis/microbiology, Adolescent, Adult, Cohort Studies, Female, Genetic Variation, Humans, Male, Middle Aged, Mycobacterium tuberculosis/classification, Mycobacterium tuberculosis/drug effects, Mycobacterium tuberculosis/isolation & purification, Peru/epidemiology, Prevalence, Risk, Pulmonary Tuberculosis/epidemiology, Young Adult
10.
Eur Respir J ; 50(6)2017 12.
Article in English | MEDLINE | ID: mdl-29284687

ABSTRACT

A clear understanding of the genetic basis of antibiotic resistance in Mycobacterium tuberculosis is required to accelerate the development of rapid drug susceptibility testing methods based on genetic sequence. Raw genotype-phenotype correlation data were extracted as part of a comprehensive systematic review to develop a standardised analytical approach for interpreting resistance associated mutations for rifampicin, isoniazid, ofloxacin/levofloxacin, moxifloxacin, amikacin, kanamycin, capreomycin, streptomycin, ethionamide/prothionamide and pyrazinamide. Mutation frequencies in resistant and susceptible isolates were calculated, together with novel statistical measures to classify mutations as high, moderate, minimal or indeterminate confidence for predicting resistance. We identified 286 confidence-graded mutations associated with resistance. Compared to phenotypic methods, sensitivity (95% CI) for rifampicin was 90.3% (89.6-90.9%), while for isoniazid it was 78.2% (77.4-79.0%) and their specificities were 96.3% (95.7-96.8%) and 94.4% (93.1-95.5%), respectively. For second-line drugs, sensitivity varied from 67.4% (64.1-70.6%) for capreomycin to 88.2% (85.1-90.9%) for moxifloxacin, with specificity ranging from 90.0% (87.1-92.5%) for moxifloxacin to 99.5% (99.0-99.8%) for amikacin. This study provides a standardised and comprehensive approach for the interpretation of mutations as predictors of M. tuberculosis drug-resistant phenotypes. These data have implications for the clinical interpretation of molecular diagnostics and next-generation sequencing as well as efficient individualised therapy for patients with drug-resistant tuberculosis.


Subjects
Antitubercular Agents/pharmacology, Statistical Data Interpretation, Multiple Bacterial Drug Resistance/genetics, Mycobacterium tuberculosis/genetics, Multidrug-Resistant Tuberculosis/diagnosis, Bacterial Proteins/genetics, Bacterial DNA/genetics, Genotype, Humans, Microbial Sensitivity Tests, Mutation, Mycobacterium tuberculosis/drug effects, Phenotype, DNA Sequence Analysis, Systematic Reviews as Topic, Multidrug-Resistant Tuberculosis/microbiology
11.
PLoS Comput Biol ; 12(2): e1004475, 2016 Feb.
Article in English | MEDLINE | ID: mdl-26829497

ABSTRACT

Genomic tools have revealed genetically diverse pathogens within some hosts. Within-host pathogen diversity, which we refer to as "complex infection", is increasingly recognized as a determinant of treatment outcome for infections like tuberculosis. Complex infection arises through two mechanisms: within-host mutation (which results in clonal heterogeneity) and reinfection (which results in mixed infections). Estimates of the frequency of within-host mutation and reinfection in populations are critical for understanding the natural history of disease. These estimates influence projections of disease trends and effects of interventions. The genotyping technique MLVA (multiple loci variable-number tandem repeats analysis) can identify complex infections, but the current method to distinguish clonal heterogeneity from mixed infections is based on a rather simple rule. Here we describe ClassTR, a method which leverages MLVA information from isolates collected in a population to distinguish mixed infections from clonal heterogeneity. We formulate the resolution of complex infections into their constituent strains as an optimization problem, and show its NP-completeness. We solve it efficiently by using mixed integer linear programming and graph decomposition. Once the complex infections are resolved into their constituent strains, ClassTR probabilistically classifies isolates as clonally heterogeneous or mixed by using a model of tandem repeat evolution. We first compare ClassTR with the standard rule-based classification on 100 simulated datasets. ClassTR outperforms the standard method, improving classification accuracy from 48% to 80%. We then apply ClassTR to a sample of 436 strains collected from tuberculosis patients in a South African community, of which 92 had complex infections. We find that ClassTR assigns an alternate classification to 18 of the 92 complex infections, suggesting important differences in practice. 
By explicitly modeling tandem repeat evolution, ClassTR helps to improve our understanding of the mechanisms driving within-host diversity of pathogens like Mycobacterium tuberculosis.


Subjects
Minisatellite Repeats/genetics, Mycobacterium tuberculosis/genetics, Tuberculosis/microbiology, Algorithms, Computational Biology, Genetic Databases, Humans, Reproducibility of Results
12.
J Infect Dis ; 213(11): 1796-9, 2016 06 01.
Article in English | MEDLINE | ID: mdl-26768249

ABSTRACT

The clinical management of tuberculosis is a major challenge in southern Africa. The prevalence of within-host genetically heterogeneous Mycobacterium tuberculosis infection and its effect on treatment response are not well understood. We enrolled 500 patients with tuberculosis in KwaZulu-Natal and followed them through 2 months of treatment. Using mycobacterial interspersed repetitive units-variable number of tandem repeats genotyping to identify mycobacterial heterogeneity, we report the prevalence and evaluate the association of heterogeneity with treatment response. Upon initiation of treatment, 21.1% of participants harbored a heterogeneous M. tuberculosis infection; such heterogeneity was independently associated with a nearly 2-fold higher odds of persistent culture positivity after 2 months of treatment (adjusted odds ratio, 1.90; 95% confidence interval, 1.03-3.50).


Subjects
Antitubercular Agents/therapeutic use, Genetic Heterogeneity, Mycobacterium tuberculosis/genetics, Pulmonary Tuberculosis/microbiology, Adult, Cohort Studies, Female, Follow-Up Studies, HIV Infections/complications, Humans, Male, Prospective Studies, South Africa, Sputum/microbiology, Time-to-Treatment, Pulmonary Tuberculosis/complications, Pulmonary Tuberculosis/drug therapy
13.
Bioinformatics ; 29(21): 2765-73, 2013 Nov 01.
Article in English | MEDLINE | ID: mdl-24048352

ABSTRACT

MOTIVATION: The global alignment of protein interaction networks is a widely studied problem. It is an important first step in understanding the relationship between the proteins in different species and identifying functional orthologs. Furthermore, it can provide useful insights into the species' evolution. RESULTS: We propose a novel algorithm, PISwap, for optimizing global pairwise alignments of protein interaction networks, based on a local optimization heuristic that has previously demonstrated its effectiveness for a variety of other intractable problems. PISwap can begin with different types of network alignment approaches and then iteratively adjust the initial alignments by incorporating network topology information, trading it off for sequence information. In practice, our algorithm efficiently refines other well-studied alignment techniques with almost no additional time cost. We also show the robustness of the algorithm to noise in protein interaction data. In addition, the flexible nature of this algorithm makes it suitable for different applications of network alignment. This algorithm can yield interesting insights into the evolutionary dynamics of related species. AVAILABILITY: Our software is freely available for non-commercial purposes from our Web site, http://piswap.csail.mit.edu/. CONTACT: bab@csail.mit.edu or csliao@ie.nthu.edu.tw. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.


Subjects
Algorithms, Protein Interaction Mapping/methods, Animals, Biological Evolution, Humans, Proteins/chemistry, Software
14.
Microb Genom ; 10(3)2024 Mar.
Article in English | MEDLINE | ID: mdl-38529944

ABSTRACT

Minimum Inhibitory Concentrations (MICs) are the gold standard for quantitatively measuring antibiotic resistance. However, lab-based MIC determination can be time-consuming and suffers from low reproducibility, and interpretation as sensitive or resistant relies on guidelines which change over time. Genome sequencing and machine learning promise to allow in silico MIC prediction as an alternative approach which overcomes some of these difficulties, albeit the interpretation of MIC is still needed. Nevertheless, precisely how we should handle MIC data when dealing with predictive models remains unclear, since they are measured semi-quantitatively, with varying resolution, and are typically also left- and right-censored within varying ranges. We therefore investigated genome-based prediction of MICs in the pathogen Klebsiella pneumoniae using 4367 genomes with both simulated semi-quantitative traits and real MICs. As we were focused on clinical interpretation, we used interpretable rather than black-box machine learning models, namely, Elastic Net, Random Forests, and linear mixed models. Simulated traits were generated accounting for oligogenic, polygenic, and homoplastic genetic effects with different levels of heritability. Then we assessed how model prediction accuracy was affected when MICs were framed as regression and classification. Our results showed that treating the MICs differently depending on the number of concentration levels of antibiotic available was the most promising learning strategy. Specifically, to optimise both prediction accuracy and inference of the correct causal variants, we recommend considering the MICs as continuous and framing the learning problem as a regression when the number of observed antibiotic concentration levels is large, whereas with a smaller number of concentration levels they should be treated as a categorical variable and the learning problem should be framed as a classification. 
Our findings also underline how predictive models can be improved when prior biological knowledge is taken into account, due to the varying genetic architecture of each antibiotic resistance trait. Finally, we emphasise that incrementing the population database is pivotal for the future clinical implementation of these models to support routine machine-learning based diagnostics.
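
The data-handling issue raised above (semi-quantitative, left- and right-censored MICs, and the choice between regression and classification) can be sketched as follows. The label format (`<=0.25`, `>32`), the function names and especially the `min_levels` threshold are hypothetical; the abstract recommends regression when the number of observed concentration levels is "large" but gives no cut-off, so the value below is a placeholder, not a result from the study.

```python
import math

# Censoring prefixes, longest first so "<=" is matched before "<".
PREFIXES = (("<=", "left"), (">=", "right"), ("<", "left"), (">", "right"))


def parse_mic(label):
    """Turn an MIC label into (log2 concentration, censoring flag).

    "<=0.25" -> (-2.0, "left"); ">32" -> (5.0, "right"); "4" -> (2.0, "none").
    The label syntax is an assumption of this sketch.
    """
    label = label.strip()
    for prefix, censoring in PREFIXES:
        if label.startswith(prefix):
            return math.log2(float(label[len(prefix):])), censoring
    return math.log2(float(label)), "none"


def choose_framing(labels, min_levels=6):
    """Pick a learning framing from the number of distinct MIC levels.

    min_levels is a hypothetical threshold, used here only to make the
    "many levels -> regression, few levels -> classification" rule concrete.
    """
    levels = {parse_mic(label)[0] for label in labels}
    return "regression" if len(levels) >= min_levels else "classification"


wide_panel = ["<=0.25", "0.5", "1", "2", "4", "8", ">32"]
narrow_panel = ["<=2", "4", ">4"]
print(parse_mic("<=0.25"))           # (-2.0, 'left')
print(choose_framing(wide_panel))    # regression
print(choose_framing(narrow_panel))  # classification
```

Keeping the censoring flag alongside the log2 value lets a downstream model treat boundary values differently from exactly measured ones, rather than silently using them as point measurements.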


Subjects
Anti-Bacterial Agents, Klebsiella pneumoniae, Klebsiella pneumoniae/genetics, Reproducibility of Results, Anti-Bacterial Agents/pharmacology, Machine Learning, Microbial Sensitivity Tests
15.
Microb Genom ; 10(6)2024 Jun.
Article in English | MEDLINE | ID: mdl-38860884

ABSTRACT

As public health laboratories expand their genomic sequencing and bioinformatics capacity for the surveillance of different pathogens, labs must carry out robust validation, training, and optimization of wet- and dry-lab procedures. Achieving these goals for algorithms, pipelines and instruments often requires that lower quality datasets be made available for analysis and comparison alongside those of higher quality. This range of data quality in reference sets can complicate the sharing of sub-optimal datasets that are vital for the community and for the reproducibility of assays. Sharing of useful, but sub-optimal datasets requires careful annotation and documentation of known issues to enable appropriate interpretation, avoid being mistaken for better quality information, and for these data (and their derivatives) to be easily identifiable in repositories. Unfortunately, there are currently no standardized attributes or mechanisms for tagging poor-quality datasets, or datasets generated for a specific purpose, to maximize their utility, searchability, accessibility and reuse. The Public Health Alliance for Genomic Epidemiology (PHA4GE) is an international community of scientists from public health, industry and academia focused on improving the reproducibility, interoperability, portability, and openness of public health bioinformatic software, skills, tools and data. To address the challenges of sharing lower quality datasets, PHA4GE has developed a set of standardized contextual data tags, namely fields and terms, that can be included in public repository submissions as a means of flagging pathogen sequence data with known quality issues, increasing their discoverability. 
The contextual data tags were developed through consultations with the community including input from the International Nucleotide Sequence Data Collaboration (INSDC), and have been standardized using ontologies - community-based resources for defining the tag properties and the relationships between them. The standardized tags are agnostic to the organism and the sequencing technique used and thus can be applied to data generated from any pathogen using an array of sequencing techniques. The tags can also be applied to synthetic (lab created) data. The list of standardized tags is maintained by PHA4GE and can be found at https://github.com/pha4ge/contextual_data_QC_tags. Definitions, ontology IDs, examples of use, as well as a JSON representation, are provided. The PHA4GE QC tags were tested, and are now implemented, by the FDA's GenomeTrakr laboratory network as part of its routine submission process for SARS-CoV-2 wastewater surveillance. We hope that these simple, standardized tags will help improve communication regarding quality control in public repositories, in addition to making datasets of variable quality more easily identifiable. Suggestions for additional tags can be submitted to PHA4GE via the New Term Request Form in the GitHub repository. By providing a mechanism for feedback and suggestions, we also expect that the tags will evolve with the needs of the community.


Subjects
Computational Biology, Public Health, Quality Control, Humans, Computational Biology/methods, Information Dissemination/methods, Reproducibility of Results, Molecular Sequence Annotation/methods, Genomics/methods, Software
16.
Bioinformatics ; 28(8): 1114-21, 2012 Apr 15.
Article in English | MEDLINE | ID: mdl-22355083

ABSTRACT

MOTIVATION: The interpretation of high-throughput datasets has remained one of the central challenges of computational biology over the past decade. Furthermore, as the amount of biological knowledge increases, it becomes more and more difficult to integrate this large body of knowledge in a meaningful manner. In this article, we propose a particular solution to both of these challenges. METHODS: We integrate available biological knowledge by constructing a network of molecular interactions of a specific kind: causal interactions. The resulting causal graph can be queried to suggest molecular hypotheses that explain the variations observed in a high-throughput gene expression experiment. We show that a simple scoring function can discriminate between a large number of competing molecular hypotheses about the upstream cause of the changes observed in a gene expression profile. We then develop an analytical method for computing the statistical significance of each score. This analytical method also helps assess the effects of random or adversarial noise on the predictive power of our model. RESULTS: Our results show that the causal graph we constructed from known biological literature is extremely robust to random noise and to missing or spurious information. We demonstrate the power of our causal reasoning model on two specific examples, one from a cancer dataset and the other from a cardiac hypertrophy experiment. We conclude that causal reasoning models provide a valuable addition to the biologist's toolkit for the interpretation of gene expression data. AVAILABILITY AND IMPLEMENTATION: R source code for the method is available upon request.


Subjects
Breast Neoplasms/genetics, Cardiomegaly/genetics, Computational Biology/methods, Gene Expression Profiling, Algorithms, Humans, Biological Models
17.
J Comput Biol ; 30(6): 678-694, 2023 06.
Article in English | MEDLINE | ID: mdl-37327036

ABSTRACT

The problem of computing the Elementary Flux Modes (EFMs) and Minimal Cut Sets (MCSs) of a metabolic network is fundamental to the study of metabolic networks. A key insight is that they can be understood as a dual pair of monotone Boolean functions (MBFs). Using this insight, this computation reduces to the question of generating from an oracle a dual pair of MBFs. If one of the two sets (functions) is known, then the other can be computed through a process known as dualization. Fredman and Khachiyan provided two algorithms, which they called simply A and B, that can serve as an engine for oracle-based generation or dualization of MBFs. We look at efficiencies available in implementing their algorithm B, which we will refer to as FK-B. Like their algorithm A, FK-B certifies whether two given MBFs in the form of Conjunctive Normal Form and Disjunctive Normal Form are dual or not, and in case of not being dual it returns a conflicting assignment (CA), that is, an assignment that makes one of the given Boolean functions True and the other one False. The FK-B algorithm is a recursive algorithm that searches through the tree of assignments to find a CA. If it does not find any CA, it means that the given Boolean functions are dual. In this article, we propose six techniques applicable to the FK-B and hence to the dualization process. Although these techniques do not reduce the time complexity, they considerably reduce the running time in practice. We evaluate the proposed improvements by applying them to compute the MCSs from the EFMs in the 19 small- and medium-sized models from the BioModels database along with 4 models of biomass synthesis in Escherichia coli that were used in an earlier computational survey by Haus et al. (2008).
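
The notion of a conflicting assignment can be made concrete with a brute-force sketch. This is not FK-B, which searches the assignment tree recursively; the code below simply tries all 2^n assignments. It assumes monotone formulas given as lists of variable sets (a DNF as a list of terms, a CNF as a list of clauses), and the variable names are hypothetical.

```python
from itertools import product


def eval_dnf(terms, true_vars):
    """A monotone DNF is True if some term has all its variables True."""
    return any(term <= true_vars for term in terms)


def eval_cnf(clauses, true_vars):
    """A monotone CNF is True if every clause has some variable True."""
    return all(clause & true_vars for clause in clauses)


def find_conflicting_assignment(cnf, dnf, variables):
    """Return a set of True variables on which the CNF and DNF disagree.

    Returns None when the two formulas agree on every assignment.
    Exhaustive search, exponential in len(variables).
    """
    for bits in product([False, True], repeat=len(variables)):
        true_vars = frozenset(v for v, bit in zip(variables, bits) if bit)
        if eval_cnf(cnf, true_vars) != eval_dnf(dnf, true_vars):
            return true_vars
    return None


variables = ["r1", "r2", "r3"]
# DNF terms play the role of EFM supports, CNF clauses that of cut sets.
dnf = [frozenset({"r1", "r2"}), frozenset({"r1", "r3"})]
complete_cnf = [frozenset({"r1"}), frozenset({"r2", "r3"})]
partial_cnf = [frozenset({"r1"})]  # the clause {r2, r3} is missing

print(find_conflicting_assignment(complete_cnf, dnf, variables))  # None
print(find_conflicting_assignment(partial_cnf, dnf, variables))   # frozenset({'r1'})
```

In an oracle-based generation loop, a conflicting assignment such as the one returned for `partial_cnf` is what indicates that a term or clause is still missing from one of the two lists.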


Subjects
Algorithms , Metabolic Networks and Pathways , Escherichia coli/metabolism , Models, Biological
18.
Article in English | MEDLINE | ID: mdl-37200133

ABSTRACT

An important problem in genome comparison is the genome sorting problem, that is, the problem of finding a sequence of basic operations that transforms one genome into another and whose (possibly weighted) length equals the distance between them. These sequences are called optimal sorting scenarios. However, there are usually many such scenarios, and a naïve algorithm is very likely to be biased towards a specific type of scenario, impairing its usefulness in real-world applications. One way to go beyond the traditional sorting algorithms is to explore all possible solutions, looking at all the optimal sorting scenarios instead of just an arbitrary one. Another related approach is to analyze all the intermediate genomes, that is, all the genomes that can occur in an optimal sorting scenario. In this paper, we show how to enumerate the optimal sorting scenarios and the intermediate genomes between any two given genomes, under the rank distance.

19.
BMC Bioinformatics ; 13: 35, 2012 Feb 20.
Article in English | MEDLINE | ID: mdl-22348444

ABSTRACT

BACKGROUND: Causal graphs are an increasingly popular tool for the analysis of biological datasets. In particular, signed causal graphs (directed graphs whose edges additionally have a sign denoting upregulation or downregulation) can be used to model regulatory networks within a cell. Such models allow prediction of downstream effects of regulation of biological entities; conversely, they also enable inference of causative agents behind observed expression changes. However, due to their complex nature, signed causal graph models present special challenges with respect to assessing statistical significance. In this paper we frame and solve two fundamental computational problems that arise in practice when computing appropriate null distributions for hypothesis testing. RESULTS: First, we show how to compute a p-value for agreement between observed and model-predicted classifications of gene transcripts as upregulated, downregulated, or neither. Specifically, how likely are the classifications to agree to the same extent under the null distribution of the observed classification being randomized? This problem, which we call "Ternary Dot Product Distribution" owing to its mathematical form, can be viewed as a generalization of Fisher's exact test to ternary variables. We present two computationally efficient algorithms for computing the Ternary Dot Product Distribution and investigate its combinatorial structure analytically and numerically to establish computational complexity bounds. Second, we develop an algorithm for efficiently performing random sampling of causal graphs. This enables p-value computation under a different, equally important null distribution obtained by randomizing the graph topology but keeping fixed its basic structure: connectedness and the positive and negative in- and out-degrees of each vertex. We provide an algorithm for sampling a graph from this distribution uniformly at random. We also highlight theoretical challenges unique to signed causal graphs; previous work on graph randomization has studied undirected graphs and directed but unsigned graphs. CONCLUSION: We present algorithmic solutions to two statistical significance questions necessary to apply the causal graph methodology, a powerful tool for biological network analysis. The algorithms we present are both fast and provably correct. Our work may be of independent interest in non-biological contexts as well, as it generalizes mathematical results that have been studied extensively in other fields.
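
The first problem in this abstract can be made concrete with a small sketch. It encodes each transcript as +1 (upregulated), -1 (downregulated) or 0 (neither), scores agreement as the dot product of the predicted and observed vectors, and builds the null distribution by enumerating every permutation of the observed labels. The permutation null and the function names are assumptions made for illustration; the enumeration is factorial in the number of transcripts and is not one of the efficient algorithms the abstract refers to.

```python
from collections import Counter
from fractions import Fraction
from itertools import permutations


def dot(u, v):
    """Agreement score: matches add +1, opposite signs add -1, zeros add 0."""
    return sum(a * b for a, b in zip(u, v))


def ternary_dot_distribution(predicted, observed):
    """Exact null distribution of the score when observed labels are shuffled.

    Brute force over all orderings of `observed`; only feasible for
    roughly ten transcripts or fewer.
    """
    counts = Counter(dot(predicted, perm) for perm in permutations(observed))
    total = sum(counts.values())
    return {score: Fraction(n, total) for score, n in counts.items()}


def p_value(predicted, observed):
    """Probability under the null of a score at least as high as the one seen."""
    dist = ternary_dot_distribution(predicted, observed)
    seen = dot(predicted, observed)
    return sum(p for score, p in dist.items() if score >= seen)


# Toy example (hypothetical data): three transcripts, perfect agreement.
predicted = [1, -1, 0]
observed = [1, -1, 0]
dist = ternary_dot_distribution(predicted, observed)
p = p_value(predicted, observed)
```

Using `Fraction` keeps the probabilities exact, which makes it easy to check that the distribution sums to one before trusting a p-value derived from it.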


Subjects
Algorithms , Models, Biological , Animals , Chondrocytes/cytology , Chondrocytes/metabolism , Dexamethasone , Gene Expression Profiling , Hypoxia/drug therapy , Hypoxia/genetics , Hypoxia/metabolism , Mice , Oligonucleotide Array Sequence Analysis , Receptors, Glucocorticoid/metabolism , Statistical Distributions
20.
BMC Bioinformatics ; 13: 46, 2012 Mar 23.
Article in English | MEDLINE | ID: mdl-22443377

ABSTRACT

BACKGROUND: Identification of active causal regulators is a crucial problem in understanding disease mechanisms or finding drug targets. Methods that infer causal regulators directly from primary data have been proposed and successfully validated in some cases. These methods necessarily require very large sample sizes or a mix of different data types. Recent studies have shown that prior biological knowledge can successfully boost a method's ability to find regulators. RESULTS: We present a simple data-driven method, Correlation Set Analysis (CSA), for comprehensively detecting active regulators in disease populations by integrating co-expression analysis and a specific type of literature-derived causal relationships. Instead of investigating the co-expression level between regulators and their regulatees, we focus on the coherence of the regulatees of a regulator. Using simulated datasets we show that our method performs very well at recovering even weak regulatory relationships with a low false discovery rate. Using three separate real biological datasets we were able to recover both well-known and as yet undescribed active regulators for each disease population. The results are represented as a rank-ordered list of regulators and reveal both single and higher-order regulatory relationships. CONCLUSIONS: CSA is an intuitive data-driven way of selecting directed perturbation experiments that are relevant to a disease population of interest and represent a starting point for further investigation. Our findings demonstrate that combining co-expression analysis on regulatee sets with a literature-derived network can successfully identify causal regulators and help develop possible hypotheses to explain disease progression.
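
The idea of scoring a regulator by the coherence of its regulatees, rather than by its own co-expression with them, can be sketched as follows. The abstract does not state the exact statistic, so the score used here (mean pairwise Pearson correlation across regulatee expression profiles) is an assumption, as are the function names and the toy data. The sketch also ignores the sign of each causal relationship and does not compute a false discovery rate.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length, non-constant profiles."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    norm_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (norm_x * norm_y)


def coherence(expression, regulatees):
    """Hypothetical coherence score: mean correlation over all regulatee pairs.

    Requires at least two regulatees with expression data.
    """
    pairs = list(combinations(regulatees, 2))
    total = sum(pearson(expression[g], expression[h]) for g, h in pairs)
    return total / len(pairs)


def rank_regulators(expression, network):
    """Return (regulator, score) pairs, most coherent regulatee set first."""
    scores = {reg: coherence(expression, targets) for reg, targets in network.items()}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Toy data (hypothetical): expression of four genes across four samples,
# and a literature-derived network mapping regulators to their regulatees.
expression = {
    "g1": [1, 2, 3, 4],
    "g2": [2, 4, 6, 8],
    "g3": [1, 3, 2, 4],
    "g4": [4, 1, 3, 2],
}
network = {"R1": ["g1", "g2"], "R2": ["g3", "g4"]}
ranked = rank_regulators(expression, network)
```

In this toy case the regulatees of `R1` move together across samples while those of `R2` do not, so `R1` ranks first. Note that the expression of `R1` and `R2` themselves never enters the score.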


Subjects
Computational Biology/methods , Gene Regulatory Networks , Computer Simulation , Female , Humans , Lymphoma, B-Cell/genetics , Metabolic Diseases/genetics , Ovarian Neoplasms/genetics , Sample Size , Transcription, Genetic