Results 1 - 20 of 23
1.
Biotechnol Biofuels Bioprod ; 17(1): 20, 2024 Feb 07.
Article in English | MEDLINE | ID: mdl-38321504

ABSTRACT

BACKGROUND: Cost-effective production of biofuels from lignocellulose requires the fermentation of D-xylose. Many yeast species within and closely related to the genera Spathaspora and Scheffersomyces (both of the order Serinales) natively assimilate and ferment xylose. Other species consume xylose inefficiently, leading to extracellular accumulation of xylitol. Xylitol excretion is thought to be due to the different cofactor requirements of the first two steps of xylose metabolism. Xylose reductase (XR) generally uses NADPH to reduce xylose to xylitol, while xylitol dehydrogenase (XDH) generally uses NAD+ to oxidize xylitol to xylulose, creating an imbalanced redox pathway. This imbalance is thought to be particularly consequential in hypoxic or anoxic environments.
RESULTS: We screened the growth of xylose-fermenting yeast species in high and moderate aeration and identified both ethanol producers and xylitol producers. Selected species were further characterized for their XR and XDH cofactor preferences by enzyme assays and gene expression patterns by RNA-Seq. Our data revealed that xylose metabolism is more redox balanced in some species, but it is strongly affected by oxygen levels. Under high aeration, most species switched from ethanol production to xylitol accumulation, despite the availability of ample oxygen to accept electrons from NADH. This switch was followed by decreases in enzyme activity and the expression of genes related to xylose metabolism, suggesting that bottlenecks in xylose fermentation are not always due to cofactor preferences. Finally, we expressed XYL genes from multiple Scheffersomyces species in a strain of Saccharomyces cerevisiae. Recombinant S. cerevisiae expressing XYL1 from Scheffersomyces xylosifermentans, which encodes an XR without a cofactor preference, showed improved anaerobic growth on xylose as the primary carbon source compared to an S. cerevisiae strain expressing XYL genes from Scheffersomyces stipitis.
CONCLUSION: Collectively, our data do not support the hypothesis that xylitol accumulation occurs primarily due to differences in cofactor preferences between xylose reductase and xylitol dehydrogenase; instead, gene expression plays a major role in response to oxygen levels. We have also identified the yeast Sc. xylosifermentans as a potential source for genes that can be engineered into S. cerevisiae to improve xylose fermentation and biofuel production.

2.
Mol Phylogenet Evol ; 189: 107938, 2023 Dec.
Article in English | MEDLINE | ID: mdl-37820761

ABSTRACT

The order Sordariales is taxonomically diverse, and harbours many species with different lifestyles and great economic importance. Despite its importance, a robust genome-scale phylogeny and associated comparative genomic analysis of the order are lacking. In this study, we examined whole-genome data from 99 Sordariales, including 52 newly sequenced genomes, and seven outgroup taxa. We inferred a comprehensive phylogeny that resolved several contentious relationships amongst families in the order, and cleared up intrafamily relationships within the Podosporaceae. Extensive comparative genomics showed that genomes from the three largest families in the dataset (Chaetomiaceae, Podosporaceae and Sordariaceae) differ greatly in GC content, genome size, gene number, repeat percentage, evolutionary rate, and genome content affected by repeat-induced point mutations (RIP). All genomic traits showed phylogenetic signal, and ancestral state reconstruction revealed that the variation of the properties stems primarily from within-family evolution. Together, the results provide a thorough framework for understanding genome evolution in this important group of fungi.


Subjects
Genomics , Sordariales , Humans , Phylogeny , Genomics/methods , Genome , Sordariales/genetics , Base Sequence , Evolution, Molecular
3.
Nat Microbiol ; 8(9): 1668-1681, 2023 09.
Article in English | MEDLINE | ID: mdl-37550506

ABSTRACT

The fungal genus Armillaria contains necrotrophic pathogens and some of the largest terrestrial organisms that cause tremendous losses in diverse ecosystems, yet how they evolved pathogenicity in a clade of dominantly non-pathogenic wood degraders remains elusive. Here we show that Armillaria species, in addition to gene duplications and de novo gene origins, acquired at least 1,025 genes via 124 horizontal gene transfer events, primarily from Ascomycota. Horizontal gene transfer might have affected plant biomass degrading and virulence abilities of Armillaria, and provides an explanation for their unusual, soft rot-like wood decay strategy. Combined multi-species expression data revealed extensive regulation of horizontally acquired and wood-decay related genes, putative virulence factors and two novel conserved pathogenicity-induced small secreted proteins, which induced necrosis in planta. Overall, this study details how evolution knitted together horizontally and vertically inherited genes in complex adaptive traits of plant biomass degradation and pathogenicity in important fungal pathogens.


Subjects
Armillaria , Armillaria/genetics , Armillaria/metabolism , Biomass , Gene Transfer, Horizontal , Ecosystem , Plants
4.
New Phytol ; 236(3): 1154-1167, 2022 11.
Article in English | MEDLINE | ID: mdl-35898177

ABSTRACT

Wildfires drastically impact the soil environment, altering the soil organic matter, forming pyrolyzed compounds, and markedly reducing the diversity of microorganisms. Pyrophilous fungi, especially the species from the orders Pezizales and Agaricales, are fire-responsive fungal colonizers of post-fire soil that have historically been found fruiting on burned soil and thus may encode mechanisms of processing these compounds in their genomes. Pyrophilous fungi are diverse. In this work, we explored this diversity and sequenced six new genomes of pyrophilous Pezizales fungi isolated after the 2013 Rim Fire near Yosemite Park in California, USA: Pyronema domesticum, Pyronema omphalodes, Tricharina praecox, Geopyxis carbonaria, Morchella snyderi, and Peziza echinospora. A comparative genomics analysis revealed the enrichment of gene families involved in responses to stress and the degradation of pyrolyzed organic matter. In addition, we found that both protein sequence lengths and G + C content in the third base of codons (GC3) in pyrophilous fungi fall between those in mesophilic/nonpyrophilous and thermophilic fungi. A comparative transcriptome analysis of P. domesticum under two conditions - growing on charcoal, and during sexual development - identified modules of genes that are co-expressed in the charcoal and light-induced sexual development conditions. In addition, environmental sensors such as transcription factors STE12, LreA, LreB, VosA, and EsdC were upregulated in the charcoal condition. Taken together, these results highlight genomic adaptations of pyrophilous fungi and indicate a potential connection between charcoal tolerance and fruiting body formation in P. domesticum.


Subjects
Charcoal , Genomics , Fungi , Sexual Development , Soil , Transcription Factors
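
The GC3 statistic mentioned in the abstract above (G + C content at the third base of codons) can be illustrated with a minimal sketch. The function name `gc3_content` and the toy sequence are assumptions made for illustration only; the abstract does not describe how the authors computed the value.

```python
def gc3_content(cds: str) -> float:
    """Fraction of codons whose third base is G or C.

    Assumes `cds` is an in-frame coding sequence. Trailing bases that do
    not complete a codon are ignored.
    """
    seq = cds.upper()
    usable = len(seq) - len(seq) % 3      # drop an incomplete final codon
    third_bases = seq[2:usable:3]         # every third base, starting at index 2
    if not third_bases:
        raise ValueError("sequence is shorter than one codon")
    gc = sum(base in "GC" for base in third_bases)
    return gc / len(third_bases)


# Hypothetical toy sequence with codons ATG GCC AAA TGA.
toy_gc3 = gc3_content("ATGGCCAAATGA")
print(toy_gc3)
```

Averaging this value over all coding sequences of a genome would give one genome-level GC3 figure that could be compared between groups of fungi.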
5.
New Phytol ; 233(3): 1317-1330, 2022 02.
Article in English | MEDLINE | ID: mdl-34797921

ABSTRACT

Although secondary metabolites are typically associated with competitive or pathogenic interactions, the high bioactivity of endophytic fungi in the Xylariales, coupled with their abundance and broad host ranges spanning all lineages of land plants and lichens, suggests that enhanced secondary metabolism might facilitate symbioses with phylogenetically diverse hosts. Here, we examined secondary metabolite gene clusters (SMGCs) across 96 Xylariales genomes in two clades (Xylariaceae s.l. and Hypoxylaceae), including 88 newly sequenced genomes of endophytes and closely related saprotrophs and pathogens. We paired genomic data with extensive metadata on endophyte hosts and substrates, enabling us to examine genomic factors related to the breadth of symbiotic interactions and ecological roles. All genomes contain hyperabundant SMGCs; however, Xylariaceae have increased numbers of gene duplications, horizontal gene transfers (HGTs) and SMGCs. Enhanced metabolic diversity of endophytes is associated with a greater diversity of hosts and increased capacity for lignocellulose decomposition. Our results suggest that, as host and substrate generalists, Xylariaceae endophytes experience greater selection to diversify SMGCs compared with more ecologically specialised Hypoxylaceae species. Overall, our results provide new evidence that SMGCs may facilitate symbiosis with phylogenetically diverse hosts, highlighting the importance of microbial symbioses to drive fungal metabolic diversity.


Subjects
Lichens , Xylariales , Endophytes , Fungi , Lichens/microbiology , Multigene Family , Symbiosis/genetics
6.
Nat Commun ; 11(1): 5125, 2020 10 12.
Article in English | MEDLINE | ID: mdl-33046698

ABSTRACT

Mycorrhizal fungi are mutualists that play crucial roles in nutrient acquisition in terrestrial ecosystems. Mycorrhizal symbioses arose repeatedly across multiple lineages of Mucoromycotina, Ascomycota, and Basidiomycota. Considerable variation exists in the capacity of mycorrhizal fungi to acquire carbon from soil organic matter. Here, we present a combined analysis of 135 fungal genomes from 73 saprotrophic, endophytic and pathogenic species, and 62 mycorrhizal species, including 29 new mycorrhizal genomes. This study samples ecologically dominant fungal guilds for which there were previously no symbiotic genomes available, including ectomycorrhizal Russulales, Thelephorales and Cantharellales. Our analyses show that transitions from saprotrophy to symbiosis involve (1) widespread losses of degrading enzymes acting on lignin and cellulose, (2) co-option of genes present in saprotrophic ancestors to fulfill new symbiotic functions, (3) diversification of novel, lineage-specific symbiosis-induced genes, (4) proliferation of transposable elements and (5) divergent genetic innovations underlying the convergent origins of the ectomycorrhizal guild.


Subjects
Fungi/genetics , Genome, Fungal , Mycorrhizae/genetics , Symbiosis , Ecosystem , Evolution, Molecular , Fungal Proteins/genetics , Fungi/classification , Fungi/physiology , Mycorrhizae/classification , Mycorrhizae/physiology , Phylogeny , Plant Physiological Phenomena , Plants/microbiology
7.
ISME J ; 14(4): 881-895, 2020 04.
Article in English | MEDLINE | ID: mdl-31896786

ABSTRACT

Ocean viruses are abundant and infect 20-40% of surface microbes. Infected cells, termed virocells, are thus a predominant microbial state. Yet, virocells and their ecosystem impacts are understudied, thus precluding their incorporation into ecosystem models. Here we investigated how unrelated bacterial viruses (phages) reprogram one host into contrasting virocells with different potential ecosystem footprints. We independently infected the marine Pseudoalteromonas bacterium with siphovirus PSA-HS2 and podovirus PSA-HP1. Time-resolved multi-omics unveiled drastically different metabolic reprogramming and resource requirements by each virocell, which were related to phage-host genomic complementarity and viral fitness. Namely, HS2 was more complementary to the host in nucleotides and amino acids, and fitter during infection than HP1. Functionally, HS2 virocells hardly differed from uninfected cells, with minimal host metabolism impacts. HS2 virocells repressed energy-consuming metabolisms, including motility and translation. Contrastingly, HP1 virocells substantially differed from uninfected cells. They repressed host transcription, responded to infection continuously, and drastically reprogrammed resource acquisition, central carbon and energy metabolisms. Ecologically, this work suggests that one cell, infected versus uninfected, can have immensely different metabolisms that affect the ecosystem differently. Finally, we relate phage-host genome complementarity, virocell metabolic reprogramming, and viral fitness in a conceptual model to guide incorporating viruses into ecosystem models.


Subjects
Bacteriophages/physiology , Pseudoalteromonas/virology , Bacteriophages/genetics , Ecology , Ecosystem , Environmental Microbiology , Viruses/genetics
8.
Appl Microbiol Biotechnol ; 103(19): 8145-8155, 2019 Oct.
Article in English | MEDLINE | ID: mdl-31482283

ABSTRACT

The environmental accumulation of polycyclic aromatic hydrocarbons (PAHs) is of great concern due to potential carcinogenic and mutagenic risks, as well as their resistance to remediation. While many fungi have been reported to break down PAHs in the environment, the details of gene-based metabolic pathways are not yet comprehensively understood. Specifically, the genome-scale transcriptional responses of fungal PAH degradation have rarely been reported. In this study, we report the genomic and transcriptomic basis of PAH bioremediation by a potent fungal degrader, Dentipellis sp. KUC8613. The genome of this fungus was 36.71 Mbp long and encoded 14,320 putative protein-coding genes. The strain efficiently removed more than 90% of PAHs supplied at a concentration of 100 mg/l within 10 days. The genomic and transcriptomic analysis of this white rot fungus highlights that the strain primarily utilized non-ligninolytic enzymes to remove various PAHs, rather than typical ligninolytic enzymes known for playing important roles in PAH degradation. PAH removal by non-ligninolytic enzymes was initiated by both PAH-specific and common upregulation of P450s, followed by downstream PAH-transforming enzymes such as epoxide hydrolases, dehydrogenases, FAD-dependent monooxygenases, dioxygenases, and glycosyl- or glutathione transferases. Among the various PAHs, phenanthrene induced a more dynamic transcriptomic response, possibly due to its greater cytotoxicity, leading to highly upregulated genes involved in the translocation of PAHs, a defense system against reactive oxygen species, and ATP synthesis. Our genomic and transcriptomic data provide a foundation for understanding the mycoremediation of PAHs and the application of this strain in polluted environments.


Subjects
Basidiomycota/genetics , Basidiomycota/metabolism , Gene Expression Profiling , Genomics , Metabolic Networks and Pathways/genetics , Polycyclic Aromatic Hydrocarbons/metabolism , Biotransformation
9.
Microbiol Resour Announc ; 8(18), 2019 May 02.
Article in English | MEDLINE | ID: mdl-31048399

ABSTRACT

Here, we report the draft genome sequences of three isolates of the wood-decaying white-rot basidiomycete fungus Dichomitus squalens. The genomes of these monokaryons were sequenced to provide more information on the intraspecies genomic diversity of this fungus and were compared to the previously sequenced genome of D. squalens LYAD-421 SS1.

10.
Nat Microbiol ; 3(12): 1417-1428, 2018 12.
Article in English | MEDLINE | ID: mdl-30297742

ABSTRACT

Environmental DNA surveys reveal that most fungal diversity represents uncultured species. We sequenced the genomes of eight uncultured species across the fungal tree of life using a new single-cell genomics pipeline. We show that, despite a large variation in genome and gene space recovery from each single amplified genome (SAG), ≥90% can be recovered by combining multiple SAGs. SAGs provide robust placement for early-diverging lineages and infer a diploid ancestor of fungi. Early-diverging fungi share metabolic deficiencies and show unique gene expansions correlated with parasitism and unculturability. Single-cell genomics holds great promise in exploring fungal diversity, life cycles and metabolic potential.


Subjects
Fungi/genetics , Fungi/metabolism , Genome, Fungal , Genomics , Biodiversity , DNA, Ribosomal/genetics , Fungi/classification , Fungi/enzymology , Genetic Variation , Heterozygote , Life Cycle Stages , Metabolic Networks and Pathways/genetics , Metabolic Networks and Pathways/physiology , Phylogeny , Polymorphism, Genetic , RNA, Ribosomal, 18S/genetics , Secondary Metabolism/genetics , Secondary Metabolism/physiology , Sequence Analysis, DNA , Symbiosis/genetics , Symbiosis/physiology
11.
Nat Genet ; 49(6): 964-968, 2017 Jun.
Article in English | MEDLINE | ID: mdl-28481340

ABSTRACT

N6-methyldeoxyadenine (6mA) is a noncanonical DNA base modification present at low levels in plant and animal genomes, but its prevalence and association with genome function in other eukaryotic lineages remain poorly understood. Here we report that abundant 6mA is associated with transcriptionally active genes in early-diverging fungal lineages. Using single-molecule long-read sequencing of 16 diverse fungal genomes, we observed that up to 2.8% of all adenines were methylated in early-diverging fungi, far exceeding levels observed in other eukaryotes and more derived fungi. 6mA occurred symmetrically at ApT dinucleotides and was concentrated in dense methylated adenine clusters surrounding the transcriptional start sites of expressed genes; its distribution was inversely correlated with that of 5-methylcytosine. Our results show a striking contrast in the genomic distributions of 6mA and 5-methylcytosine and reinforce a distinct role for 6mA as a gene-expression-associated epigenomic mark in eukaryotes.


Subjects
Adenine/metabolism , DNA Methylation , Fungi/genetics , 5-Methylcytosine/metabolism , Epigenesis, Genetic , Gene Expression Regulation, Fungal , Genome, Fungal , Phylogeny , Transcription Initiation Site
12.
Sci Data ; 3: 160081, 2016 Sep 27.
Article in English | MEDLINE | ID: mdl-27673566

ABSTRACT

Generating sequence data of a defined community composed of organisms with complete reference genomes is indispensable for the benchmarking of new genome sequence analysis methods, including assembly and binning tools. Moreover, the validation of new sequencing library protocols and platforms to assess critical components such as sequencing errors and biases relies on such datasets. Here we report the next-generation metagenomic sequence data of a defined mock community (Mock Bacteria ARchaea Community; MBARC-26), composed of 23 bacterial and 3 archaeal strains with finished genomes. These strains span 10 phyla and 14 classes, cover a range of GC contents, genome sizes and repeat contents, and encompass a diverse abundance profile. Short-read Illumina and long-read PacBio SMRT sequences of this mock community are described. These data represent a valuable resource for the scientific community, enabling extensive benchmarking and comparative evaluation of bioinformatics tools without the need to simulate data. As such, these data can aid in improving our current sequence data analysis toolkit and spur interest in the development of new tools.

13.
Cancer Inform ; 11: 61-75, 2012.
Article in English | MEDLINE | ID: mdl-22570537

ABSTRACT

Gene expression profiling has provided insights into different cancer types and revealed tissue-specific expression signatures. Alterations in microRNA expression contribute to the pathogenesis of many types of human diseases. Few studies have integrated all levels of gene expression, miRNA and methylation to uncover correlations between these data types. We performed an integrated profiling to discover instances of miRNAs associated with a gene expression and DNA methylation signature across multiple cancer types. Using data from The Cancer Genome Atlas (TCGA), we revealed a concordant gene expression and methylation signature associated with the microRNA hsa-miR-142 across the same samples. In all cancer types examined, we found a signature of co-expression of a gene set R and methylated sites M, which correlate positively (M+) or negatively (M-) with the expression of hsa-miR-142. The set R consistently contains many genes, such as TRAF3IP3, NCKAP1L, CD53, LAPTM5, PTPRC, EVI2B, DOCK2, LCP2, CYBB and FYB. The signature is preserved across glioblastoma, ovarian, breast, colon, kidney, lung, uterine and rectum cancer. There is 28% overlap of methylation sites in M between glioblastoma (GBM) and ovarian cancer. There is 60% overlap of genes in R between GBM and ovarian cancer (P = 1.3e-11). Most of the genes in R are known to be expressed in lymphocytes and haematopoietic stem cells, while M reflects membrane proteins involved in cell-cell adhesion functions. We speculate that the hsa-miR-142 associated signature may signal haematopoietic-specific processes and an accumulation of methylation events triggering a progressive loss of cell-cell adhesion. We also observed that GBM samples belonging to the proneural subtype tend to have underexpressed hsa-miR-142 and R genes, hypomethylated M+ and hypermethylated M-, while the mesenchymal samples have the opposite profile.
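
The abstract above reports the overlap between two gene sets together with a P value but does not name the statistical test. One common way to assess such an overlap is the upper tail of the hypergeometric distribution. The sketch below is written under that assumption; the function name and the toy numbers are hypothetical and are not taken from the study.

```python
from math import comb


def hypergeom_overlap_pvalue(universe: int, size_a: int, size_b: int, overlap: int) -> float:
    """Probability of seeing at least `overlap` shared genes by chance.

    Models drawing `size_b` genes at random from a universe of `universe`
    genes, of which `size_a` belong to the first set.
    """
    max_overlap = min(size_a, size_b)
    total = comb(universe, size_b)
    tail = sum(
        comb(size_a, k) * comb(universe - size_a, size_b - k)
        for k in range(overlap, max_overlap + 1)
    )
    return tail / total


# Hypothetical toy numbers: 10 genes in total, sets of 4 and 3 genes, 2 shared.
toy_p = hypergeom_overlap_pvalue(universe=10, size_a=4, size_b=3, overlap=2)
print(toy_p)
```

The choice of gene universe strongly affects the result, so it should be stated alongside any P value computed this way.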

14.
Algorithms Mol Biol ; 6(1): 16, 2011 Jun 06.
Article in English | MEDLINE | ID: mdl-21645400

ABSTRACT

BACKGROUND: Single-molecule force spectroscopy (SMFS) is a technique that measures the force necessary to unfold a protein. SMFS experiments generate Force-Distance (F-D) curves. A statistical analysis of a set of F-D curves reveals different unfolding pathways. Information on protein structure, conformation, functional states, and inter- and intra-molecular interactions can be derived.
RESULTS: In the present work, we propose a pattern recognition algorithm and apply our algorithm to datasets from SMFS experiments on the membrane protein bacteriorhodopsin (bR). We discuss the unfolding pathways found in bR, which are characterised by main peaks and side peaks. A main peak is the result of the pairwise unfolding of the transmembrane helices. In contrast, a side peak is an unfolding event in the alpha-helix or other secondary structural element. The algorithm is capable of detecting side peaks along with main peaks. Therefore, we can detect the individual unfolding pathway as the sequence of events labeled with their occurrences and co-occurrences special to bR's unfolding pathway. We find that side peaks do not co-occur with one another in curves as frequently as main peaks do, which may imply a synergistic effect occurring between helices. While main peaks co-occur as pairs in at least 50% of curves, the side peaks co-occur with one another in less than 10% of curves. Moreover, the algorithm runtime scales well as the dataset size increases.
CONCLUSIONS: Our algorithm satisfies the requirements of an automated methodology that combines high accuracy with efficiency in analyzing SMFS datasets. The algorithm tackles the force spectroscopy analysis bottleneck leading to more consistent and reproducible results.
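
The co-occurrence figures quoted in the abstract above (pairs of peaks appearing together in a given share of curves) can be illustrated with a small sketch. It assumes peaks have already been detected and labelled in each F-D curve; the labels, the toy data and the function name are hypothetical, and this is not the authors' pattern recognition algorithm.

```python
from collections import Counter
from itertools import combinations


def pair_cooccurrence(curves):
    """Fraction of curves in which each unordered pair of peak labels co-occurs.

    `curves` is a list of collections of peak labels, one per F-D curve.
    """
    if not curves:
        return {}
    counts = Counter()
    for peaks in curves:
        # Each pair is counted at most once per curve.
        for pair in combinations(sorted(set(peaks)), 2):
            counts[pair] += 1
    return {pair: n / len(curves) for pair, n in counts.items()}


# Hypothetical labels: M* for main peaks, S* for side peaks.
toy_curves = [{"M1", "M2", "S1"}, {"M1", "M2"}, {"M1", "S2"}, {"M2", "S1"}]
toy_fractions = pair_cooccurrence(toy_curves)
print(toy_fractions[("M1", "M2")])
```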

15.
BMC Bioinformatics ; 10: 196, 2009 Jun 27.
Article in English | MEDLINE | ID: mdl-19558694

ABSTRACT

BACKGROUND: Many high-throughput studies produce protein-protein interaction networks (PPINs) with many errors and missing information. Even for genome-wide approaches, there is often a low overlap between PPINs produced by different studies. Second-level neighbors separated by two protein-protein interactions (PPIs) were previously used for predicting protein function and finding complexes in high-error PPINs. We retrieve second-level neighbors in PPINs, and complement these with structural domain-domain interactions (SDDIs) representing binding evidence on proteins, forming PPI-SDDI-PPI triangles.
RESULTS: We find low overlap between PPINs, SDDIs and known complexes, all well below 10%. We evaluate the overlap of PPI-SDDI-PPI triangles with known complexes from the Munich Information Center for Protein Sequences (MIPS). PPI-SDDI-PPI triangles have ~20 times higher overlap with MIPS complexes than using second-level neighbors in PPINs without SDDIs. The biological interpretation for triangles is that an SDDI causes two proteins to be observed with common interaction partners in high-throughput experiments. The relatively few SDDIs overlapping with PPINs are part of highly connected SDDI components, and are more likely to be detected in experimental studies. We demonstrate the utility of PPI-SDDI-PPI triangles by reconstructing myosin-actin processes in the nucleus, cytoplasm, and cytoskeleton, which were not obvious in the original PPIN. Using other complementary datatypes in place of SDDIs to form triangles, such as PubMed co-occurrences or threading information, results in a similar ability to find protein complexes.
CONCLUSION: Given high-error PPINs with missing information, triangles of mixed datatypes are a promising direction for finding protein complexes. Integrating PPINs with SDDIs improves finding complexes. Structural SDDIs partially explain the high functional similarity of second-level neighbors in PPINs. We estimate that relatively little structural information would be sufficient for finding complexes involving most of the proteins and interactions in a typical PPIN.


Subjects
Computational Biology/methods , Protein Interaction Mapping/methods , Proteins/chemistry , Binding Sites , Databases, Protein , Proteins/metabolism
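
The PPI-SDDI-PPI triangle described in the abstract above (two proteins that share a PPI partner and are themselves linked by an SDDI) can be sketched as a simple graph query. This is a minimal illustration under the assumption that both networks are given as undirected edge lists; the function name and the protein identifiers are hypothetical, and the authors' implementation is not described in the abstract.

```python
from itertools import combinations


def ppi_sddi_ppi_triangles(ppi_edges, sddi_edges):
    """Return sorted (a, shared_partner, b) triples.

    a and b are second-level neighbours in the PPI network (both interact
    with shared_partner) and are also connected by an SDDI edge.
    """
    neighbours = {}
    for u, v in ppi_edges:
        neighbours.setdefault(u, set()).add(v)
        neighbours.setdefault(v, set()).add(u)
    sddi = {frozenset(edge) for edge in sddi_edges}

    triangles = set()
    for hub, partners in neighbours.items():
        for a, b in combinations(sorted(partners), 2):
            if frozenset((a, b)) in sddi:
                triangles.add((a, hub, b))
    return sorted(triangles)


# Hypothetical toy network: P2, P3 and P4 all interact with P1.
toy_ppi = [("P1", "P2"), ("P1", "P3"), ("P1", "P4")]
toy_sddi = [("P3", "P2")]  # only P2 and P3 have structural binding evidence
toy_triangles = ppi_sddi_ppi_triangles(toy_ppi, toy_sddi)
print(toy_triangles)
```

Swapping `sddi_edges` for another evidence type, such as literature co-occurrence pairs, gives the mixed-datatype triangles mentioned in the conclusion.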
16.
Brief Bioinform ; 10(3): 297-314, 2009 May.
Article in English | MEDLINE | ID: mdl-19240124

ABSTRACT

Clustering is ubiquitously applied in bioinformatics with hierarchical clustering and k-means partitioning being the most popular methods. Numerous improvements of these two clustering methods have been introduced, as well as completely different approaches such as grid-based, density-based and model-based clustering. For improved bioinformatics analysis of data, it is important to match clusterings to the requirements of a biomedical application. In this article, we present a set of desirable clustering features that are used as evaluation criteria for clustering algorithms. We review 40 different clustering algorithms of all approaches and datatypes. We compare algorithms on the basis of desirable clustering features, and outline algorithms' benefits and drawbacks as a basis for matching them to biomedical applications.


Subjects
Algorithms , Cluster Analysis , Computational Biology , Gene Expression Profiling/methods , Information Storage and Retrieval , Models, Statistical , Protein Interaction Mapping
17.
BMC Bioinformatics ; 10: 28, 2009 Jan 21.
Article in English | MEDLINE | ID: mdl-19159460

ABSTRACT

BACKGROUND: Ontology term labels can be ambiguous and have multiple senses. While this is no problem for human annotators, it is a challenge to automated methods, which identify ontology terms in text. Classical approaches to word sense disambiguation use co-occurring words or terms. However, most treat ontologies as simple terminologies, without making use of the ontology structure or the semantic similarity between terms. Another useful source of information for disambiguation is metadata. Here, we systematically compare three approaches to word sense disambiguation, which use ontologies and metadata, respectively.
RESULTS: The 'Closest Sense' method assumes that the ontology defines multiple senses of the term. It computes the shortest path of co-occurring terms in the document to one of these senses. The 'Term Cooc' method defines a log-odds ratio for co-occurring terms including co-occurrences inferred from the ontology structure. The 'MetaData' approach trains a classifier on metadata. It does not require any ontology, but requires training data, which the other methods do not. To evaluate these approaches we defined a manually curated training corpus of 2600 documents for seven ambiguous terms from the Gene Ontology and MeSH. All approaches over all conditions achieve 80% success rate on average. The 'MetaData' approach performed best with 96%, when trained on high-quality data. Its performance deteriorates as quality of the training data decreases. The 'Term Cooc' approach performs better on Gene Ontology (92% success) than on MeSH (73% success) as MeSH is not a strict is-a/part-of, but rather a loose is-related-to hierarchy. The 'Closest Sense' approach achieves on average 80% success rate.
CONCLUSION: Metadata is valuable for disambiguation, but requires high-quality training data. Closest Sense requires no training, but a large, consistently modelled ontology, which are two opposing conditions. Term Cooc achieves greater than 90% success given a consistently modelled ontology. Overall, the results show that well-structured ontologies can play a very important role in improving disambiguation.
AVAILABILITY: The three benchmark datasets created for the purpose of disambiguation are available in Additional file 1.


Subjects
Computational Biology/methods , Vocabulary, Controlled , Algorithms , Information Storage and Retrieval , Medical Informatics/methods , Medical Subject Headings , Pattern Recognition, Automated , Unified Medical Language System
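
The abstract above says the 'Term Cooc' method defines a log-odds ratio for co-occurring terms but does not give the formula. The sketch below shows one generic way to score a document with smoothed log-odds. The smoothing scheme, function names and toy documents are all assumptions for illustration, and the ontology-inferred co-occurrences used in the paper are not modelled here.

```python
from collections import Counter
from math import log


def log_odds_table(docs_sense, docs_other, pseudocount=1.0):
    """Smoothed log-odds of each term occurring in documents of the target
    sense versus documents of any other sense."""
    count_sense = Counter(t for doc in docs_sense for t in set(doc))
    count_other = Counter(t for doc in docs_other for t in set(doc))
    n_sense, n_other = len(docs_sense), len(docs_other)
    table = {}
    for term in set(count_sense) | set(count_other):
        p_sense = (count_sense[term] + pseudocount) / (n_sense + 2 * pseudocount)
        p_other = (count_other[term] + pseudocount) / (n_other + 2 * pseudocount)
        table[term] = log(p_sense / p_other)
    return table


def score(document, table):
    """Sum the log-odds of known terms; a positive total favours the target sense."""
    return sum(table.get(term, 0.0) for term in set(document))


# Hypothetical training documents for the ambiguous term 'development'.
biology_docs = [["embryo", "cell"], ["embryo", "gene"]]
other_docs = [["software", "cell"], ["software", "tool"]]
toy_table = log_odds_table(biology_docs, other_docs)
print(score(["embryo", "cell"], toy_table) > 0)
```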
18.
J Biomed Inform ; 42(2): 365-76, 2009 Apr.
Article in English | MEDLINE | ID: mdl-19111944

ABSTRACT

A challenge involved in applying density-based clustering to categorical biomedical data is that the "cube" of attribute values has no ordering defined, making the search for dense subspaces slow. We propose the HIERDENC algorithm for hierarchical density-based clustering of categorical data, and a complementary index for searching for dense subspaces efficiently. The HIERDENC index is updated when new objects are introduced, such that clustering does not need to be repeated on all objects. The updating and cluster retrieval are efficient. Comparisons with several other clustering algorithms showed that on large datasets HIERDENC achieved better runtime scalability with the number of objects, as well as better cluster quality. By rapidly collapsing the bicliques in large networks we achieved an edge reduction of as much as 86.5%. HIERDENC is suitable for large and quickly growing datasets, since it is independent of object ordering, does not require re-clustering when new data emerge, and requires no user-specified input parameters.


Subjects
Cluster Analysis , Computational Biology/methods , Databases as Topic , Algorithms
19.
Int J Data Min Bioinform ; 2(3): 193-215, 2008.
Article in English | MEDLINE | ID: mdl-19024494

ABSTRACT

With more and more genomes being sequenced, a lot of effort is devoted to their annotation with terms from controlled vocabularies such as the Gene Ontology. Manual annotation based on relevant literature is tedious, but automation of this process is difficult. One particularly challenging problem is word sense disambiguation. Terms such as 'development' can refer to developmental biology or to the more general sense. Here, we present two approaches to address this problem by using term co-occurrences and document clustering. To evaluate our method we defined a corpus of 331 documents on development and developmental biology. Term co-occurrence analysis achieves an F-measure of 77%. Additionally, applying document clustering improves precision to 82%. We applied the same approach to disambiguate 'nucleus', 'transport', and 'spindle', and we achieved consistent results. Thus, our method is a viable approach towards the automation of literature-based genome annotation.


Subjects
Artificial Intelligence , Cluster Analysis , Documentation/methods , Information Storage and Retrieval/methods , Natural Language Processing , Semantics , Terminology as Topic
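
The abstract above reports results as precision and F-measure. As a reminder of how these quantities relate, the sketch below computes precision, recall and the balanced F-measure (F1) from classification counts. The counts are hypothetical and are not taken from the study, which does not report them in the abstract.

```python
def precision_recall_f1(true_positives: int, false_positives: int, false_negatives: int):
    """Return (precision, recall, F1) for one class from raw counts."""
    predicted = true_positives + false_positives
    actual = true_positives + false_negatives
    precision = true_positives / predicted if predicted else 0.0
    recall = true_positives / actual if actual else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    # F1 is the harmonic mean of precision and recall.
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical counts: 8 correct assignments, 2 wrong ones, 2 missed ones.
toy_precision, toy_recall, toy_f1 = precision_recall_f1(8, 2, 2)
print(round(toy_f1, 3))
```

Because F1 is a harmonic mean, a gain in precision raises it only if recall does not fall by a comparable amount.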
20.
PLoS Comput Biol ; 4(7): e1000108, 2008 Jul 11.
Article in English | MEDLINE | ID: mdl-18617988

ABSTRACT

Networks play a crucial role in computational biology, yet their analysis and representation remain an open problem. Power Graph Analysis is a lossless transformation of biological networks into a compact, less redundant representation, exploiting the abundance of cliques and bicliques as elementary topological motifs. We demonstrate with five examples the advantages of Power Graph Analysis. Investigating protein-protein interaction networks, we show how the catalytic subunits of the casein kinase II complex are distinguishable from the regulatory subunits, how interaction profiles and sequence phylogeny of SH3 domains correlate, and how false positive interactions among high-throughput interactions are spotted. Additionally, we demonstrate the generality of Power Graph Analysis by applying it to two other types of networks. We show how power graphs induce a clustering of both transcription factors and target genes in bipartite transcription networks, and how the erosion of a phosphatase domain in type 22 non-receptor tyrosine phosphatases is detected. We apply Power Graph Analysis to high-throughput protein interaction networks and show that up to 85% (56% on average) of the information is redundant. Experimental networks are more compressible than rewired ones of the same degree distribution, indicating that experimental networks are rich in cliques and bicliques. Power Graphs are a novel representation of networks, which reduces network complexity by explicitly representing re-occurring network motifs. Power Graphs compress up to 85% of the edges in protein interaction networks and are applicable to all types of networks such as protein interactions, regulatory networks, or homology networks.


Subjects
Computational Biology/methods , Models, Biological , Neural Networks, Computer , Amino Acid Motifs/physiology , Animals , Binding Sites/physiology , Casein Kinase II/chemistry , Casein Kinase II/metabolism , Catalytic Domain , Cluster Analysis , Computer Simulation , Data Compression/methods , Evolution, Molecular , Humans , Protein Binding/genetics , Protein Interaction Mapping/methods , Protein Tyrosine Phosphatase, Non-Receptor Type 22/metabolism , Proteins/chemistry , Proteins/genetics , Sequence Analysis, Protein/methods , Structural Homology, Protein , Transcription Factors/physiology , src Homology Domains/genetics
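
The idea behind the edge compression described in the abstract above is that a biclique (every node of one set connected to every node of another) can be represented by a single edge between two node sets. The sketch below compresses one known biclique and reports the resulting edge reduction. It is only an illustration of that idea with hypothetical names and toy data, assuming the two node sets are disjoint. It does not find bicliques, and it is not the Power Graph Analysis algorithm itself.

```python
from itertools import product


def is_biclique(edges, left, right):
    """True if every node in `left` is connected to every node in `right`."""
    undirected = {frozenset(e) for e in edges}
    return all(frozenset((u, v)) in undirected for u, v in product(left, right))


def compress_biclique(edges, left, right):
    """Replace the len(left) * len(right) edges of one biclique by a single
    power edge between the two node sets; other edges are kept unchanged."""
    if not is_biclique(edges, left, right):
        raise ValueError("left x right is not fully connected")
    covered = {frozenset((u, v)) for u, v in product(left, right)}
    remaining = [e for e in edges if frozenset(e) not in covered]
    power_edge = (frozenset(left), frozenset(right))
    return remaining, [power_edge]


def edge_reduction(n_original, n_compressed):
    """Share of edges saved by the compressed representation."""
    return 1 - n_compressed / n_original


# Hypothetical toy network: a 3 x 4 biclique plus two ordinary edges.
left, right = ["a1", "a2", "a3"], ["b1", "b2", "b3", "b4"]
toy_edges = list(product(left, right)) + [("b4", "c1"), ("c1", "c2")]
toy_remaining, toy_power = compress_biclique(toy_edges, left, right)
toy_reduction = edge_reduction(len(toy_edges), len(toy_remaining) + len(toy_power))
print(round(toy_reduction, 3))
```

The representation stays lossless because the original edges can be recovered by expanding each power edge back into the full product of its two node sets.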