Results 1 - 20 of 25
1.
BMC Bioinformatics ; 25(1): 181, 2024 May 08.
Article in English | MEDLINE | ID: mdl-38720247

ABSTRACT

BACKGROUND: RNA sequencing combined with machine learning techniques has provided a modern approach to the molecular classification of cancer. Class predictors, reflecting the disease class, can be constructed for known tissue types using the gene expression measurements extracted from cancer patients. One challenge of current cancer predictors is that they often have suboptimal performance estimates when integrating molecular datasets generated from different labs. Often, the data are of variable quality, are procured differently, and contain unwanted noise that hampers the ability of a predictive model to extract useful information. Data preprocessing methods can be applied in an attempt to reduce these systematic variations and harmonize the datasets before they are used to build a machine learning model for resolving tissue of origin. RESULTS: We aimed to investigate the impact of data preprocessing steps (normalization, batch effect correction, and data scaling) through trial and comparison. Our goal was to improve the cross-study predictions of tissue of origin for common cancers on large-scale RNA-Seq datasets derived from thousands of patients and over a dozen tumor types. The results showed that the choice of data preprocessing operations affected the performance of the associated classifier models constructed for tissue of origin predictions in cancer. CONCLUSION: By using TCGA as a training set and applying data preprocessing methods, we demonstrated that batch effect correction improved performance measured by weighted F1-score in resolving tissue of origin against an independent GTEx test dataset. On the other hand, the use of data preprocessing operations worsened classification performance when the independent test dataset was aggregated from separate studies in ICGC and GEO. Therefore, based on our findings with these publicly available large-scale RNA-Seq datasets, the application of data preprocessing techniques to a machine learning pipeline is not always appropriate.
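
The weighted F1-score named in this abstract can be illustrated with a minimal sketch. It assumes the standard definition (per-class F1 averaged with weights proportional to class support); the function name and the tissue labels are hypothetical, and this is not the authors' evaluation code.

```python
from collections import Counter


def weighted_f1(y_true, y_pred):
    """Average per-class F1, weighting each class by its share of true labels."""
    support = Counter(y_true)
    total = len(y_true)
    score = 0.0
    for label, n in support.items():
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = n - tp
        denom = 2 * tp + fp + fn
        f1 = 2 * tp / denom if denom else 0.0
        score += (n / total) * f1
    return score


# Hypothetical tissue-of-origin labels for four samples.
truth = ["lung", "lung", "liver", "liver"]
calls = ["lung", "liver", "liver", "liver"]
print(round(weighted_f1(truth, calls), 4))
```

Because each class is weighted by its support, a classifier that does well only on the most common tissue types can still score highly, which is worth keeping in mind when comparing test sets with different class balance.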


Subjects
Machine Learning , Neoplasms , RNA-Seq , Humans , RNA-Seq/methods , Neoplasms/genetics , Transcriptome/genetics , Sequence Analysis, RNA/methods , Gene Expression Profiling/methods , Computational Biology/methods
2.
Data Brief ; 45: 108641, 2022 Dec.
Article in English | MEDLINE | ID: mdl-36426049

ABSTRACT

The data in this article are associated with the research paper "GigaAssay - an adaptable high-throughput saturation mutagenesis assay" [1]. The raw data are sequence reads of HIV-1 Tat cDNA amplified from cellular genomic DNA in a new single-pot saturation mutagenesis assay designated the "GigaAssay". A bioinformatic pipeline and the parameters used to analyze the data are described. Raw, processed, analyzed, and filtered data are reported. The data are processed to calculate the Tat-driven transcription activity for cells with each possible single amino acid substitution in Tat. These data can be reused to interpret Tat intermolecular interactions and HIV latency. This is one of the largest and most complete datasets regarding the impact of amino acid substitutions within a single protein on a molecular function.

3.
Genomics ; 114(4): 110439, 2022 07.
Article in English | MEDLINE | ID: mdl-35905834

ABSTRACT

High-throughput assay systems have had a large impact on understanding the mechanisms of basic cell functions. However, high-throughput assays that directly assess molecular functions are limited. Herein, we describe the "GigaAssay", a modular high-throughput one-pot assay system for measuring molecular functions of thousands of genetic variants at once. In this system, each cell was infected with one virus from a library encoding thousands of Tat mutant proteins, with each viral particle encoding a random unique molecular identifier (UMI). We demonstrate proof of concept by measuring transcription of a GFP reporter in an engineered reporter cell line driven by binding of the HIV Tat transcription factor to the HIV long terminal repeat. Infected cells were flow-sorted into 3 bins based on their GFP fluorescence readout. The transcriptional activity of each Tat mutant was calculated from the ratio of signals from each bin. The use of UMIs in the GigaAssay produced a high average accuracy (95%) and positive predictive value (98%) determined by comparison to literature benchmark data, known C-terminal truncations, and blinded independent mutant tests. Including the substitution tolerance with structure/function analysis shows restricted substitution types spatially concentrated in the Cys-rich region. Tat has abundant intragenic epistasis (10%) when single and double mutants are compared.
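
The idea of scoring each mutant from UMI-tagged reads sorted into fluorescence bins can be sketched as follows. The abstract does not give the exact ratio used, so this sketch assumes, purely for illustration, that activity is the fraction of a variant's distinct UMIs found in the brightest bin; the function name, bin names, and variant names are hypothetical.

```python
from collections import defaultdict


def activity_by_variant(reads, high_bin="high"):
    """Return {variant: fraction of its distinct UMIs seen in the high-GFP bin}.

    reads is an iterable of (variant, umi, bin_name) tuples. Collapsing reads
    to distinct UMIs means each infected cell lineage is counted once per bin,
    however many reads it produced.
    """
    umis = defaultdict(lambda: defaultdict(set))
    for variant, umi, bin_name in reads:
        umis[variant][bin_name].add(umi)
    result = {}
    for variant, bins in umis.items():
        total = sum(len(s) for s in bins.values())
        result[variant] = len(bins.get(high_bin, ())) / total
    return result


# Hypothetical reads: the duplicate (mutA, u1, low) read is collapsed by its UMI.
reads = [
    ("mutA", "u1", "low"), ("mutA", "u1", "low"), ("mutA", "u2", "low"),
    ("mutA", "u3", "high"), ("WT", "u4", "high"), ("WT", "u5", "high"),
]
print(activity_by_variant(reads))
```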


Subjects
HIV-1 , tat Gene Products, Human Immunodeficiency Virus , Cell Line , HIV Long Terminal Repeat , HIV-1/genetics , Mutagenesis , Transcriptional Activation , tat Gene Products, Human Immunodeficiency Virus/genetics , tat Gene Products, Human Immunodeficiency Virus/metabolism
4.
Trends Genet ; 38(1): 12-21, 2022 01.
Article in English | MEDLINE | ID: mdl-34340871

ABSTRACT

Human specific endogenous retrovirus H (HERVH) is highly expressed in both naive and primed stem cells and is essential for pluripotency. Despite the proven relationship between HERVH expression and pluripotency, there is no single definitive model for the function of HERVH. Instead, several hypotheses of a regulatory function have been put forward including HERVH acting as enhancers, long noncoding RNAs (lncRNAs), and most recently as markers of topologically associating domain (TAD) boundaries. Recently several enhancer-associated lncRNAs have been characterized, which bind to Mediator and are necessary for promoter-enhancer folding interactions. We propose a synergistic model of HERVH function combining relevant findings and discuss the current limitations for its role in regulation, including the lack of evidence for a pluripotency-associated target gene.


Subjects
Endogenous Retroviruses , RNA, Long Noncoding , Endogenous Retroviruses/metabolism , Enhancer Elements, Genetic , Humans , RNA, Long Noncoding/metabolism , Stem Cells/metabolism
5.
Sci Rep ; 11(1): 4482, 2021 02 24.
Article in English | MEDLINE | ID: mdl-33627720

ABSTRACT

The study aimed to utilize machine learning (ML) approaches and genomic data to develop a prediction model for bone mineral density (BMD) and identify the best modeling approach for BMD prediction. The genomic and phenotypic data of the Osteoporotic Fractures in Men Study (n = 5130) were analyzed. A genetic risk score (GRS) was calculated from 1103 associated SNPs for each participant after a comprehensive genotype imputation. Data were normalized and divided into a training set (80%) and a validation set (20%) for analysis. Random forest, gradient boosting, neural network, and linear regression were used to develop BMD prediction models separately. Ten-fold cross-validation was used for hyperparameter optimization. Mean square error and mean absolute error were used to assess model performance. When using GRS and phenotypic covariates as the predictors, the ML models and linear regression performed similarly in BMD prediction. However, when replacing GRS with the 1103 individual SNPs in the model, ML models performed significantly better than linear regression (with lasso regularization), and the gradient boosting model performed the best. Our study suggested that ML models, especially gradient boosting, can improve BMD prediction from genomic data.
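
A genetic risk score and the two error metrics mentioned above can be sketched in a few lines. This assumes a weighted GRS (allele dosage times a per-SNP effect weight, summed over SNPs); the abstract does not state whether the score was weighted, and the function names and toy numbers are hypothetical.

```python
def genetic_risk_score(dosages, weights):
    """Sum of risk-allele dosages (0, 1 or 2 per SNP) times per-SNP weights."""
    return sum(d * w for d, w in zip(dosages, weights))


def mean_squared_error(y_true, y_pred):
    """Average squared difference between observed and predicted values."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def mean_absolute_error(y_true, y_pred):
    """Average absolute difference between observed and predicted values."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


# Toy participant with three SNPs (the study used 1103) and made-up weights.
print(genetic_risk_score([0, 1, 2], [0.5, 0.2, 0.1]))
print(mean_squared_error([1.0, 2.0, 3.0], [1.0, 2.0, 5.0]))
print(mean_absolute_error([1.0, 2.0, 3.0], [1.0, 2.0, 5.0]))
```

Collapsing 1103 SNPs into a single score discards per-SNP information, which is one plausible reading of why the models trained on individual SNPs could diverge in performance from those trained on the GRS.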


Subjects
Bone Density/genetics , Bone Density/physiology , Aged , Fractures, Bone/genetics , Fractures, Bone/pathology , Genomics/methods , Genotype , Humans , Linear Models , Machine Learning , Male , Polymorphism, Single Nucleotide/genetics , Risk Assessment , Risk Factors
6.
Calcif Tissue Int ; 107(4): 353-361, 2020 10.
Article in English | MEDLINE | ID: mdl-32728911

ABSTRACT

The aims of this study were to develop fracture prediction models using machine learning approaches and genomic data and to identify the best modeling approach for fracture prediction. The genomic data of the Osteoporotic Fractures in Men cohort study (n = 5130) were analyzed. After a comprehensive genotype imputation, a genetic risk score (GRS) was calculated from 1103 associated single nucleotide polymorphisms for each participant. Data were normalized and split into a training set (80%) and a validation set (20%) for analysis. Random forest, gradient boosting, neural network, and logistic regression were used to develop prediction models for major osteoporotic fractures separately, with GRS, bone density, and other risk factors as predictors. In model training, the synthetic minority oversampling technique was used to account for the low fracture rate, and tenfold cross-validation was employed for hyperparameter optimization. In testing, the area under the curve (AUC) and accuracy were used to assess model performance. The McNemar test was employed to examine the accuracy difference between models. The results showed that the prediction performance of gradient boosting was the best, with an AUC of 0.71 and an accuracy of 0.88, and the GRS ranked as the 7th most important variable in the model. The performance of random forest and neural network was also significantly better than that of logistic regression. This study suggested that improving fracture prediction in older men can be achieved by incorporating genetic profiling and by utilizing the gradient boosting approach. This result should not be extrapolated to women or young individuals.
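
The McNemar comparison of two classifiers' accuracy can be sketched as below. The abstract does not say which variant of the test was used; this sketch assumes the exact (binomial) form, which looks only at the samples where exactly one of the two models is correct. The function name is hypothetical.

```python
from math import comb


def mcnemar_exact(y_true, pred_a, pred_b):
    """Two-sided exact McNemar p-value for paired predictions of two models."""
    # b: model A right, model B wrong; c: model A wrong, model B right.
    b = sum(1 for t, a, m in zip(y_true, pred_a, pred_b) if a == t and m != t)
    c = sum(1 for t, a, m in zip(y_true, pred_a, pred_b) if a != t and m == t)
    n = b + c
    if n == 0:
        return 1.0  # the models never disagree in correctness
    k = min(b, c)
    # Under the null, discordant pairs split 50/50 between b and c.
    p = 2 * sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, p)


# Toy case: model A is right and model B is wrong on all five samples.
print(mcnemar_exact([1] * 5, [1] * 5, [0] * 5))
```

Samples that both models classify correctly, or both misclassify, do not affect the result, so the test remains meaningful even when overall accuracy is dominated by the majority (no fracture) class.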


Subjects
Bone Density , Fractures, Bone/diagnosis , Machine Learning , Risk Assessment , Activities of Daily Living , Aged , Aged, 80 and over , Cohort Studies , Genomics , Humans , Male , Phenotype
7.
Bioessays ; 41(12): e1900126, 2019 12.
Article in English | MEDLINE | ID: mdl-31693213

ABSTRACT

Genome editing with engineered nucleases (GEENs) introduces site-specific DNA double-strand breaks (DSBs), which are repaired via nonhomologous end-joining (NHEJ) pathways that eventually create indels (insertions/deletions) in a genome. This review asks whether the features of indels resulting from gene editing can be customized. A review of the literature reveals how gene editing technologies via NHEJ pathways impact gene editing. The survey consolidates a body of literature suggesting that the type (insertion, deletion, and complex) and the approximate length of indel edits can be somewhat customized with different GEENs and by manipulating the expression of key NHEJ genes. Structural data suggest that binding of GEENs to DNA may interfere with binding of key components of DNA repair complexes, favoring either classical- or alternative-NHEJ. The hypotheses have some limitations but, if validated, will enable scientists to better control indel makeup, holding promise for basic science and clinical applications of gene editing. Also see the video abstract here https://youtu.be/vTkJtUsLi3w.


Subjects
Gene Editing/methods , CRISPR-Cas Systems/genetics , DNA/genetics , DNA/metabolism , DNA Breaks, Double-Stranded , Humans , Transcription Activator-Like Effector Nucleases/metabolism , Zinc Finger Nucleases/metabolism
9.
Mob DNA ; 10: 39, 2019.
Article in English | MEDLINE | ID: mdl-31497073

ABSTRACT

BACKGROUND: Despite the long-held assumption that transposons are normally only expressed in the germ-line, recent evidence shows that transcripts of transposable element (TE) sequences are frequently found in somatic cells. However, the extent of variation in TE transcript levels across different tissues and different individuals is unknown, and the co-expression between TEs and host gene mRNAs has not been examined. RESULTS: Here we report the variation in TE-derived transcript levels across tissues and between individuals observed in the non-tumorous tissues collected for The Cancer Genome Atlas. We found core TE co-expression modules consisting mainly of transposons, showing correlated expression across broad classes of TEs. Despite this co-expression within tissues, there are individual TE loci that exhibit tissue-specific expression patterns when compared across tissues. The core TE modules were negatively correlated with other gene modules that consisted of immune response genes in interferon signaling. KRAB Zinc Finger Proteins (KZFPs) were over-represented gene members of the TE modules, showing positive correlation across multiple tissues. However, we did not find overlap between TE-KZFP pairs that are co-expressed and TE-KZFP pairs that are bound in published ChIP-seq studies. CONCLUSIONS: We find unexpected variation in TE-derived transcripts, within and across non-tumorous tissues. We describe a broad view of the RNA state for non-tumorous tissues exhibiting higher levels of TE transcripts. Tissues with higher levels of TE transcripts have a broad range of TEs co-expressed, with high expression of a large number of KZFPs, and lower RNA levels of immune genes.

10.
Mob DNA ; 10: 29, 2019.
Article in English | MEDLINE | ID: mdl-31320939

ABSTRACT

Though transposable elements make up around half of the human genome, the repetitive nature of their sequences makes it difficult to accurately align conventional sequencing reads. However, in light of new advances in sequencing technology, such as increased read length and paired-end libraries, these repetitive regions are now becoming easier to align to. This study investigates the mappability of transposable elements with 50 bp, 76 bp and 100 bp paired-end read libraries. With respect to those read lengths and allowing for 3 mismatches during alignment, over 68%, 85%, and 88% of all transposable elements in the RepeatMasker database are uniquely mappable, suggesting that accurate locus-specific mapping of older transposable elements is well within reach.
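
Why longer reads raise mappability can be shown with a deliberately simplified sketch: it counts the fraction of start positions whose k-mer occurs exactly once in a toy sequence. This is an assumption-laden stand-in for the study's analysis, which used paired-end reads and allowed 3 mismatches; the sketch uses exact matching on a single strand, and the function name and sequence are hypothetical.

```python
from collections import Counter


def unique_kmer_fraction(genome, k):
    """Fraction of start positions whose k-mer appears only once in genome.

    Assumes len(genome) >= k. Reverse-complement matches are ignored.
    """
    kmers = [genome[i:i + k] for i in range(len(genome) - k + 1)]
    counts = Counter(kmers)
    unique = sum(1 for km in kmers if counts[km] == 1)
    return unique / len(kmers)


toy = "ACGTACGTTT"  # contains the repeat ACGT twice
print(unique_kmer_fraction(toy, 4))  # the repeated 4-mer is not unique
print(unique_kmer_fraction(toy, 5))  # 5-mers span past the repeat
```

In the toy sequence, extending the read by a single base is enough to reach into flanking sequence that distinguishes the two repeat copies, which is the same effect the abstract reports at genome scale.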

11.
Mol Biol Evol ; 35(1): 50-65, 2018 01 01.
Article in English | MEDLINE | ID: mdl-29309688

ABSTRACT

Experimental evolution affords the opportunity to investigate adaptation to stressful environments. Studies combining experimental evolution with whole-genome resequencing have provided insight into the dynamics of adaptation and a new tool to uncover genes associated with polygenic traits. Here, we selected for starvation resistance in populations of Drosophila melanogaster for over 80 generations. In response, the starvation-selected lines developed an obese condition, storing nearly twice the level of total lipids as their unselected controls. Although these fats provide a ∼3-fold increase in starvation resistance, the imbalance in lipid homeostasis incurs an evolutionary cost. Some of these tradeoffs resemble obesity-associated pathologies in mammals including metabolic depression, low activity levels, dilated cardiomyopathy, and disrupted sleeping patterns. To determine the genetic basis of these traits, we resequenced genomic DNA from the selected lines and their controls. We found 1,046,373 polymorphic sites, many of which diverged between selection treatments. In addition, we found a wide range of genetic heterogeneity between the replicates of the selected lines, suggesting multiple mechanisms of adaptation. Genome-wide heterozygosity was low in the selected populations, with many large blocks of SNPs nearing fixation. We found candidate loci under selection by using an algorithm to control for the effects of genetic drift. These loci were mapped to a set of 382 genes, which were associated with many processes including nutrient response, catabolic metabolism, and lipid droplet function. The results of our study speak to the evolutionary origins of obesity and provide new targets to understand the polygenic nature of obesity in a unique model system.


Subjects
Drosophila melanogaster/genetics , Obesity/genetics , Starvation/genetics , Acclimatization , Adaptation, Physiological/genetics , Animals , Directed Molecular Evolution/methods , Disease Models, Animal , Evolution, Molecular , Genome, Insect/genetics , Genome-Wide Association Study/methods , Models, Genetic , Multifactorial Inheritance , Selection, Genetic/genetics
12.
J Mol Evol ; 83(3-4): 137-146, 2016 Oct.
Article in English | MEDLINE | ID: mdl-27770175

ABSTRACT

Evolutionary constraint for insertions and deletions (indels) is not necessarily equal to constraint for nucleotide substitutions for any given region of a genome. Knowing the variation in indel-specific evolutionary rates across the sequence will aid our understanding of evolutionary constraints on indels, and help us infer how indels have contributed to the evolution of the sequence. However, unlike for nucleotide substitutions, there has been no phylogenetic method that can statistically infer significantly different rates of indels across the sequence space independent of substitution rates. Here, we have developed software that finds sites with accelerated evolutionary rates specific to indels, by introducing a scaling parameter that only applies to the indel rates and not to the nucleotide substitution rates. Using the software, we show that we can find regions of accelerated rates of indels in the protein alignments of primate genomes. We also confirm that the sites that have high rates of indels are different from the sites that have high rates of nucleotide substitutions within the protein sequences. By identifying regions with accelerated rates of indels independent of nucleotide substitutions, we will be able to better understand the impact of indel mutations on protein sequence evolution.


Subjects
INDEL Mutation , Models, Genetic , Mutation Rate , Animals , Computer Simulation , Evolution, Molecular , Humans , Nucleotides/genetics , Phylogeny , Proteins/genetics , Sequence Deletion , Software , Species Specificity
13.
Science ; 347(6217): 1258522, 2015 Jan 02.
Article in English | MEDLINE | ID: mdl-25554792

ABSTRACT

Variation in vectorial capacity for human malaria among Anopheles mosquito species is determined by many factors, including behavior, immunity, and life history. To investigate the genomic basis of vectorial capacity and explore new avenues for vector control, we sequenced the genomes of 16 anopheline mosquito species from diverse locations spanning ~100 million years of evolution. Comparative analyses show faster rates of gene gain and loss, elevated gene shuffling on the X chromosome, and more intron losses, relative to Drosophila. Some determinants of vectorial capacity, such as chemosensory genes, do not show elevated turnover but instead diversify through protein-sequence changes. This dynamism of anopheline genes and genomes may contribute to their flexible capacity to take advantage of new ecological niches, including adapting to humans as primary hosts.


Subjects
Anopheles/genetics , Evolution, Molecular , Genome, Insect , Insect Vectors/genetics , Malaria/transmission , Animals , Anopheles/classification , Base Sequence , Chromosomes, Insect/genetics , Drosophila/genetics , Humans , Insect Vectors/classification , Molecular Sequence Data , Phylogeny , Sequence Alignment
14.
Mol Biol Evol ; 30(8): 1987-97, 2013 Aug.
Article in English | MEDLINE | ID: mdl-23709260

ABSTRACT

Current sequencing methods produce large amounts of data, but genome assemblies constructed from these data are often fragmented and incomplete. Incomplete and error-filled assemblies result in many annotation errors, especially in the number of genes present in a genome. This means that methods attempting to estimate rates of gene duplication and loss often will be misled by such errors and that rates of gene family evolution will be consistently overestimated. Here, we present a method that takes these errors into account, allowing one to accurately infer rates of gene gain and loss among genomes even with low assembly and annotation quality. The method is implemented in the newest version of the software package CAFE, along with several other novel features. We demonstrate the accuracy of the method with extensive simulations and reanalyze several previously published data sets. Our results show that errors in genome annotation do lead to higher inferred rates of gene gain and loss but that CAFE 3 sufficiently accounts for these errors to provide accurate estimates of important evolutionary parameters.


Subjects
Genome , Molecular Sequence Annotation/methods , Sequence Analysis, DNA/methods , Software , Algorithms , Computational Biology/methods , Evolution, Molecular , Genomics/methods , Reproducibility of Results
15.
Nat Commun ; 3: 913, 2012 Jun 26.
Article in English | MEDLINE | ID: mdl-22735441

ABSTRACT

Ganoderma lucidum is a widely used medicinal macrofungus in traditional Chinese medicine that creates a diverse set of bioactive compounds. Here we report its 43.3-Mb genome, encoding 16,113 predicted genes, obtained using next-generation sequencing and optical mapping approaches. The sequence analysis reveals an impressive array of genes encoding cytochrome P450s (CYPs), transporters and regulatory proteins that cooperate in secondary metabolism. The genome also encodes one of the richest sets of wood degradation enzymes among all of the sequenced basidiomycetes. In all, 24 physical CYP gene clusters are identified. Moreover, 78 CYP genes are coexpressed with lanosterol synthase, and 16 of these show high similarity to fungal CYPs that specifically hydroxylate testosterone, suggesting their possible roles in triterpenoid biosynthesis. The elucidation of the G. lucidum genome makes this organism a potential model system for the study of secondary metabolic pathways and their regulation in medicinal fungi.


Subjects
Genome, Fungal/genetics , Reishi/genetics , Fungal Proteins/genetics , Reishi/metabolism
16.
Fly (Austin) ; 6(2): 121-5, 2012.
Article in English | MEDLINE | ID: mdl-22634624

ABSTRACT

Genes occasionally change their location in the genome through inter-chromosomal duplication and loss. These changes happen as mistakes during recombination or through retrotransposition. In Han and Hahn 2011,(1) we surveyed the genomes of ten Drosophila species, to identify and characterize the gene transposition events in the history of these species. In the paper, we showed that the rate of gene transposition in Drosophila is higher than previously appreciated. To understand the process of gene transposition, we examined the sequences, locations, and functions of the transposed genes. Based on the elevated rate of sequence evolution in transposed genes and the frequent movements near the centromeres and telomeres, we could not reject the hypothesis that these are mutations fixed through relaxed selection. But, by examining the functions of transposed genes more carefully, we found that genes with male-specific functions and genes with female-specific functions move in opposite directions involving the X chromosome. We also found an over-representation of chromosome related functions among the transposed genes. These observations suggest the possibility of particular selection pressures contributing to gene transpositions in Drosophila.


Subjects
Chromosomes, Insect , Drosophila/genetics , Gene Rearrangement , Genes, Insect , Animals , Female , Male
17.
Genetics ; 190(2): 813-25, 2012 Feb.
Article in English | MEDLINE | ID: mdl-22095076

ABSTRACT

Gene transposition puts a new gene copy in a novel genomic environment. Moreover, genes moving between the autosomes and the X chromosome experience change in several evolutionary parameters. Previous studies of gene transposition have not utilized the phylogenetic framework that becomes possible with the availability of whole genomes from multiple species. Here we used parsimonious reconstruction on the genomic distribution of gene families to analyze interchromosomal gene transposition in Drosophila. We identified 782 genes that have moved chromosomes within the phylogeny of 10 Drosophila species, including 87 gene families with multiple independent movements on different branches of the phylogeny. Using this large catalog of transposed genes, we detected accelerated sequence evolution in duplicated genes that transposed when compared to the parental copy at the original locus. We also observed a more refined picture of the biased movement of genes from the X chromosome to the autosomes. The bias of X-to-autosome movement was significantly stronger for RNA-based movements than for DNA-based movements, and among DNA-based movements there was an excess of genes moving onto the X chromosome as well. Genes involved in female-specific functions moved onto the X chromosome while genes with male-specific functions moved off the X. There was a significant overrepresentation of proteins involving chromosomal function among transposed genes, suggesting that genetic conflict between sexes and among chromosomes may be a driving force behind gene transposition in Drosophila.
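
Parsimony-based reconstruction of where a gene resided on ancestral branches can be illustrated with Fitch's small-parsimony algorithm. The abstract says only that a parsimonious reconstruction was used, so Fitch's algorithm is swapped in here as a standard example rather than as the authors' method; the tree shape, species names, and chromosome-arm labels are hypothetical.

```python
def fitch(tree, states):
    """Bottom-up Fitch pass on a binary tree.

    tree is a leaf name (str) or a 2-tuple of subtrees; states maps each leaf
    to the chromosome carrying the gene. Returns (candidate ancestral states,
    minimum number of chromosome changes in the subtree).
    """
    if isinstance(tree, str):
        return {states[tree]}, 0
    left, right = tree
    left_set, left_cost = fitch(left, states)
    right_set, right_cost = fitch(right, states)
    common = left_set & right_set
    if common:
        return common, left_cost + right_cost
    # No shared state: at least one movement happened below this node.
    return left_set | right_set, left_cost + right_cost + 1


tree = (("sp1", "sp2"), ("sp3", "sp4"))
location = {"sp1": "X", "sp2": "2L", "sp3": "2L", "sp4": "2L"}
print(fitch(tree, location))
```

In this toy case the reconstruction places the gene on 2L at the root with a single movement, which implies a move onto the X on the branch leading to sp1. Tallying such inferred movements by direction is how an excess of X-to-autosome or autosome-to-X moves could be counted.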


Subjects
Chromosomes, Insect , DNA Transposable Elements , Drosophila/genetics , Genes, Insect , Animals , Chromosome Segregation , Female , Gene Duplication , Genome, Insect , Male , Recombination, Genetic
18.
Evolution ; 65(1): 231-45, 2011 Jan.
Article in English | MEDLINE | ID: mdl-20731717

ABSTRACT

Developmental mechanisms play an important role in determining the costs, limits, and evolutionary consequences of phenotypic plasticity. One issue central to these claims is the hypothesis of developmental decoupling, where alternate morphs result from evolutionarily independent developmental pathways. We address this assumption through a microarray study that tests whether differences in gene expression between alternate morphs are as divergent as those between sexes, a classic example of developmental decoupling. We then examine whether genes with morph-biased expression are less conserved than genes with shared expression between morphs, as predicted if developmental decoupling relaxes pleiotropic constraints on divergence. We focus on the developing horns and brains of two species of horned beetles with impressive sexual- and morph-dimorphism in the expression of horns and fighting behavior. We find that patterns of gene expression were as divergent between morphs as they were between sexes. However, overall patterns of gene expression were also highly correlated across morphs and sexes. Morph-biased genes were more evolutionarily divergent, suggesting a role of relaxed pleiotropic constraints or relaxed selection. Together these results suggest that alternate morphs are to some extent developmentally decoupled, and that this decoupling has significant evolutionary consequences. However, alternative morphs may not be as developmentally decoupled as sometimes assumed and such hypotheses of development should be revisited and refined.


Subjects
Coleoptera/anatomy & histology , Coleoptera/genetics , Animals , Biological Evolution , Coleoptera/classification , Coleoptera/growth & development , Female , Gene Expression Profiling , Gene Expression Regulation, Developmental , Genetic Pleiotropy , Hawaii , Male , Phenotype , Phylogeny , Sex Characteristics , Virginia
19.
Evolution ; 64(6): 1541-57, 2010 Jun.
Article in English | MEDLINE | ID: mdl-20298429

ABSTRACT

The two "rules of speciation"--the Large X-effect and Haldane's rule--hold throughout the animal kingdom, but the underlying genetic mechanisms that cause them are still unclear. Two predominant explanations--the "dominance theory" and faster male evolution--both have some empirical support, suggesting that the genetic basis of these rules is likely multifarious. We revisit one historical explanation for these rules, based on dysfunctional genetic interactions involving genes recently moved between chromosomes. We suggest that gene movement specifically off or onto the X chromosome is another mechanism that could contribute to the two rules, especially as X chromosome movements can be subject to unique sex-specific and sex chromosome specific consequences in hybrids. Our hypothesis is supported by patterns emerging from comparative genomic data, including a strong bias in interchromosomal gene movements involving the X and an overrepresentation of male reproductive functions among chromosomally relocated genes. In addition, our model indicates that the contribution of gene movement to the two rules in any specific group will depend upon key developmental and reproductive parameters that are taxon specific. We provide several testable predictions that can be used to assess the importance of gene movement as a contributor to these rules in the future.


Subjects
Genes , Genetic Speciation , X Chromosome , Animals , Female , Male
20.
BMC Bioinformatics ; 10: 356, 2009 Oct 27.
Article in English | MEDLINE | ID: mdl-19860910

ABSTRACT

BACKGROUND: Evolutionary trees are central to a wide range of biological studies. In many of these studies, tree nodes and branches need to be associated (or annotated) with various attributes. For example, in studies concerned with organismal relationships, tree nodes are associated with taxonomic names, whereas tree branches have lengths and oftentimes support values. Gene trees used in comparative genomics or phylogenomics are usually annotated with taxonomic information, genome-related data, such as gene names and functional annotations, as well as events such as gene duplications, speciations, or exon shufflings, combined with information related to the evolutionary tree itself. The data standards currently used for evolutionary trees have limited capacities to incorporate such annotations of different data types. RESULTS: We developed an XML language, named phyloXML, for describing evolutionary trees, as well as various associated data items. PhyloXML provides elements for commonly used items, such as branch lengths, support values, taxonomic names, and gene names and identifiers. By using "property" elements, phyloXML can be adapted to novel and unforeseen use cases. We also developed various software tools for reading, writing, conversion, and visualization of phyloXML formatted data. CONCLUSION: PhyloXML is an XML language defined by a complete schema in XSD that allows storing and exchanging the structures of evolutionary trees as well as associated data. More information about phyloXML itself, the XSD schema, as well as tools implementing and supporting phyloXML, is available at http://www.phyloxml.org.
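
To make the annotated-tree idea concrete, the sketch below builds a small phyloXML-style document with Python's standard library. The element names (phyloxml, phylogeny, clade, name, branch_length, confidence) follow the phyloXML format as generally documented, but the output is not validated against the XSD here, and the helper functions, tree, and values are hypothetical.

```python
import xml.etree.ElementTree as ET

NS = "http://www.phyloxml.org"


def add_clade(parent, name=None, branch_length=None, support=None):
    """Append a <clade> with optional name, branch length and bootstrap support."""
    clade = ET.SubElement(parent, "clade")
    if name is not None:
        ET.SubElement(clade, "name").text = name
    if branch_length is not None:
        ET.SubElement(clade, "branch_length").text = str(branch_length)
    if support is not None:
        conf = ET.SubElement(clade, "confidence", {"type": "bootstrap"})
        conf.text = str(support)
    return clade


def build_example():
    """Return the tree ((A,B),C) as a phyloXML-style string."""
    root = ET.Element("phyloxml", {"xmlns": NS})
    phylogeny = ET.SubElement(root, "phylogeny", {"rooted": "true"})
    top = add_clade(phylogeny)
    inner = add_clade(top, branch_length=0.06, support=89)
    add_clade(inner, name="A", branch_length=0.102)
    add_clade(inner, name="B", branch_length=0.23)
    add_clade(top, name="C", branch_length=0.4)
    return ET.tostring(root, encoding="unicode")


print(build_example())
```

Nesting clade elements mirrors the tree topology directly, so annotations such as support values attach to the node they describe instead of being packed into a label string.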


Subjects
Biological Evolution , Computational Biology/methods , Genomics/methods , Phylogeny , Software , Databases, Genetic