1.
Bioinformatics ; 37(2): 162-170, 2021 04 19.
Article in English | MEDLINE | ID: mdl-32797179

ABSTRACT

MOTIVATION: Protein function prediction is a difficult bioinformatics problem. Many recent methods use deep neural networks to learn complex sequence representations and predict function from these. Deep supervised models require a lot of labeled training data which are not available for this task. However, a very large amount of protein sequences without functional labels is available. RESULTS: We applied an existing deep sequence model that had been pretrained in an unsupervised setting on the supervised task of protein molecular function prediction. We found that this complex feature representation is effective for this task, outperforming hand-crafted features such as one-hot encoding of amino acids, k-mer counts, secondary structure and backbone angles. Also, it partly negates the need for complex prediction models, as a two-layer perceptron was enough to achieve competitive performance in the third Critical Assessment of Functional Annotation benchmark. We also show that combining this sequence representation with protein 3D structure information does not lead to performance improvement, hinting that 3D structure is also potentially learned during the unsupervised pretraining. AVAILABILITY AND IMPLEMENTATION: Implementations of all used models can be found at https://github.com/stamakro/GCN-for-Structure-and-Function. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
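
The abstract above names one-hot encoding of amino acids and k-mer counts among the hand-crafted baseline features. The following is a minimal sketch of those two encodings, assuming the standard 20-letter amino acid alphabet; the function names and the toy sequence are hypothetical and are not taken from the authors' repository.

```python
from collections import Counter

# Standard 20 amino acids; this ordering is an arbitrary choice for the sketch.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def one_hot(sequence):
    """Return one 20-dimensional indicator vector per residue."""
    index = {aa: i for i, aa in enumerate(AMINO_ACIDS)}
    vectors = []
    for residue in sequence:
        row = [0] * len(AMINO_ACIDS)
        if residue in index:  # non-standard residues stay all-zero
            row[index[residue]] = 1
        vectors.append(row)
    return vectors


def kmer_counts(sequence, k=2):
    """Count overlapping k-mers in the sequence."""
    return Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))


toy = "MKVLAAGMKV"  # hypothetical toy sequence
encoded = one_hot(toy)
counts = kmer_counts(toy, k=2)
```

Both encodings are fixed and involve no learning, which is the contrast the abstract draws with the pretrained sequence representation.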


Subject(s)
Proteins , Software , Amino Acid Sequence , Neural Networks, Computer , Proteins/genetics
2.
Bioinformatics ; 36(4): 1182-1190, 2020 02 15.
Article in English | MEDLINE | ID: mdl-31562759

ABSTRACT

MOTIVATION: Co-expression of two genes across different conditions is indicative of their involvement in the same biological process. However, when using RNA-Seq datasets with many experimental conditions from diverse sources, only a subset of the experimental conditions is expected to be relevant for finding genes related to a particular Gene Ontology (GO) term. Therefore, we hypothesize that when the purpose is to find similarly functioning genes, the co-expression of genes should not be determined on all samples but only on those samples informative for the GO term of interest. RESULTS: To address this, we developed Metric Learning for Co-expression (MLC), a fast algorithm that assigns a GO-term-specific weight to each expression sample. The goal is to obtain a weighted co-expression measure that is more suitable than the unweighted Pearson correlation for applying Guilt-By-Association-based function predictions. More specifically, if two genes are annotated with a given GO term, MLC tries to maximize their weighted co-expression and, in addition, if one of them is not annotated with that term, the weighted co-expression is minimized. Our experiments on publicly available Arabidopsis thaliana RNA-Seq data demonstrate that MLC outperforms standard Pearson correlation in term-centric performance. Moreover, our method is particularly good at more specific terms, which are the most interesting. Finally, by observing the sample weights for a particular GO term, one can identify which experiments are important for learning that term and potentially identify novel conditions that are relevant, as demonstrated by experiments in both A. thaliana and Pseudomonas aeruginosa. AVAILABILITY AND IMPLEMENTATION: MLC is available as a Python package at www.github.com/stamakro/MLC. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
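
The weighted co-expression measure described above can be illustrated with a weighted Pearson correlation. This is only a sketch under stated assumptions: the weights are hand-picked here, whereas MLC learns them per GO term, and the function name and expression values are hypothetical and do not reflect the API of the MLC package.

```python
import math


def weighted_pearson(x, y, weights):
    """Pearson correlation in which sample i contributes with weight weights[i].

    Assumes non-negative weights and non-zero weighted variance in x and y.
    """
    total = sum(weights)
    mean_x = sum(w * a for w, a in zip(weights, x)) / total
    mean_y = sum(w * b for w, b in zip(weights, y)) / total
    cov = sum(w * (a - mean_x) * (b - mean_y)
              for w, a, b in zip(weights, x, y)) / total
    var_x = sum(w * (a - mean_x) ** 2 for w, a in zip(weights, x)) / total
    var_y = sum(w * (b - mean_y) ** 2 for w, b in zip(weights, y)) / total
    return cov / math.sqrt(var_x * var_y)


# Hypothetical expression of two genes over four samples; the last sample
# disagrees with the trend shared by the first three.
gene_a = [1.0, 2.0, 3.0, 10.0]
gene_b = [2.0, 4.0, 6.0, 0.0]

unweighted = weighted_pearson(gene_a, gene_b, [1.0, 1.0, 1.0, 1.0])
reweighted = weighted_pearson(gene_a, gene_b, [1.0, 1.0, 1.0, 0.0])
```

With uniform weights the function reduces to the ordinary Pearson correlation, which is negative for this toy data. Setting the weight of the fourth sample to zero recovers the perfect linear relationship among the remaining three.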


Subject(s)
Algorithms , RNA-Seq , Gene Ontology , Phenotype
3.
Bioinformatics ; 35(7): 1116-1124, 2019 04 01.
Article in English | MEDLINE | ID: mdl-30169569

ABSTRACT

MOTIVATION: Most automatic functional annotation methods assign Gene Ontology (GO) terms to proteins based on annotations of highly similar proteins. We advocate that proteins that are less similar are still informative. Also, despite their simplicity and structure, GO terms seem to be hard for computers to learn, in particular the Biological Process ontology, which has the most terms (>29 000). We propose to use Label-Space Dimensionality Reduction (LSDR) techniques to exploit the redundancy of GO terms and transform them into a more compact latent representation that is easier to predict. RESULTS: We compare proteins using a sequence similarity profile (SSP) to a set of annotated training proteins. We introduce two new LSDR methods, one based on the structure of the GO, and one based on semantic similarity of terms. We show that these LSDR methods, as well as three existing ones, improve the Critical Assessment of Functional Annotation performance of several function prediction algorithms. Cross-validation experiments on Arabidopsis thaliana proteins pinpoint the superiority of our GO-aware LSDR over generic LSDR. Our experiments on A. thaliana proteins show that the SSP representation in combination with a kNN classifier outperforms state-of-the-art and baseline methods in terms of cross-validated F-measure. AVAILABILITY AND IMPLEMENTATION: Source code for the experiments is available at https://github.com/stamakro/SSP-LSDR. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
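
The combination of a similarity profile with a kNN classifier mentioned above might look roughly like the sketch below. Everything concrete in it is an assumption for illustration: the profile values, the Euclidean distance, the neighbour-fraction scoring, the placeholder term labels and the function name are hypothetical and are not taken from the SSP-LSDR source code.

```python
import math


def knn_term_scores(query_profile, train_profiles, train_terms, k=2):
    """Score each term by the fraction of the k nearest training proteins annotated with it."""
    distances = sorted(
        (math.dist(query_profile, profile), name)
        for name, profile in train_profiles.items()
    )
    neighbours = [name for _, name in distances[:k]]
    scores = {}
    for name in neighbours:
        for term in train_terms[name]:
            scores[term] = scores.get(term, 0.0) + 1.0 / k
    return scores


# Each profile holds the (hypothetical) similarity of a protein to the three
# training proteins P1, P2 and P3, in that order.
train_profiles = {
    "P1": [1.0, 0.8, 0.1],
    "P2": [0.8, 1.0, 0.2],
    "P3": [0.1, 0.2, 1.0],
}
# Placeholder labels standing in for GO terms.
train_terms = {"P1": {"TERM_A", "TERM_B"}, "P2": {"TERM_A"}, "P3": {"TERM_C"}}

scores = knn_term_scores([0.9, 0.7, 0.1], train_profiles, train_terms, k=2)
```

In this toy case the query lies closest to P1 and P2, so the term they share receives the highest score and the term carried only by the distant P3 receives none.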


Subject(s)
Computational Biology , Software , Algorithms , Amino Acid Sequence , Gene Ontology , Molecular Sequence Annotation
4.
Plant J ; 80(1): 136-48, 2014 Oct.
Article in English | MEDLINE | ID: mdl-25039268

ABSTRACT

We explored genetic variation by sequencing a selection of 84 tomato accessions and related wild species representative of the Lycopersicon, Arcanum, Eriopersicon and Neolycopersicon groups, which has yielded a huge amount of precious data on sequence diversity in the tomato clade. Three new reference genomes were reconstructed to support our comparative genome analyses. Comparative sequence alignment revealed group-, species- and accession-specific polymorphisms, explaining characteristic fruit traits and growth habits in the various cultivars. Using gene models from the annotated Heinz 1706 reference genome, we observed differences in the ratio between non-synonymous and synonymous SNPs (dN/dS) in fruit diversification and plant growth genes compared to a random set of genes, indicating positive selection and differences in selection pressure between crop accessions and wild species. In wild species, the number of single-nucleotide polymorphisms (SNPs) exceeds 10 million, i.e. 20-fold higher than found in most of the crop accessions, indicating dramatic genetic erosion of crop and heirloom tomatoes. In addition, the highest levels of heterozygosity were found for allogamous self-incompatible wild species, while facultative and autogamous self-compatible species display a lower heterozygosity level. Using whole-genome SNP information for maximum-likelihood analysis, we achieved complete tree resolution, whereas maximum-likelihood trees based on SNPs from ten fruit and growth genes show incomplete resolution for the crop accessions, partly due to the effect of heterozygous SNPs. Finally, results suggest that phylogenetic relationships are correlated with habitat, indicating the occurrence of geographical races within these groups, which is of practical importance for Solanum genome evolution studies.


Subject(s)
Genetic Variation , Genome, Plant/genetics , Solanum lycopersicum/genetics , Breeding , Chromosome Mapping , DNA, Plant/chemistry , DNA, Plant/genetics , Fruit/genetics , High-Throughput Nucleotide Sequencing , Molecular Sequence Data , Phenotype , Phylogeny , Polymorphism, Single Nucleotide , Sequence Alignment , Sequence Analysis, DNA , Species Specificity
5.
BMC Genomics ; 16: 374, 2015 May 10.
Article in English | MEDLINE | ID: mdl-25958312

ABSTRACT

BACKGROUND: In flowering plants it has been shown that de novo genome assemblies of different species and genera show a significant drop in the proportion of alignable sequence. Within a plant species, however, it is assumed that different haplotypes of the same chromosome align well. In this paper we have compared three de novo assemblies of potato chromosome 5 and report on the sequence variation and the proportion of sequence that can be aligned. RESULTS: For the diploid potato clone RH89-039-16 (RH) we produced two linkage phase controlled and haplotype-specific assemblies of chromosome 5 based on BAC-by-BAC sequencing, which were aligned to each other and compared to the 52 Mb chromosome 5 reference sequence of the doubled monoploid clone DM 1-3 516 R44 (DM). We identified 17.0 Mb of non-redundant sequence scaffolds derived from euchromatic regions of RH and 38.4 Mb from the pericentromeric heterochromatin. For 32.7 Mb of the RH sequences the correct position and order on chromosome 5 was determined, using genetic markers, fluorescence in situ hybridisation and alignment to the DM reference genome. This ordered fraction of the RH sequences is situated in the euchromatic arms and in the heterochromatin borders. In the euchromatic regions, the sequence collinearity between the three chromosomal homologs is good, but interruption of collinearity occurs at nine gene clusters. Towards and into the heterochromatin borders, absence of collinearity due to structural variation was more extensive and was caused by hemizygous and poorly aligning regions of up to 450 kb in length. In the most central heterochromatin, a total of 22.7 Mb sequence from both RH haplotypes remained unordered. These RH sequences have very few syntenic regions and represent a non-alignable region between the RH and DM heterochromatin haplotypes of chromosome 5. 
CONCLUSIONS: Our results show that among homologous potato chromosomes large regions are present with dramatic loss of sequence collinearity. This stresses the need for more de novo reference assemblies in order to capture genome diversity in this crop. The discovery of three highly diverged pericentric heterochromatin haplotypes within one species is a novelty in plant genome analysis. The possible origin and cytogenetic implication of this heterochromatin haplotype diversity are discussed.


Subject(s)
Chromosomes, Plant , Euchromatin/genetics , Heterochromatin/genetics , Solanum tuberosum/genetics , Chromosome Mapping , Chromosomes, Artificial, Bacterial , Euchromatin/metabolism , Genetic Linkage , Genotype , Haplotypes , Heterochromatin/metabolism , In Situ Hybridization, Fluorescence , Polymorphism, Genetic
6.
PLoS Genet ; 8(11): e1003088, 2012.
Article in English | MEDLINE | ID: mdl-23209441

ABSTRACT

We sequenced and compared the genomes of the Dothideomycete fungal plant pathogens Cladosporium fulvum (Cfu) (syn. Passalora fulva) and Dothistroma septosporum (Dse) that are closely related phylogenetically, but have different lifestyles and hosts. Although both fungi grow extracellularly in close contact with host mesophyll cells, Cfu is a biotroph infecting tomato, while Dse is a hemibiotroph infecting pine. The genomes of these fungi have a similar set of genes (70% of gene content in both genomes are homologs), but differ significantly in size (Cfu >61.1-Mb; Dse 31.2-Mb), which is mainly due to the difference in repeat content (47.2% in Cfu versus 3.2% in Dse). Recent adaptation to different lifestyles and hosts is suggested by diverged sets of genes. Cfu contains an α-tomatinase gene that we predict might be required for detoxification of tomatine, while this gene is absent in Dse. Many genes encoding secreted proteins are unique to each species and the repeat-rich areas in Cfu are enriched for these species-specific genes. In contrast, conserved genes suggest common host ancestry. Homologs of Cfu effector genes, including Ecp2 and Avr4, are present in Dse and induce a Cf-Ecp2- and Cf-4-mediated hypersensitive response, respectively. Strikingly, genes involved in production of the toxin dothistromin, a likely virulence factor for Dse, are conserved in Cfu, but their expression differs markedly with essentially no expression by Cfu in planta. Likewise, Cfu has a carbohydrate-degrading enzyme catalog that is more similar to that of necrotrophs or hemibiotrophs and a larger pectinolytic gene arsenal than Dse, but many of these genes are not expressed in planta or are pseudogenized. Overall, comparison of their genomes suggests that these closely related plant pathogens had a common ancestral host but since adapted to different hosts and lifestyles by a combination of differentiated gene content, pseudogenization, and gene regulation.


Subject(s)
Adaptation, Physiological/genetics , Cladosporium/genetics , Genome , Host-Pathogen Interactions , Base Sequence , Fungal Proteins/genetics , Gene Expression Regulation, Fungal , Solanum lycopersicum/genetics , Solanum lycopersicum/parasitology , Phylogeny , Pinus/genetics , Pinus/parasitology , Plant Diseases/genetics
7.
PLoS Genet ; 7(6): e1002070, 2011 Jun.
Article in English | MEDLINE | ID: mdl-21695235

ABSTRACT

The plant-pathogenic fungus Mycosphaerella graminicola (asexual stage: Septoria tritici) causes septoria tritici blotch, a disease that greatly reduces the yield and quality of wheat. This disease is economically important in most wheat-growing areas worldwide and threatens global food production. Control of the disease has been hampered by a limited understanding of the genetic and biochemical bases of pathogenicity, including mechanisms of infection and of resistance in the host. Unlike most other plant pathogens, M. graminicola has a long latent period during which it evades host defenses. Although this type of stealth pathogenicity occurs commonly in Mycosphaerella and other Dothideomycetes, the largest class of plant-pathogenic fungi, its genetic basis is not known. To address this problem, the genome of M. graminicola was sequenced completely. The finished genome contains 21 chromosomes, eight of which could be lost with no visible effect on the fungus and thus are dispensable. This eight-chromosome dispensome is dynamic in field and progeny isolates, is different from the core genome in gene and repeat content, and appears to have originated by ancient horizontal transfer from an unknown donor. Synteny plots of the M. graminicola chromosomes versus those of the only other sequenced Dothideomycete, Stagonospora nodorum, revealed conservation of gene content but not order or orientation, suggesting a high rate of intra-chromosomal rearrangement in one or both species. This observed "mesosynteny" is very different from synteny seen between other organisms. A surprising feature of the M. graminicola genome compared to other sequenced plant pathogens was that it contained very few genes for enzymes that break down plant cell walls, which was more similar to endophytes than to pathogens. The stealth pathogenesis of M. graminicola probably involves degradation of proteins rather than carbohydrates to evade host defenses during the biotrophic stage of infection and may have evolved from endophytic ancestors.


Subject(s)
Ascomycota/genetics , Chromosomes, Fungal/genetics , Genome, Fungal/genetics , Ascomycota/metabolism , Ascomycota/pathogenicity , Gene Rearrangement , Plant Diseases/microbiology , Synteny , Triticum/microbiology
8.
Nucleic Acids Res ; 39(Web Server issue): W524-7, 2011 Jul.
Article in English | MEDLINE | ID: mdl-21609962

ABSTRACT

Although several tools for the analysis of ChIP-seq data have been published recently, there is a growing demand, in particular in the plant research community, for computational resources with which such data can be processed, analyzed, stored, visualized and integrated within a single, user-friendly environment. To accommodate this demand, we have developed PRI-CAT (Plant Research International ChIP-seq analysis tool), a web-based workflow tool for the management and analysis of ChIP-seq experiments. PRI-CAT is currently focused on Arabidopsis, but will be extended with other plant species in the near future. Users can directly submit their sequencing data to PRI-CAT for automated analysis. A QuickLoad server compatible with genome browsers is implemented for the storage and visualization of DNA-binding maps. Submitted datasets and results can be made publicly available through PRI-CAT, a feature that will enable community-based integrative analysis and visualization of ChIP-seq experiments. Secondary analysis of data can be performed with the aid of GALAXY, an external framework for tool and data integration. PRI-CAT is freely available at http://www.ab.wur.nl/pricat. No login is required.


Subject(s)
Arabidopsis/genetics , Chromatin Immunoprecipitation/methods , Plant Proteins/metabolism , Software , Transcription Factors/metabolism , Binding Sites , Computer Graphics , DNA-Binding Proteins/metabolism , High-Throughput Nucleotide Sequencing , Internet , Promoter Regions, Genetic
10.
Plant Physiol ; 155(1): 271-81, 2011 Jan.
Article in English | MEDLINE | ID: mdl-21098674

ABSTRACT

Although Arabidopsis (Arabidopsis thaliana) is the best studied plant species, the biological role of one-third of its proteins is still unknown. We developed a probabilistic protein function prediction method that integrates information from sequences, protein-protein interactions, and gene expression. The method was applied to proteins from Arabidopsis. Evaluation of prediction performance showed that our method has improved performance compared with single source-based prediction approaches and two existing integration approaches. An innovative feature of our method is that it enables transfer of functional information between proteins that are not directly associated with each other. We provide novel function predictions for 5,807 proteins. Recent experimental studies confirmed several of the predictions. We highlight these in detail for proteins predicted to be involved in flowering and floral organ development.


Subject(s)
Arabidopsis Proteins/genetics , Arabidopsis Proteins/metabolism , Arabidopsis/genetics , Computational Biology/methods , Databases, Genetic , Genome, Plant/genetics , Animals , Area Under Curve , Bayes Theorem , Flowers/embryology , Flowers/genetics , Markov Chains , Models, Genetic , Molecular Sequence Annotation , Organogenesis/genetics , Reproducibility of Results
11.
BMC Bioinformatics ; 12: 444, 2011 Nov 14.
Article in English | MEDLINE | ID: mdl-22082126

ABSTRACT

BACKGROUND: In addition to sequence conservation, protein multiple sequence alignments contain evolutionary signal in the form of correlated variation among amino acid positions. This signal indicates positions in the sequence that influence each other, and can be applied for the prediction of intra- or intermolecular contacts. Although various approaches exist for the detection of such correlated mutations, in general these methods utilize only pairwise correlations. Hence, they tend to conflate direct and indirect dependencies. RESULTS: We propose RMRCM, a method for Regularized Multinomial Regression in order to obtain Correlated Mutations from protein multiple sequence alignments. Importantly, our method is not restricted to pairwise (column-column) comparisons only, but takes into account the network nature of relationships between protein residues in order to predict residue-residue contacts. The use of regularization ensures that the number of predicted links between columns in the multiple sequence alignment remains limited, preventing overprediction. Using simulated datasets we analyzed the performance of our approach in predicting residue-residue contacts, and studied how it is influenced by various types of noise. For various biological datasets, validation with protein structure data indicates a good performance of the proposed algorithm for the prediction of residue-residue contacts, in comparison to previous results. RMRCM can also be applied to predict interactions (in addition to only predicting interaction sites or contact sites), as demonstrated by predicting PDZ-peptide interactions. CONCLUSIONS: A novel method is presented, which uses regularized multinomial regression in order to obtain correlated mutations from protein multiple sequence alignments. AVAILABILITY: R-code of our implementation is available via http://www.ab.wur.nl/rmrcm.
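
To make the contrast in this abstract concrete, the sketch below computes a purely pairwise (column-column) covariation score, here mutual information, between two alignment columns. It illustrates the kind of pairwise baseline the abstract says conflates direct and indirect dependencies; it is not RMRCM. The choice of mutual information, the toy alignment and the function names are assumptions made for illustration.

```python
import math
from collections import Counter


def column(alignment, j):
    """Return column j of an alignment given as equal-length strings."""
    return [sequence[j] for sequence in alignment]


def mutual_information(col_a, col_b):
    """Mutual information (in bits) between two alignment columns."""
    n = len(col_a)
    freq_a = Counter(col_a)
    freq_b = Counter(col_b)
    freq_ab = Counter(zip(col_a, col_b))
    mi = 0.0
    for (a, b), count in freq_ab.items():
        p_joint = count / n
        mi += p_joint * math.log2(p_joint / ((freq_a[a] / n) * (freq_b[b] / n)))
    return mi


# Hypothetical toy alignment: columns 0 and 1 vary together, column 2 is conserved.
alignment = ["AKL", "AKL", "GEL", "GEL"]

mi_covarying = mutual_information(column(alignment, 0), column(alignment, 1))
mi_conserved = mutual_information(column(alignment, 0), column(alignment, 2))
```

A score of this kind considers two columns at a time, so a chain of dependencies A-B and B-C also raises the score for A-C. That limitation is what motivates the joint, regularized model proposed in the abstract.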


Subject(s)
Algorithms , Mutation , Regression Analysis , Amino Acid Sequence , Arabidopsis/metabolism , Arabidopsis Proteins/chemistry , Arabidopsis Proteins/genetics , Arabidopsis Proteins/metabolism , Conserved Sequence , MADS Domain Proteins/chemistry , MADS Domain Proteins/genetics , MADS Domain Proteins/metabolism , Models, Molecular , Protein Interaction Maps , Sequence Analysis, Protein
12.
Trends Genet ; 24(11): 539-51, 2008 Nov.
Article in English | MEDLINE | ID: mdl-18819722

ABSTRACT

Orthology is a key evolutionary concept in many areas of genomic research. It provides a framework for subjects as diverse as the evolution of genomes, gene functions, cellular networks and functional genome annotation. Although orthologous proteins usually perform equivalent functions in different species, establishing true orthologous relationships requires a phylogenetic approach, which combines both trees and graphs (networks) using reliable species phylogeny and available genomic data from more than two species, and an insight into the processes of molecular evolution. Here, we evaluate the available bioinformatics tools and provide a set of guidelines to aid researchers in choosing the most appropriate tool for any situation.


Subject(s)
Evolution, Molecular , Genomics/methods , Phylogeny , Sequence Homology , Animals , Databases, Genetic , Genome , Humans , Proteins/chemistry
13.
BMC Plant Biol ; 11(1): 82, 2011 May 16.
Article in English | MEDLINE | ID: mdl-21575182

ABSTRACT

BACKGROUND: Large-scale analyses of genomics and transcriptomics data have revealed that alternative splicing (AS) substantially increases the complexity of the transcriptome in higher eukaryotes. However, the extent to which this complexity is reflected at the level of the proteome remains unclear. On the basis of a lack of conservation of AS between species, we previously concluded that AS does not frequently serve as a mechanism that enables the production of multiple functional proteins from a single gene. Following this conclusion, we hypothesized that the extent to which AS events contribute to the proteome diversity in Arabidopsis thaliana would be lower than expected on the basis of transcriptomics data. Here, we test this hypothesis by analyzing two large-scale proteomics datasets from Arabidopsis thaliana. RESULTS: A total of only 60 AS events could be confirmed using the proteomics data. However, for about 60% of the loci that, based on transcriptomics data, were predicted to produce multiple protein isoforms through AS, no isoform-specific peptides were found. We therefore performed in silico AS detection experiments to assess how well AS events were represented in the experimental datasets. The results of these in silico experiments indicated that the low number of confirmed AS events was the consequence of a limited sampling depth rather than in vivo under-representation of AS events in these datasets. CONCLUSION: Although the impact of AS on the functional properties of the proteome remains to be uncovered, the results of this study indicate that AS-induced diversity at the transcriptome level is also expressed at the proteome level.


Subject(s)
Arabidopsis Proteins/genetics , Arabidopsis/genetics , Proteome/genetics , Alternative Splicing , Arabidopsis/metabolism , Arabidopsis Proteins/chemistry , Arabidopsis Proteins/metabolism , Comparative Genomic Hybridization , DNA, Plant , Gene Expression Profiling , Gene Expression Regulation, Plant , Genes, Plant , Genome, Plant , Genomics , Polymorphism, Genetic , Protein Isoforms , Proteome/metabolism , Proteomics/methods
14.
PLoS Comput Biol ; 6(11): e1001017, 2010 Nov 24.
Article in English | MEDLINE | ID: mdl-21124869

ABSTRACT

Protein sequences encompass tertiary structures and contain information about specific molecular interactions, which in turn determine biological functions of proteins. Knowledge about how protein sequences define interaction specificity is largely missing, in particular for paralogous protein families with high sequence similarity, such as the plant MADS domain transcription factor family. In comparison to the situation in mammalian species, this important family of transcription regulators has expanded enormously in plant species and contains over 100 members in the model plant species Arabidopsis thaliana. Here, we provide insight into the mechanisms that determine protein-protein interaction specificity for the Arabidopsis MADS domain transcription factor family, using an integrated computational and experimental approach. Plant MADS proteins have highly similar amino acid sequences, but their dimerization patterns vary substantially. Our computational analysis uncovered small sequence regions that explain observed differences in dimerization patterns with reasonable accuracy. Furthermore, we show the usefulness of the method for prediction of MADS domain transcription factor interaction networks in other plant species. Introduction of mutations in the predicted interaction motifs demonstrated that single amino acid mutations can have a large effect and lead to loss or gain of specific interactions. In addition, various performed bioinformatics analyses shed light on the way evolution has shaped MADS domain transcription factor interaction specificity. Identified protein-protein interaction motifs appeared to be strongly conserved among orthologs, indicating their evolutionary importance. We also provide evidence that mutations in these motifs can be a source for sub- or neo-functionalization. The analyses presented here take us a step forward in understanding protein-protein interactions and the interplay between protein sequences and network evolution.


Subject(s)
Amino Acid Motifs , MADS Domain Proteins/chemistry , Protein Interaction Domains and Motifs , Protein Interaction Mapping/methods , Amino Acid Sequence , Arabidopsis Proteins/chemistry , Arabidopsis Proteins/genetics , Arabidopsis Proteins/metabolism , Databases, Protein , Evolution, Molecular , MADS Domain Proteins/genetics , MADS Domain Proteins/metabolism , Models, Molecular , Models, Statistical , Molecular Sequence Data , Mutation , Reproducibility of Results , Sequence Alignment
15.
Plant J ; 58(5): 857-69, 2009 Jun.
Article in English | MEDLINE | ID: mdl-19207213

ABSTRACT

We studied the physical and genetic organization of chromosome 6 of tomato (Solanum lycopersicum) cv. Heinz 1706 by combining bacterial artificial chromosome (BAC) sequence analysis, high-information-content fingerprinting, genetic analysis, and BAC-fluorescent in situ hybridization (FISH) mapping data. The chromosome positions of 81 anchored seed and extension BACs corresponded in most cases with the linear marker order on the high-density EXPEN 2000 linkage map. We assembled 25 BAC contigs and eight singleton BACs spanning 2.0 Mb of the short-arm euchromatin, 1.8 Mb of the pericentromeric heterochromatin and 6.9 Mb of the long-arm euchromatin. Sequence data were combined with their corresponding genetic and pachytene chromosome positions into an integrated map that covers approximately a third of the chromosome 6 euchromatin and a small part of the pericentromeric heterochromatin. We then compared physical length (Mb), genetic (cM) and chromosome distances (µm) for determining gap sizes between contigs, revealing relative hot and cold spots of recombination. Through sequence annotation we identified several clusters of functionally related genes and an uneven distribution of both gene and repeat sequences between heterochromatin and euchromatin domains. Although a greater number of the non-transposon genes were located in the euchromatin, the highly repetitive (22.4%) pericentromeric heterochromatin displayed an unexpectedly high gene content of one gene per 36.7 kb. Surprisingly, the short-arm euchromatin was relatively rich in repeats as well, with a repeat content of 13.4%, yet the ratio of Ty3/Gypsy and Ty1/Copia retrotransposable elements across the chromosome clearly distinguished euchromatin (2:3) from heterochromatin (3:2).


Subject(s)
Chromosomes, Plant/genetics , Genes, Plant , Retroelements , Solanum lycopersicum/genetics , Chromosome Walking , Chromosomes, Artificial, Bacterial , Contig Mapping , DNA Fingerprinting , DNA, Plant/genetics , Euchromatin , Heterochromatin , In Situ Hybridization, Fluorescence , Sequence Analysis, DNA
16.
BMC Genomics ; 11: 607, 2010 Oct 28.
Article in English | MEDLINE | ID: mdl-20979667

ABSTRACT

BACKGROUND: Plant MADS domain proteins are involved in a variety of developmental processes for which their ability to form various interactions is a key requisite. However, not much is known about the structure of these proteins or their complexes, whereas such knowledge would be valuable for a better understanding of their function. Here, we analyze those proteins and the complexes they form using a correlated mutation approach in combination with available structural, bioinformatics and experimental data. RESULTS: Correlated mutations are affected by several types of noise, which is difficult to disentangle from the real signal. In our analysis of the MADS domain proteins, we apply for the first time a correlated mutation analysis to a family of interacting proteins. This provides a unique way to investigate the amount of signal that is present in correlated mutations because it allows direct comparison of mutations in various family members and assessing their conservation. We show that correlated mutations in general are conserved within the various family members, and if not, the variability at the respective positions is less in the proteins in which the correlated mutation does not occur. Also, intermolecular correlated mutation signals for interacting pairs of proteins display clear overlap with other bioinformatics data, which is not the case for non-interacting protein pairs, an observation which validates the intermolecular correlated mutations. Having validated the correlated mutation results, we apply them to infer the structural organization of the MADS domain proteins. CONCLUSION: Our analysis enables understanding of the structural organization of the MADS domain proteins, including support for predicted helices based on correlated mutation patterns, and evidence for a specific interaction site in those proteins.


Subject(s)
Conserved Sequence/genetics , MADS Domain Proteins/genetics , Mutation/genetics , Plant Proteins/genetics , Base Sequence , DNA Mutational Analysis , MADS Domain Proteins/chemistry , Molecular Sequence Data , Plant Proteins/chemistry , Polymorphism, Single Nucleotide/genetics , Protein Binding , Protein Structure, Secondary , Protein Structure, Tertiary , Reproducibility of Results
17.
Genes (Basel) ; 11(11)2020 10 27.
Article in English | MEDLINE | ID: mdl-33120976

ABSTRACT

The current rate at which new DNA and protein sequences are being generated is too fast to experimentally discover the functions of those sequences, emphasizing the need for accurate Automatic Function Prediction (AFP) methods. AFP has been an active and growing research field for decades and has made considerable progress in that time. However, it is certainly not solved. In this paper, we describe challenges that the AFP field still has to overcome in the future to increase its applicability. The challenges we consider are how to: (1) include condition-specific functional annotation, (2) predict functions for non-model species, (3) include new informative data sources, (4) deal with the biases of Gene Ontology (GO) annotations, and (5) maximally exploit the GO to obtain performance gains. We also provide recommendations for addressing those challenges, by adapting (1) the way we represent proteins and genes, (2) the way we represent gene functions, and (3) the algorithms that perform the prediction from gene to function. Together, we show that AFP is still a vibrant research area that can benefit from continuing advances in machine learning with which AFP in the 2020s can again take a large step forward reinforcing the power of computational biology.


Subject(s)
Computational Biology/methods , Gene Ontology , Molecular Sequence Annotation/methods , Proteins/metabolism , Algorithms , Amino Acid Sequence/genetics , Electronic Data Processing , Machine Learning , Models, Biological , Proteins/genetics
18.
Plant J ; 56(4): 627-37, 2008 Nov.
Article in English | MEDLINE | ID: mdl-18643986

ABSTRACT

Within the framework of the International Solanaceae Genome Project, the genome of tomato (Solanum lycopersicum) is currently being sequenced. We follow a 'BAC-by-BAC' approach that aims to deliver high-quality sequences of the euchromatin part of the tomato genome. BACs are selected from various libraries of the tomato genome on the basis of markers from the F2.2000 linkage map. Prior to sequencing, we validated the precise physical location of the selected BACs on the chromosomes by five-colour high-resolution fluorescent in situ hybridization (FISH) mapping. This paper describes the strategies and results of cytogenetic mapping for chromosome 6 using 75 seed BACs for FISH on pachytene complements. The cytogenetic map obtained showed discrepancies between the actual chromosomal positions of these BACs and their markers on the linkage group. These discrepancies were most notable in the pericentromere heterochromatin, thus confirming previously described suppression of cross-over recombination in that region. In a so-called pooled-BAC FISH, we hybridized all seed BACs simultaneously and found a few large gaps in the euchromatin parts of the long arm that are still devoid of seed BACs and are too large for coverage by expanding BAC contigs. Combining FISH with pooled BACs and newly recruited seed BACs will thus aid in efficient targeting of novel seed BACs into these areas. Finally, we established the occurrence of repetitive DNA in heterochromatin/euchromatin borders by combining BAC FISH with hybridization of a labelled repetitive DNA fraction (Cot-100). This strategy provides an excellent means to establish the borders between euchromatin and heterochromatin in this chromosome.


Subject(s)
Chromosome Mapping/methods , Chromosomes, Artificial, Bacterial , Chromosomes, Plant , In Situ Hybridization, Fluorescence/methods , Solanum lycopersicum/genetics , DNA, Plant/genetics , Euchromatin , Genetic Markers , Genome, Plant , Heterochromatin , Repetitive Sequences, Nucleic Acid
19.
BMC Genomics ; 10: 204, 2009 Apr 30.
Article in English | MEDLINE | ID: mdl-19405940

ABSTRACT

BACKGROUND: MicroRNAs (miRNAs), short approximately 21-nucleotide RNA molecules, play an important role in post-transcriptional regulation of gene expression. The number of known miRNA hairpins registered in the miRBase database is rapidly increasing, but recent reports suggest that many miRNAs with restricted temporal or tissue-specific expression remain undiscovered. Various strategies for in silico miRNA identification have been proposed to facilitate miRNA discovery. Notably, support vector machine (SVM) methods have recently gained popularity. However, a drawback of these methods is that they do not provide insight into the biological properties of miRNA sequences. RESULTS: We here propose a new strategy for miRNA hairpin prediction in which the likelihood that a genomic hairpin is a true miRNA hairpin is evaluated based on statistical distributions of observed biological variation of properties (descriptors) of known miRNA hairpins. These distributions are transformed into a single and continuous outcome classifier called the L score. Using a dataset of known miRNA hairpins from the miRBase database and an exhaustive set of genomic hairpins identified in the genome of Caenorhabditis elegans, a subset of 18 most informative descriptors was selected after detailed analysis of correlation among and discriminative power of individual descriptors. We show that the majority of previously identified miRNA hairpins have high L scores, that the method outperforms miRNA prediction by threshold filtering and that it is more transparent than SVM classifiers. CONCLUSION: The L score is applicable as a prediction classifier with high sensitivity for novel miRNA hairpins. The L-score approach can be used to rank and select interesting miRNA hairpin candidates for downstream experimental analysis when coupled to a genome-wide set of in silico-identified hairpins or to facilitate the analysis of large sets of putative miRNA hairpin loci obtained in deep-sequencing efforts of small RNAs. Moreover, the in-depth analyses of miRNA hairpin descriptors preceding and determining the L score outcome could be used as an extension to miRBase entries to help increase the reliability and biological relevance of the miRNA registry.
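To make the idea of a distribution-based score concrete, here is a minimal sketch that fits one Gaussian per descriptor to known hairpins and sums the log-densities for a candidate. This is not the published L score: the abstract does not give its formula. The Gaussian model, the independence between descriptors, the descriptor names (`length`, `gc`) and all values are assumptions made up for illustration.

```python
import math
from statistics import mean, stdev


def fit_descriptors(known_hairpins):
    """Estimate (mean, standard deviation) for each descriptor.

    known_hairpins is a list of dicts mapping descriptor name -> value.
    """
    names = list(known_hairpins[0].keys())
    params = {}
    for name in names:
        values = [hairpin[name] for hairpin in known_hairpins]
        params[name] = (mean(values), stdev(values))
    return params


def log_gaussian(x, mu, sigma):
    """Log-density of a normal distribution with mean mu and std dev sigma."""
    return -0.5 * math.log(2 * math.pi * sigma ** 2) - (x - mu) ** 2 / (2 * sigma ** 2)


def illustrative_score(candidate, params):
    """Sum of per-descriptor log-densities (assumes independent descriptors)."""
    return sum(log_gaussian(candidate[name], mu, sigma)
               for name, (mu, sigma) in params.items())


# Hypothetical descriptor values, not taken from miRBase.
known = [
    {"length": 80, "gc": 0.40},
    {"length": 90, "gc": 0.45},
    {"length": 100, "gc": 0.50},
]
params = fit_descriptors(known)
typical = illustrative_score({"length": 90, "gc": 0.45}, params)
atypical = illustrative_score({"length": 200, "gc": 0.90}, params)
```

A candidate whose descriptors fall near the centre of the fitted distributions receives a higher score than one far from them. Because each descriptor contributes its own term, the result can be inspected term by term, which is the kind of transparency the abstract contrasts with SVM classifiers.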


Subject(s)
Computational Biology/methods , MicroRNAs/genetics , Nucleic Acid Conformation , Sequence Analysis, RNA/methods , Animals , Caenorhabditis elegans/genetics , Genome, Helminth , Likelihood Functions , Models, Genetic , Sensitivity and Specificity
20.
BMC Genomics ; 10: 154, 2009 Apr 09.
Article in English | MEDLINE | ID: mdl-19358722

ABSTRACT

BACKGROUND: Alternative splicing (AS) is a widespread phenomenon in higher eukaryotes but the extent to which it leads to functional protein isoforms and to proteome expansion at large is still a matter of debate. In contrast to animal species, for which AS has been studied extensively at the protein and functional level, protein-centered studies of AS in plant species are scarce. Here we investigate the functional impact of AS in dicot and monocot plant species using a comparative approach. RESULTS: Detailed comparison of AS events in alternatively spliced orthologs from the dicot Arabidopsis thaliana and the monocot Oryza sativa (rice) revealed that the vast majority of AS events in both species do not result from functional conservation. Transcript isoforms that are putative targets for the nonsense-mediated decay (NMD) pathway are as likely to contain conserved AS events as isoforms that are translated into proteins. Similar results were obtained when the same comparison was performed between the two more closely related monocot species rice and Zea mays (maize). Genome-wide computational analysis of functional protein domains encoded in alternatively and constitutively spliced genes revealed that only the RNA recognition motif (RRM) is overrepresented in alternatively spliced genes in all species analyzed. In contrast, three domain types were overrepresented in constitutively spliced genes. AS events were found to be less frequent within than outside predicted protein domains and no domain type was found to be enriched with AS introns. Analysis of AS events that result in the removal of complete protein domains revealed that only a small number of domain types is spliced out in all species analyzed. Finally, in a substantial fraction of cases where a domain is completely removed, this domain appeared to be a unit of a tandem repeat. CONCLUSION: The results from the ortholog comparisons suggest that the ability of a gene to produce more than one functional protein through AS does not persist during evolution. Cross-species comparison of the results of the protein-domain oriented analyses indicates little correspondence between the analyzed species. Based on the premise that functional genetic features are most likely to be conserved during evolution, we conclude that AS has only a limited role in functional expansion of the proteome in plants.


Subject(s)
Alternative Splicing , Arabidopsis/genetics , Oryza/genetics , Proteome/genetics , Comparative Genomic Hybridization , DNA, Plant/genetics , Evolution, Molecular , Gene Expression Regulation, Plant , Genome, Plant , Plant Proteins/genetics , Polymorphism, Genetic , Protein Isoforms/genetics , Zea mays/genetics