1.
NAR Genom Bioinform ; 5(2): lqad031, 2023 Jun.
Article in English | MEDLINE | ID: mdl-37035540

ABSTRACT

Cleavage and polyadenylation (CPA) sites define eukaryotic gene ends. CPA sites are associated with five key sequence recognition elements: the upstream UGUA, the polyadenylation signal (PAS), and U-rich sequences; the CA/UA dinucleotide where cleavage occurs; and GU-rich downstream elements (DSEs). Currently, it is not clear whether these sequences are sufficient to delineate CPA sites. Additionally, numerous other sequences and factors have been described, often in the context of promoting alternative CPA sites and preventing cryptic CPA site usage. Here, we dissect the contributions of individual sequence features to CPA using standard discriminative models. We show that models comprised only of the five primary CPA sequence features give highest probability scores to constitutive CPA sites at the ends of coding genes, relative to the entire pre-mRNA sequence, for 59% of all human genes. U1-hybridizing sequences provide a small boost in performance. The addition of all known RBP RNA binding motifs to the model increases this figure to only 61%, suggesting that additional factors beyond the core CPA machinery have a minimal role in delineating real from cryptic sites. To our knowledge, this high effectiveness of established features to predict human gene ends has not previously been documented.
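
The headline figure in this abstract is the share of genes for which the annotated CPA site receives the highest model score along the whole pre-mRNA. A minimal sketch of that metric follows. It is an illustration only, not the authors' code: the function name `top_site_fraction`, the per-position score lists, and the toy numbers are all hypothetical.

```python
from typing import Dict, List, Tuple


def top_site_fraction(genes: Dict[str, Tuple[List[float], int]]) -> float:
    """Fraction of genes whose annotated site is the top-scoring position.

    `genes` maps a gene name to (scores, true_index), where `scores` holds
    one model probability per pre-mRNA position (hypothetical input format)
    and `true_index` is the position of the annotated CPA site.
    """
    if not genes:
        raise ValueError("no genes supplied")
    hits = 0
    for scores, true_index in genes.values():
        # Position with the highest score; ties resolve to the first maximum.
        best = max(range(len(scores)), key=scores.__getitem__)
        if best == true_index:
            hits += 1
    return hits / len(genes)


# Toy data, invented for illustration only.
toy = {
    "geneA": ([0.1, 0.2, 0.9], 2),  # annotated site scores highest
    "geneB": ([0.8, 0.3, 0.5], 2),  # a cryptic site outscores it
}
print(top_site_fraction(toy))
```

Under this reading, a gene counts as a success only when no other position in its pre-mRNA outscores the annotated site, which is a stricter criterion than a per-site classification accuracy.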

2.
NAR Genom Bioinform ; 3(2): lqab042, 2021 Jun.
Article in English | MEDLINE | ID: mdl-34104882

ABSTRACT

Cleavage and polyadenylation (CPA) sites define eukaryotic gene ends. CPA sites are associated with five key sequence recognition elements: the upstream UGUA, the polyadenylation signal (PAS), and U-rich sequences; the CA/UA dinucleotide where cleavage occurs; and GU-rich downstream elements (DSEs). Currently, it is not clear whether these sequences are sufficient to delineate CPA sites. Additionally, numerous other sequences and factors have been described, often in the context of promoting alternative CPA sites and preventing cryptic CPA site usage. Here, we dissect the contributions of individual sequence features to CPA using standard discriminative models. We show that models comprised only of the five primary CPA sequence features give highest probability scores to constitutive CPA sites at the ends of coding genes, relative to the entire pre-mRNA sequence, for 41% of all human genes. U1-hybridizing sequences provide a small boost in performance. The addition of all known RBP RNA binding motifs to the model, however, increases this figure to 49%, and suggests an involvement of both known and suspected CPA regulators as well as potential new factors in delineating constitutive CPA sites. To our knowledge, this high effectiveness of established features to predict human gene ends has not previously been documented.

3.
Mol Syst Biol ; 16(2): e9243, 2020 Feb.
Article in English | MEDLINE | ID: mdl-32064787

ABSTRACT

Our ability to understand the genotype-to-phenotype relationship is hindered by the lack of detailed understanding of phenotypes at a single-cell level. To systematically assess cell-to-cell phenotypic variability, we combined automated yeast genetics, high-content screening and neural network-based image analysis of single cells, focussing on genes that influence the architecture of four subcellular compartments of the endocytic pathway as a model system. Our unbiased assessment of the morphology of these compartments (endocytic patch, actin patch, late endosome and vacuole) identified 17 distinct mutant phenotypes associated with ~1,600 genes (~30% of all yeast genes). Approximately half of these mutants exhibited multiple phenotypes, highlighting the extent of morphological pleiotropy. Quantitative analysis also revealed that incomplete penetrance was prevalent, with the majority of mutants exhibiting substantial variability in phenotype at the single-cell level. Our single-cell analysis enabled exploration of factors that contribute to incomplete penetrance and cellular heterogeneity, including replicative age, organelle inheritance and response to stress.


Subject(s)
Mutation; Saccharomyces cerevisiae Proteins/genetics; Saccharomyces cerevisiae/growth & development; Single-Cell Analysis/methods; Genetic Pleiotropy; Genetic Variation; Microscopy, Fluorescence; Neural Networks, Computer; Penetrance; Phenotype; Saccharomyces cerevisiae/genetics; Systems Biology; Time-Lapse Imaging

4.
Biomed Eng Online ; 16(Suppl 1): 75, 2017 Aug 18.
Article in English | MEDLINE | ID: mdl-28830426

ABSTRACT

BACKGROUND: Biology is experiencing a gradual but fast transformation from a laboratory-centred science towards a data-centred one. As such, it requires robust data engineering and the use of quantitative data analysis methods as part of database curation. This paper focuses on G protein-coupled receptors, a large and heterogeneous super-family of cell membrane proteins of interest to biology in general. One of its families, Class C, is of particular interest to pharmacology and drug design. This family is quite heterogeneous on its own, and the discrimination of its several sub-families is a challenging problem. In the absence of known crystal structure, such discrimination must rely on their primary amino acid sequences.
METHODS: We are interested not as much in achieving maximum sub-family discrimination accuracy using quantitative methods, but in exploring sequence misclassification behavior. Specifically, we are interested in isolating those sequences showing consistent misclassification, that is, sequences that are very often misclassified and almost always to the same wrong sub-family. Random forests are used for this analysis due to their ensemble nature, which makes them naturally suited to gauge the consistency of misclassification. This consistency is here defined through the voting scheme of their base tree classifiers.
RESULTS: Detailed consistency results for the random forest ensemble classification were obtained for all receptors and for all data transformations of their unaligned primary sequences. Shortlists of the most consistently misclassified receptors for each subfamily and transformation, as well as an overall shortlist including those cases that were consistently misclassified across transformations, were obtained. The latter should be referred to experts for further investigation as a data curation task.
CONCLUSION: The automatic discrimination of the Class C sub-families of G protein-coupled receptors from their unaligned primary sequences shows clear limits. This study has investigated in some detail the consistency of their misclassification using random forest ensemble classifiers. Different sub-families have been shown to display very different discrimination consistency behaviors. The individual identification of consistently misclassified sequences should provide a tool for quality control to GPCR database curators.
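
The abstract defines misclassification consistency through the votes of the forest's base trees but does not give a formula. One plausible reading, offered here as an assumption rather than the authors' definition, separates two quantities: how often a sequence is voted into any wrong sub-family, and how concentrated those wrong votes are on a single sub-family. The function name, the vote format, and the toy labels below are hypothetical.

```python
from collections import Counter
from typing import List, Optional, Tuple


def misclassification_consistency(
    votes: List[str], true_label: str
) -> Tuple[float, Optional[str], float]:
    """Summarise how the base trees voted for one sequence.

    Returns (error_rate, top_wrong_label, concentration):
    error_rate       share of trees voting for any wrong sub-family
    top_wrong_label  the wrong sub-family that received the most votes
    concentration    share of the wrong votes going to that sub-family
    """
    if not votes:
        raise ValueError("votes must not be empty")
    # Count only the votes that disagree with the database label.
    wrong = Counter(v for v in votes if v != true_label)
    n_wrong = sum(wrong.values())
    if n_wrong == 0:
        return 0.0, None, 0.0
    top_label, top_count = wrong.most_common(1)[0]
    return n_wrong / len(votes), top_label, top_count / n_wrong


# Toy votes from 10 hypothetical trees for a sequence labelled "mGlu".
votes = ["CaSR"] * 8 + ["mGlu"] + ["GABAB"]
error_rate, label, concentration = misclassification_consistency(votes, "mGlu")
print(error_rate, label, round(concentration, 3))
```

A sequence with both values close to 1 is "very often misclassified and almost always to the same wrong sub-family", which is the kind of case the abstract suggests referring to curators.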


Subject(s)
Computational Biology/methods; Databases, Protein; Receptors, G-Protein-Coupled; Machine Learning