Results 1 - 9 of 9
1.
PLoS Comput Biol ; 20(5): e1012061, 2024 May.
Article in English | MEDLINE | ID: mdl-38701099

ABSTRACT

Optimizing proteins for particular traits holds great promise for industrial and pharmaceutical purposes. Machine learning is increasingly applied in this field to predict properties of proteins, thereby guiding the experimental optimization process. A natural question is: How much progress are we making with such predictions, and how important is the choice of regressor and representation? In this paper, we demonstrate that different assessment criteria for regressor performance can lead to dramatically different conclusions, depending on the choice of metric and on how one defines generalization. We highlight the fundamental issues of sample bias in typical regression scenarios and how this can lead to misleading conclusions about regressor performance. Finally, we make the case for the importance of calibrated uncertainty in this domain.


Subject(s)
Computational Biology , Machine Learning , Protein Engineering , Protein Engineering/methods , Regression Analysis , Computational Biology/methods , Proteins/chemistry , Algorithms
2.
Bioinformatics ; 38(4): 941-946, 2022 01 27.
Article in English | MEDLINE | ID: mdl-35088833

ABSTRACT

MOTIVATION: Solubility and expression levels of proteins can be a limiting factor for large-scale studies and industrial production. By determining the solubility and expression directly from the protein sequence, the success rate of wet-lab experiments can be increased. RESULTS: In this study, we focus on predicting the solubility and usability for purification of proteins expressed in Escherichia coli directly from the sequence. Our model NetSolP is based on deep learning protein language models called transformers and we show that it achieves state-of-the-art performance and improves extrapolation across datasets. As we find current methods are built on biased datasets, we curate existing datasets by using strict sequence-identity partitioning and ensure that there is minimal bias in the sequences. AVAILABILITY AND IMPLEMENTATION: The predictor and data are available at https://services.healthtech.dtu.dk/service.php?NetSolP and the open-sourced code is available at https://github.com/tvinet/NetSolP-1.0. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.


Subject(s)
Escherichia coli , Language , Proteins , Software , Solubility
3.
Sci Rep ; 12(1): 17882, 2022 10 25.
Article in English | MEDLINE | ID: mdl-36284144

ABSTRACT

The mining of genomes from non-cultivated microorganisms using metagenomics is a powerful tool to discover novel proteins and other valuable biomolecules. However, function-based metagenome searches are often limited by the time-consuming expression of the active proteins in various heterologous host systems. We here report the initial characterization of a novel single-subunit bacteriophage RNA polymerase, EM1 RNAP, identified from a metagenome data set obtained from an elephant dung microbiome. EM1 RNAP and its promoter sequence are distantly related to T7 RNA polymerase. Using EM1 RNAP and a translation-competent Escherichia coli extract, we have developed an efficient medium-throughput pipeline and protocol that allow the expression of metagenome-derived genes and the production of proteins in a cell-free system at levels sufficient for the initial testing of the predicted activities. Here, we have successfully identified and verified 12 enzymes acting on bis(2-hydroxyethyl) terephthalate (BHET) in a completely clone-free approach and proposed an in vitro high-throughput metagenomic screening method.


Subject(s)
Metagenome , Viral Replicase Complex Proteins , Cell-Free System/metabolism , Viral RNA/metabolism , DNA-Directed RNA Polymerases/genetics , DNA-Directed RNA Polymerases/metabolism , Metagenomics/methods , Escherichia coli/genetics , Escherichia coli/metabolism
4.
RNA ; 15(11): 2028-34, 2009 Nov.
Article in English | MEDLINE | ID: mdl-19745027

ABSTRACT

Recently, next-generation sequencing has been introduced as a promising, new platform for assessing the copy number of transcripts, while the existing microarray technology is considered less reliable for absolute, quantitative expression measurements. Nonetheless, so far, results from the two technologies have only been compared based on biological data, leading to the conclusion that, although they are somewhat correlated, expression values differ significantly. Here, we use synthetic RNA samples, resembling human microRNA samples, to find that microarray expression measures actually correlate better with sample RNA content than expression measures obtained from sequencing data. In addition, microarrays appear highly sensitive and perform equivalently to next-generation sequencing in terms of reproducibility and relative ratio quantification.


Subject(s)
Gene Expression , MicroRNAs/analysis , Oligonucleotide Array Sequence Analysis/methods , RNA Sequence Analysis/methods , MicroRNAs/chemical synthesis , MicroRNAs/genetics , Reproducibility of Results
5.
Methods ; 50(4): S6-9, 2010 Apr.
Article in English | MEDLINE | ID: mdl-20215018

ABSTRACT

MicroRNAs are small regulatory RNAs that are currently emerging as new biomarkers for cancer and other diseases. In order for biomarkers to be useful in clinical settings, they should be accurately and reliably detected in clinical samples such as formalin-fixed paraffin-embedded (FFPE) sections and blood serum or plasma. These types of samples represent a challenge in terms of microRNA quantification. A newly developed method for microRNA qPCR using Locked Nucleic Acid (LNA)-enhanced primers enables accurate and reproducible quantification of microRNAs in scarce clinical samples. Here we show that LNA-based microRNA qPCR enables biomarker screening using very low amounts of total RNA from FFPE samples, and we compare the results to microarray analysis data. We also present evidence that the addition of a small carrier RNA prior to total RNA extraction improves microRNA quantification in blood plasma and laser capture microdissected (LCM) sections of FFPE samples.


Subject(s)
MicroRNAs/analysis , Polymerase Chain Reaction/methods , Fixatives , Formaldehyde , Humans , Lasers , MicroRNAs/blood , MicroRNAs/genetics , Oligonucleotide Array Sequence Analysis/methods , Paraffin Embedding
6.
Comput Biol Chem ; 95: 107596, 2021 Dec.
Article in English | MEDLINE | ID: mdl-34775287

ABSTRACT

A crucial process in the production of industrial enzymes is recombinant gene expression, which aims to induce enzyme overexpression of the genes in a host microbe. Current approaches for securing overexpression rely on molecular tools such as adjusting the recombinant expression vector, adjusting cultivation conditions, or performing codon optimizations. However, such strategies are time-consuming, and an alternative strategy would be to select genes for better compatibility with the recombinant host. Several methods for predicting soluble expression are available; however, they are all optimized for the expression host Escherichia coli and do not consider the possibility of an expressed protein not being soluble. We show that these tools are not suited for predicting expression potential in the industrially important host Bacillus subtilis. Instead, we build a B. subtilis-specific machine learning model for expressibility prediction. Given millions of unlabeled proteins and a small labeled dataset, we can successfully train such a predictive model. The unlabeled proteins provide a performance boost relative to using amino acid frequencies of the labeled proteins as input. On average, we obtain a modest performance: an area under the curve (AUC) of 0.64 and a Matthews correlation coefficient (MCC) of 0.2. However, we find that this is sufficient for the prioritization of expression candidates for high-throughput studies. Moreover, the predicted class probabilities are correlated with expression levels. A number of features related to protein expression, including base frequencies and solubility, are captured by the model.


Subject(s)
Bacillus subtilis/genetics , Bacterial Proteins/genetics , Machine Learning , Gene Expression Regulation , Recombinant Proteins/genetics
7.
BMC Bioinformatics ; 7: 501, 2006 Nov 14.
Article in English | MEDLINE | ID: mdl-17105666

ABSTRACT

BACKGROUND: Modelling the interaction between potentially antigenic peptides and Major Histocompatibility Complex (MHC) molecules is a key step in identifying potential T-cell epitopes. For Class II MHC alleles, the binding groove is open at both ends, causing ambiguity in the positional alignment between the groove and peptide, as well as creating uncertainty as to what parts of the peptide interact with the MHC. Moreover, the antigenic peptides have variable lengths, making naive modelling methods difficult to apply. This paper introduces a kernel method that can handle variable length peptides effectively by quantifying similarities between peptide sequences and integrating these into the kernel. RESULTS: The kernel approach presented here shows increased prediction accuracy with a significantly higher number of true positives and negatives on multiple MHC class II alleles, when testing data sets from MHCPEP [1], MHCBN [2], and MHCBench [3]. Evaluation by cross validation, when segregating binders and non-binders, produced an average of 0.824 AROC for the MHCBench data sets (up from 0.756), and an average of 0.96 AROC for multiple alleles of the MHCPEP database. CONCLUSION: The method improves performance over existing state-of-the-art methods of MHC class II peptide binding predictions by using a custom, knowledge-based representation of peptides. Similarity scores, in contrast to a fixed-length, pocket-specific representation of amino acids, provide a flexible and powerful way of modelling MHC binding, and can easily be applied to other dynamic sequence problems.


Subject(s)
Computational Biology , Epitope Mapping , Histocompatibility Antigens Class II/metabolism , Peptides/metabolism , Binding Sites , Genetic Databases , HLA-A Antigens/chemistry , HLA-A Antigens/metabolism , HLA-DR Antigens/chemistry , HLA-DR Antigens/metabolism , HLA-DRB1 Chains , Histocompatibility Antigens Class II/chemistry , Humans , Peptides/chemistry , Protein Binding , Protein Conformation , ROC Curve , Reproducibility of Results , Sequence Alignment , Protein Sequence Analysis , Amino Acid Sequence Homology
8.
PLoS One ; 9(9): e106707, 2014.
Article in English | MEDLINE | ID: mdl-25208077

ABSTRACT

A phylogenetic and metagenomic study of elephant feces samples (derived from a three-week-old and a six-year-old Asian elephant) was conducted in order to describe the microbiota inhabiting this large land-living animal. The microbial diversity was examined via 16S rRNA gene analysis. We generated more than 44,000 GS-FLX+454 reads for each animal. For the baby elephant, 380 operational taxonomic units (OTUs) were identified at 97% sequence identity level; in the six-year-old animal, close to 3,000 OTUs were identified, suggesting high microbial diversity in the older animal. In both animals, most OTUs belonged to Bacteroidetes and Firmicutes. Additionally, for the baby elephant, a high number of Proteobacteria was detected. A metagenomic sequencing approach using Illumina technology resulted in the generation of 1.1 Gbp assembled DNA in contigs with a maximum size of 0.6 Mbp. A KEGG pathway analysis suggested high metabolic diversity regarding the use of polymers and aromatic and non-aromatic compounds. In line with the high phylogenetic diversity, a surprising and not previously described biodiversity of glycoside hydrolase (GH) genes was found. Enzymes of 84 GH families were detected. Polysaccharide utilization loci (PULs), which are found in Bacteroidetes, were highly abundant in the dataset; some of these comprised cellulase genes. Furthermore, the highest coverage for GH5 and GH9 family enzymes was detected for Bacteroidetes, suggesting that bacteria of this phylum are mainly responsible for the degradation of cellulose in the Asian elephant. Altogether, this study delivers insight into the biomass conversion by one of the largest plant-fed and land-living animals.


Subject(s)
Breast Feeding , Elephants/microbiology , Feces/microbiology , Glycoside Hydrolases/metabolism , Metagenomics , Microbiota , Plants , Animals , Biomass , Data Collection , Female , Glycoside Hydrolases/genetics , Male , Phylogeny
9.
Expert Opin Drug Discov ; 2(1): 19-35, 2007 Jan.
Article in English | MEDLINE | ID: mdl-23496035

ABSTRACT

Over time, functional immunology has accumulated vast amounts of quantitative and qualitative data relevant to the design and discovery of vaccines. Such data includes, but is not limited to, components of the host and pathogen genome (including antigens and virulence factors), T- and B-cell epitopes and other components of the antigen presentation pathway, and allergens. In this review, the authors discuss a range of databases that archive such data. Built on such information, increasingly sophisticated data mining techniques have been developed that create predictive models of utilitarian value. With special reference to epitope data, the authors discuss the strengths and weaknesses of the available techniques and how they can help computer-aided vaccine design deliver added value for vaccinology.
