Results 1 - 20 of 27
1.
BMC Bioinformatics ; 20(Suppl 24): 677, 2019 Dec 20.
Article in English | MEDLINE | ID: mdl-31861981

ABSTRACT

BACKGROUND: Signal peptides play an important role in protein sorting, which is the mechanism whereby proteins are transported to their destination. Recognition of signal peptides is an important first step in determining the active locations and functions of proteins. Many computational methods have been proposed to facilitate signal peptide recognition. In recent years, deep learning methods have advanced significantly in many research fields. However, most existing models for signal peptide recognition use one-hidden-layer neural networks or hidden Markov models, which are relatively simple in comparison with the deep neural networks that are used in other fields. RESULTS: This study proposes a convolutional neural network without fully connected layers, which is an important network improvement in computer vision. The proposed network is more complex than current signal peptide predictors. The experimental results show that the proposed network outperforms current signal peptide predictors on eukaryotic data. This study also demonstrates how model reduction and data augmentation help the proposed network to predict bacterial data. CONCLUSIONS: The study makes three contributions to this subject: (a) an accurate signal peptide recognizer is developed, (b) the potential to leverage advanced networks from other fields is demonstrated and (c) important modifications are proposed for adopting complex networks in signal peptide recognition.


Subject(s)
Semantics , Deep Learning , Neural Networks, Computer , Protein Sorting Signals , Software
2.
BMC Bioinformatics ; 19(1): 169, 2018 05 09.
Article in English | MEDLINE | ID: mdl-29743010

ABSTRACT

BACKGROUND: Zebrafish is a widely used model organism for studying heart development and cardiac-related pathogenesis. With the ability to survive without a functional circulation at larval stages, strong genetic similarity to mammals, prolific reproduction and optically transparent embryos, zebrafish is powerful in modeling mammalian cardiac physiology and pathology as well as in large-scale high-throughput screening. However, an economical and convenient tool for rapid evaluation of fish cardiac function is still needed. Several image analysis methods exist to assess cardiac function in zebrafish embryos/larvae, but they still require manual intervention that could be reduced. This work developed a fully automatic method to calculate heart rate, an important parameter for analyzing cardiac function, from videos. It contains several filters to identify the heart region, to reduce video noise and to calculate heart rates. RESULTS: The proposed method was evaluated with 32 zebrafish larval cardiac videos that were recorded at three days post-fertilization. The heart rate measured by the proposed method was comparable to that determined by manual counting. The experimental results show that the proposed method does not lose accuracy while largely reducing the labor cost and uncertainty of manual counting. CONCLUSIONS: With the proposed method, researchers do not have to manually select a region of interest before analyzing videos. Moreover, the filters designed to reduce video noise can alleviate background fluctuations introduced during video recording (e.g. shifting), which makes it easier to record usable videos and therefore reduces manual effort while recording.


Subject(s)
Heart Rate/physiology , Larva/physiology , Videotape Recording/methods , Zebrafish/physiology , Animals
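The abstract above does not describe its filters in detail, so the following is only a minimal sketch of the last step (turning a periodic signal into a heart rate). It assumes the heart region has already been reduced to one mean-intensity value per frame. The function name `estimate_bpm`, the frame rate and the 60-300 bpm search band are illustrative assumptions, not the authors' method.

```python
import math

def estimate_bpm(signal, fps, min_bpm=60.0, max_bpm=300.0):
    """Return the strongest frequency in [min_bpm, max_bpm], in beats per minute."""
    n = len(signal)
    mean = sum(signal) / n
    centered = [x - mean for x in signal]  # remove the constant background level
    best_bpm, best_power = None, -1.0
    # Naive discrete Fourier transform, evaluated only at bins inside the band.
    for k in range(1, n // 2 + 1):
        bpm = 60.0 * k * fps / n
        if not (min_bpm <= bpm <= max_bpm):
            continue
        re = sum(x * math.cos(2 * math.pi * k * t / n) for t, x in enumerate(centered))
        im = sum(x * math.sin(2 * math.pi * k * t / n) for t, x in enumerate(centered))
        power = re * re + im * im
        if power > best_power:
            best_bpm, best_power = bpm, power
    return best_bpm

# Synthetic 10 s clip at 30 frames/s with a 2.5 Hz beat (150 beats per minute).
fps = 30
frames = [math.sin(2 * math.pi * 2.5 * t / fps) for t in range(300)]
print(estimate_bpm(frames, fps))
```

Restricting the search to a plausible band is one simple way to ignore slow background drift; a real pipeline would also need the region detection and noise filtering that the abstract mentions.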
3.
BMC Bioinformatics ; 16 Suppl 18: S11, 2015.
Article in English | MEDLINE | ID: mdl-26680734

ABSTRACT

BACKGROUND: Next-generation sequencing (NGS) technologies have brought an unprecedented amount of genomic data for analysis. Unlike array-based profiling technologies, NGS can reveal the expression profile across a transcript at the base level. Such base-level read coverage provides further insights into alternative mRNA splicing, single-nucleotide polymorphisms (SNPs), novel transcript discovery, etc. However, to the best of our knowledge, no existing NGS viewer can visualize genome-wide base-level read coverage in a timely manner in an interactive environment. RESULTS: This study proposes an efficient visualization pipeline and implements a lightweight read coverage viewer, Light-RCV, with the proposed pipeline. Light-RCV consists of four featured designs on the path from raw NGS data to the final visualized read coverage: i) a read coverage construction algorithm, ii) multi-resolution profiles, iii) a two-stage architecture and iv) a storage format. With these designs, Light-RCV achieves a response time of < 0.5 s on any scale of genomic range, including whole chromosomes. Finally, a case study was performed to demonstrate the importance of visualizing base-level read coverage and the value of Light-RCV. CONCLUSIONS: Compared with multi-functional genome viewers such as Artemis, Savant, Tablet and Integrative Genomics Viewer (IGV), Light-RCV is designed only for visualization. Therefore, it does not provide advanced analyses. However, its backend technology provides an efficient kernel of base-level visualization that can be easily embedded in other viewers. This viewer is the first to provide timely visualization of genome-wide read coverage at the base level in an interactive environment. The software is available for free at http://lightrcv.ee.ncku.edu.tw.


Subject(s)
Algorithms , Genomics , Genome, Fungal , High-Throughput Nucleotide Sequencing , Internet , Polymorphism, Single Nucleotide , RNA Splicing , Saccharomyces cerevisiae/genetics , Sequence Analysis, DNA , User-Computer Interface
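The first two designs named in the abstract above (coverage construction and multi-resolution profiles) can be illustrated with a small sketch. This is not Light-RCV's actual algorithm or storage format, which the abstract does not specify. It assumes reads are half-open intervals `[start, end)`, builds coverage with a difference array, and precomputes coarser levels by keeping the maximum of every `factor` bins. All function names are hypothetical.

```python
def build_coverage(reads, length):
    """Per-base coverage from half-open read intervals via a difference array."""
    diff = [0] * (length + 1)
    for start, end in reads:
        diff[start] += 1   # coverage rises where a read begins
        diff[end] -= 1     # and falls just past where it ends
    coverage, depth = [], 0
    for i in range(length):
        depth += diff[i]
        coverage.append(depth)
    return coverage

def build_profiles(coverage, factor=2):
    """Successively coarser profiles; each bin keeps the max of `factor` finer bins."""
    profiles = [coverage]
    while len(profiles[-1]) > 1:
        prev = profiles[-1]
        profiles.append([max(prev[i:i + factor]) for i in range(0, len(prev), factor)])
    return profiles

def pick_profile(profiles, start, end, max_points, factor=2):
    """Choose the finest level that shows [start, end) with at most max_points values."""
    level, span = 0, end - start
    while span > max_points and level < len(profiles) - 1:
        level += 1
        span = -(-span // factor)  # ceiling division
    scale = factor ** level
    return level, profiles[level][start // scale: -(-end // scale)]

reads = [(0, 4), (2, 6), (5, 8)]
cov = build_coverage(reads, 8)
profiles = build_profiles(cov)
print(cov)
print(pick_profile(profiles, 0, 8, max_points=4))
```

Because the number of values returned is bounded by `max_points` regardless of the requested range, a query over a whole chromosome costs about as much as a query over a single gene, which is the general idea behind serving any zoom level quickly.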
4.
Nucleic Acids Res ; 40(Database issue): D472-8, 2012 Jan.
Article in English | MEDLINE | ID: mdl-22084200

ABSTRACT

This work presents the Apo-Holo DataBase (AH-DB, http://ahdb.ee.ncku.edu.tw/ and http://ahdb.csbb.ntu.edu.tw/), which provides corresponding pairs of protein structures before and after binding. Conformational transitions are commonly observed in various protein interactions that are involved in important biological functions. For example, copper-zinc superoxide dismutase (SOD1), which destroys free superoxide radicals in the body, undergoes a large conformational transition from an 'open' state (apo structure) to a 'closed' state (holo structure). Many studies have utilized collections of apo-holo structure pairs to investigate the conformational transitions and critical residues. However, the collection process is usually complicated, varies from study to study and produces a small-scale data set. AH-DB is designed to provide an easy and unified way to prepare such data, which is generated by identifying/mapping molecules in different Protein Data Bank (PDB) entries. Conformational transitions are identified based on a refined alignment scheme to overcome the challenge that many structures in the PDB database are only protein fragments and not complete proteins. There are 746,314 apo-holo pairs in AH-DB, which is about 30 times those in the second largest collection of similar data. AH-DB provides sophisticated interfaces for searching apo-holo structure pairs and exploring conformational transitions from apo structures to the corresponding holo structures.


Subject(s)
Databases, Protein , Protein Conformation , Models, Molecular , Protein Binding , Superoxide Dismutase/chemistry , Superoxide Dismutase-1 , User-Computer Interface
5.
Nucleic Acids Res ; 40(Web Server issue): W173-9, 2012 Jul.
Article in English | MEDLINE | ID: mdl-22693214

ABSTRACT

By binding to short and highly conserved DNA sequences in genomes, DNA-binding proteins initiate, enhance or repress biological processes. Accurately identifying such binding sites, often represented by position weight matrices (PWMs), is an important step in understanding the control mechanisms of cells. When given coordinates of a DNA-binding domain (DBD) bound with DNA, a potential function can be used to estimate the change of binding affinity after base substitutions, where the changes can be summarized as a PWM. This technique provides an effective alternative when the chromatin immunoprecipitation data are unavailable for PWM inference. To facilitate the procedure of predicting PWMs based on protein-DNA complexes or even structures of the unbound state, the web server, DBD2BS, is presented in this study. The DBD2BS uses an atom-level knowledge-based potential function to predict PWMs characterizing the sequences to which the query DBD structure can bind. For unbound queries, a list of 1066 DBD-DNA complexes (including 1813 protein chains) is compiled for use as templates for synthesizing bound structures. The DBD2BS provides users with an easy-to-use interface for visualizing the PWMs predicted based on different templates and the spatial relationships of the query protein, the DBDs and the DNAs. The DBD2BS is the first attempt to predict PWMs of DBDs from unbound structures rather than from bound ones. This approach increases the number of existing protein structures that can be exploited when analyzing protein-DNA interactions. In a recent study, the authors showed that the kernel adopted by the DBD2BS can generate PWMs consistent with those obtained from the experimental data. The use of DBD2BS to predict PWMs can be incorporated with sequence-based methods to discover binding sites in genome-wide studies. Available at: http://dbd2bs.csie.ntu.edu.tw/, http://dbd2bs.csbb.ntu.edu.tw/, and http://dbd2bs.ee.ncku.edu.tw.


Subject(s)
DNA-Binding Proteins/chemistry , Software , Binding Sites , Cyclic AMP Receptor Protein/chemistry , Cyclic AMP Receptor Protein/metabolism , DNA/chemistry , DNA/metabolism , DNA-Binding Proteins/metabolism , Internet , Position-Specific Scoring Matrices , Protein Structure, Tertiary , User-Computer Interface
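The abstract above says that estimated affinity changes after base substitutions can be summarized as a PWM, but it does not give the formula. One common convention, assumed here and not confirmed as what DBD2BS does, is to turn energy changes into Boltzmann weights and normalize each position. The function name `energies_to_pwm` and the example energies are hypothetical.

```python
import math

def energies_to_pwm(delta_e, kt=0.593):
    """Convert per-position energy changes (kcal/mol) into base probabilities.

    delta_e: list of dicts mapping 'A', 'C', 'G', 'T' to the estimated energy
    change when that base occupies the position (lower means tighter binding).
    kt: thermal energy in kcal/mol (about 0.593 near 298 K).
    """
    pwm = []
    for column in delta_e:
        weights = {base: math.exp(-e / kt) for base, e in column.items()}
        total = sum(weights.values())
        pwm.append({base: w / total for base, w in weights.items()})
    return pwm

# Made-up energies: position 0 prefers A, position 1 has no preference.
pwm = energies_to_pwm([
    {"A": 0.0, "C": 1.0, "G": 1.0, "T": 1.0},
    {"A": 0.5, "C": 0.5, "G": 0.5, "T": 0.5},
])
for column in pwm:
    print({base: round(p, 3) for base, p in column.items()})
```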
6.
Bioinformatics ; 28(16): 2162-8, 2012 Aug 15.
Article in English | MEDLINE | ID: mdl-22753780

ABSTRACT

MOTIVATION: Determination of the binding affinity of a protein-ligand complex is important to quantitatively specify whether a particular small molecule will bind to the target protein. Besides, collection of comprehensive datasets for protein-ligand complexes and their corresponding binding affinities is crucial in developing accurate scoring functions for the prediction of the binding affinities of previously unknown protein-ligand complexes. In the past decades, several databases of protein-ligand-binding affinities have been created via visual extraction from literature. However, such approaches are time-consuming and most of these databases are updated only a few times per year. Hence, there is an immediate demand for an automatic extraction method with high precision for binding affinity collection. RESULT: We have created a new database of protein-ligand-binding affinity data, AutoBind, based on automatic information retrieval. We first compiled a collection of 1586 articles where the binding affinities have been marked manually. Based on this annotated collection, we designed four sentence patterns that are used to scan full-text articles as well as a scoring function to rank the sentences that match our patterns. The proposed sentence patterns can effectively identify the binding affinities in full-text articles. Our assessment shows that AutoBind achieved 84.22% precision and 79.07% recall on the testing corpus. Currently, 13 616 protein-ligand complexes and the corresponding binding affinities have been deposited in AutoBind from 17 221 articles. AVAILABILITY: AutoBind is automatically updated on a monthly basis, and it is freely available at http://autobind.csie.ncku.edu.tw/ and http://autobind.mc.ntu.edu.tw/. All of the deposited binding affinities have been refined and approved manually before being released.


Subject(s)
Databases, Factual , Information Storage and Retrieval/methods , Ligands , Protein Binding , Software , Algorithms , Computational Biology/methods
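The four sentence patterns used by AutoBind are not listed in the abstract above. As a rough illustration of pattern-based extraction and of how precision and recall are computed, the sketch below uses a single hypothetical regular expression (affinity type, linking word, number, unit). The pattern, the function names and the example sentence are all assumptions.

```python
import re

# Hypothetical pattern: an affinity type, a linking word or symbol, a number, a unit.
AFFINITY = re.compile(
    r"\b(?P<kind>Kd|Ki|IC50)\b\s*(?:of|=|was|is)\s*"
    r"(?P<value>\d+(?:\.\d+)?)\s*(?P<unit>pM|nM|uM|mM)\b"
)

def extract_affinities(sentence):
    """Return (kind, value, unit) tuples for every match in the sentence."""
    return [(m["kind"], float(m["value"]), m["unit"]) for m in AFFINITY.finditer(sentence)]

def precision_recall(predicted, gold):
    """Precision = correct / extracted; recall = correct / manually annotated."""
    predicted, gold = set(predicted), set(gold)
    true_pos = len(predicted & gold)
    precision = true_pos / len(predicted) if predicted else 0.0
    recall = true_pos / len(gold) if gold else 0.0
    return precision, recall

sentence = "The inhibitor bound the kinase with a Ki of 12.5 nM and an IC50 = 40 nM."
found = extract_affinities(sentence)
gold = [("Ki", 12.5, "nM"), ("Kd", 3.0, "uM")]  # made-up manual annotation
print(found)
print(precision_recall(found, gold))
```

With this toy annotation, one of the two extracted values is correct and one of the two annotated values is recovered, so both precision and recall are 0.5.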
7.
Nucleic Acids Res ; 39(Database issue): D647-52, 2011 Jan.
Article in English | MEDLINE | ID: mdl-21045055

ABSTRACT

This study presents the Yeast Promoter Atlas (YPA, http://ypa.ee.ncku.edu.tw/ or http://ypa.csbb.ntu.edu.tw/) database, which aims to collect comprehensive promoter features in Saccharomyces cerevisiae. YPA integrates nine kinds of promoter features: promoter sequences, genes' transcription boundaries (transcription start sites (TSSs), five prime untranslated regions (5'-UTRs) and three prime untranslated regions (3'-UTRs)), TATA boxes, transcription factor binding sites (TFBSs), nucleosome occupancy, DNA bendability, transcription factor (TF) binding, TF knockout expression and TF-TF physical interaction. YPA is designed to present data in a unified manner as many important observations are revealed only when these promoter features are considered together. For example, DNA rigidity can prevent nucleosome packaging, thereby making TFBSs in the rigid DNA regions more accessible to TFs. Integrating nucleosome occupancy, DNA bendability, TF binding, TF knockout expression and TFBS data helps to identify which TFBS is actually functional. In YPA, various promoter features can be accessed in a centralized and organized platform. Researchers can easily see whether the TFBSs in a promoter of interest are occupied by nucleosomes or located in a rigid DNA segment, and whether the expression of the downstream gene responds to the knockout of the corresponding TFs. Compared to other established yeast promoter databases, YPA collects not only TFBSs but also many other promoter features to help biologists study transcriptional regulation.


Subject(s)
Databases, Nucleic Acid , Promoter Regions, Genetic , Saccharomyces cerevisiae/genetics , Binding Sites , Systems Integration , Transcription Factors/metabolism , User-Computer Interface
8.
J Adv Res ; 2023 Dec 29.
Article in English | MEDLINE | ID: mdl-38159844

ABSTRACT

INTRODUCTION: The population of Taiwan has a long history of ethno-cultural evolution. The Taiwanese population was isolated from other large populations such as the European, Han Chinese, and Japanese populations. The Taiwan Biobank (TWB) project has built a nationwide database, particularly for personal whole-genome sequences (WGS), to facilitate basic and clinical collaboration nationally and internationally, making it one of the most valuable public datasets of the East Asian population. OBJECTIVES: This study provides comprehensive medical genomic findings from TWB WGS data, for better characterization of disease susceptibility and the choice of ideal treatment regimens in the Taiwanese population. METHODS: We reanalyzed 1496 WGS using a PrecisionFDA Truth challenge winner method, Sentieon DNAscope. Single nucleotide variants (SNV) and small insertions/deletions (INDEL) were benchmarked. We also analyzed pharmacogenomic (PGx) drug-associated alleles and copy number variants (CNV). Multiple practicing clinicians reviewed and curated the clinically significant variants. Variant annotations can be browsed at TaiwanGenomes (https://genomes.tw). RESULTS: We found that each participant had an average of 6,870.7 globally novel variants and 75.3% (831/1103) of the participants harbored at least one PharmGKB-selected high evidence level human leukocyte antigen (HLA) risk allele. There were 54 PharmGKB-reported high-level instances of evidence of Cytochrome P450 variant-drug pairs, with a population frequency of over 13.2%. We also identified 23 variants in the ACMG secondary finding V3 gene list from 25 participants, suggesting that 1.67% (25/1496) of the population harbors at least one medically actionable variant. Our carrier status analyses suggest that one in 25 couples (3.94%) would risk having offspring with at least one pathogenic variant, which is in line with rates found in Japan and Singapore. For pathogenic CNV, we detected 6.88% and 2.02% carrier rates for alpha thalassemia and spinal muscular atrophy, respectively. CONCLUSION: Our study highlights the overall medical insights of a complete Taiwanese genomic profile.

9.
BMC Genomics ; 13 Suppl 1: S11, 2012.
Article in English | MEDLINE | ID: mdl-22369481

ABSTRACT

BACKGROUND: Head-to-head (h2h) genes are prone to be associated in expression and in functionality and have been shown to be conserved in evolution. Currently there are many studies on such h2h gene pairs. We found that the previous studies focused heavily on the human genome. Furthermore, they were limited to analyses that require only gene or protein sequences and did not conduct a systematic investigation of other promoter features such as the binding evidence of specific transcription factors (TFs). This is mainly because the resources for higher organisms, although these organisms are of greater interest, are less complete than those for model organisms such as Saccharomyces cerevisiae. The authors of this study recently integrated nine promoter features of 6603 genes of S. cerevisiae from six databases and five papers. These resources are suitable for a comprehensive analysis of h2h genes in S. cerevisiae. RESULTS: This study analyzed various promoter features, including transcription boundaries (TSS, 5'UTR and 3'UTR), TATA box, TF binding evidence, TF regulation evidence, DNA bendability and nucleosome occupancy. The expression profiles and gene ontology (GO) annotations were used to measure whether two genes are associated. Based on these promoter features, we found that i) the frequency of h2h genes was close to the expectation, namely they were not unusually frequent in the genome; ii) the distance between the TSSs of most h2h genes fell into the range of 0-600 bp and was more concentrated in 0-200 bp for the highly associated ones; iii) the number of TFs that regulate both h2h genes influenced the co-expression and co-function of the genes, while the number of TFs that bind both h2h genes influenced only the co-expression of the genes; iv) the association of two h2h genes was influenced by the existence of specific TFs such as STP2; v) the association of h2h genes whose bidirectional promoters have no TATA box was slightly higher than that of those whose promoters have TATA boxes; vi) the association of two h2h genes was not influenced by the DNA bendability and nucleosome occupancy. CONCLUSIONS: This study analyzed h2h genes with various promoter features that have not been used in analyzing h2h genes. The results can be applied to other genomes to confirm whether the observations of this study are limited to S. cerevisiae or universal in most organisms.


Subject(s)
Promoter Regions, Genetic/genetics , Saccharomyces cerevisiae Proteins/genetics , Genome, Fungal/genetics , Transcription Factors/genetics
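To make the notion of a head-to-head pair concrete, here is a minimal sketch that finds adjacent genes on opposite strands transcribed away from each other and reports the distance between their TSSs. It is an illustration under stated assumptions, not the study's pipeline: gene records are simplified to 1-based coordinates with `start <= end`, the TSS is taken as the gene start on the plus strand and the gene end on the minus strand, and the 600 bp default merely mirrors the range mentioned in the abstract above. The `Gene` record and the example genes are made up.

```python
from collections import namedtuple

Gene = namedtuple("Gene", "name chrom strand start end")  # 1-based, start <= end

def tss(gene):
    """Transcription start site: left edge on '+', right edge on '-'."""
    return gene.start if gene.strand == "+" else gene.end

def head_to_head_pairs(genes, max_distance=600):
    """Adjacent genes on opposite strands that are transcribed away from each other."""
    pairs = []
    ordered = sorted(genes, key=lambda g: (g.chrom, g.start))
    for left, right in zip(ordered, ordered[1:]):
        if left.chrom != right.chrom:
            continue
        # Divergent orientation: minus-strand gene on the left, plus-strand on the right.
        if left.strand == "-" and right.strand == "+":
            distance = tss(right) - tss(left)
            if 0 <= distance <= max_distance:
                pairs.append((left.name, right.name, distance))
    return pairs

genes = [
    Gene("A", "chrI", "-", 100, 500),
    Gene("B", "chrI", "+", 650, 1200),
    Gene("C", "chrI", "+", 1500, 2000),
    Gene("D", "chrII", "-", 100, 300),
    Gene("E", "chrII", "+", 2000, 2500),
]
print(head_to_head_pairs(genes))
```

Only A and B qualify here: B and C share a strand, C and D lie on different chromosomes, and D and E are 1700 bp apart.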
10.
BMC Bioinformatics ; 12 Suppl 1: S32, 2011 Feb 15.
Article in English | MEDLINE | ID: mdl-21342563

ABSTRACT

BACKGROUND: A common assumption about enzyme active sites is that their structures are highly conserved to specifically distinguish between closely similar compounds. However, with the discovery of distinct enzymes with similar reaction chemistries, more and more studies discussing the structural flexibility of the active site have been conducted. RESULTS: Most of the existing works on the flexibility of active sites focus on a set of pre-selected active sites that were already known to be flexible. This study, on the other hand, proposes an analysis framework composed of a new data collecting strategy, a local structure alignment tool and several physicochemical measures derived from the alignments. The method proposed to identify flexible active sites is highly automated and robust so that more extensive studies will be feasible in the future. The experimental results show the proposed method is (a) consistent with previous works based on manually identified flexible active sites and (b) capable of identifying potentially new flexible active sites. CONCLUSIONS: This proposed analysis framework and the former analyses on flexibility have their own advantages and disadvantages, depending on the cause of the flexibility. In this regard, this study proposes an alternative that complements previous studies and helps to construct a more comprehensive view of the flexibility of enzyme active sites.


Subject(s)
Catalytic Domain , Enzymes/chemistry , Algorithms , Binding Sites , Computational Biology/methods , Protein Conformation , Sequence Alignment , Sequence Analysis, Protein , Structure-Activity Relationship
11.
Nucleic Acids Res ; 37(Web Server issue): W552-8, 2009 Jul.
Article in English | MEDLINE | ID: mdl-19477961

ABSTRACT

Sequence motifs are important in the study of molecular biology. Motif discovery tools efficiently deliver many function-related signatures of proteins and largely facilitate sequence annotation. As increasing numbers of motifs are detected experimentally or predicted computationally, characterizing the functional roles of motifs and identifying the potential synergetic relationships between them are important next steps. A good way to investigate novel motifs is to utilize the abundant 3D structures that have also been accumulated at an astounding rate in recent years. This article reports the development of the web service seeMotif, which provides users with an interactive interface for visualizing sequence motifs on protein structures from the Protein Data Bank (PDB). Researchers can quickly see the locations and conformation of multiple motifs among a number of related structures simultaneously. Considering the fact that PDB sequences are usually shorter than those in sequence databases and/or may have missing residues, seeMotif has two complementary approaches for selecting structures and mapping motifs to protein chains in structures. As more and more structures belonging to previously uncharacterized protein families become available, combining sequence and structure information gives good opportunities to facilitate understanding of protein functions in large-scale genome projects. Available at: http://seemotif.csie.ntu.edu.tw, http://seemotif.ee.ncku.edu.tw or http://seemotif.csbb.ntu.edu.tw.


Subject(s)
Amino Acid Motifs , Software , Computer Graphics , Databases, Protein , Internet , Models, Molecular , User-Computer Interface
12.
BMC Bioinformatics ; 11 Suppl 1: S3, 2010 Jan 18.
Article in English | MEDLINE | ID: mdl-20122202

ABSTRACT

BACKGROUND: Many biological functions involve various protein-protein interactions (PPIs). Elucidating such interactions is crucial for understanding general principles of cellular systems. Previous studies have shown the potential of predicting PPIs based on only sequence information. Compared to approaches that require other auxiliary information, these sequence-based approaches can be applied to a broader range of applications. RESULTS: This study presents a novel sequence-based method based on the assumption that protein-protein interactions are more related to amino acids at the surface than those at the core. The present method considers surface information and maintains the advantage of relying on only sequence data by including an accessible surface area (ASA) predictor recently proposed by the authors. This study also reports the experiments conducted to evaluate a) the performance of PPI prediction achieved by including the predicted surface and b) the quality of the predicted surface in comparison with the surface obtained from structures. The experimental results show that surface information helps to predict interacting protein pairs. Furthermore, the prediction performance achieved by using the surface estimated with the ASA predictor is close to that using the surface obtained from protein structures. CONCLUSION: This work presents a sequence-based method that takes into account surface information for predicting PPIs. The proposed procedure of surface identification improves the prediction performance by 5.1% in F-measure. The extracted surfaces are also valuable in other biomedical applications that require similar information.


Subject(s)
Amino Acid Sequence , Protein Interaction Mapping/methods , Proteins/chemistry , Proteins/metabolism , Binding Sites , Databases, Protein , Models, Molecular , Proteomics/methods , Sequence Analysis, Protein/methods , Structure-Activity Relationship
13.
BMC Bioinformatics ; 11: 167, 2010 Apr 02.
Article in English | MEDLINE | ID: mdl-20361868

ABSTRACT

BACKGROUND: Elucidating protein-protein interactions (PPIs) is essential to constructing protein interaction networks and facilitating our understanding of the general principles of biological systems. Previous studies have revealed that interacting protein pairs can be predicted by their primary structure. Most of these approaches have achieved satisfactory performance on datasets comprising equal numbers of interacting and non-interacting protein pairs. However, this ratio is highly unbalanced in nature, and these techniques have not been comprehensively evaluated with respect to the effect of the large number of non-interacting pairs in realistic datasets. Moreover, since highly unbalanced distributions usually lead to large datasets, more efficient predictors are desired when handling such challenging tasks. RESULTS: This study presents a method for PPI prediction based only on sequence information, which contributes in three aspects. First, we propose a probability-based mechanism for transforming protein sequences into feature vectors. Second, the proposed predictor is designed with an efficient classification algorithm, where the efficiency is essential for handling highly unbalanced datasets. Third, the proposed PPI predictor is assessed with several unbalanced datasets with different positive-to-negative ratios (from 1:1 to 1:15). This analysis provides solid evidence that the degree of dataset imbalance is important to PPI predictors. CONCLUSIONS: Dealing with data imbalance is a key issue in PPI prediction since there are far fewer interacting protein pairs than non-interacting ones. This article provides a comprehensive study on this issue and develops a practical tool that achieves both good prediction performance and efficiency using only protein sequence information.


Subject(s)
Protein Interaction Mapping/methods , Proteins/chemistry , Proteomics/methods , Amino Acid Sequence , Binding Sites , Databases, Protein , Proteins/metabolism , Sequence Analysis, Protein
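Why the positive-to-negative ratio matters can be shown with simple arithmetic. The sketch below assumes a hypothetical predictor with fixed sensitivity and specificity (both 0.9, values chosen only for illustration and not taken from the study) and computes the precision it would be expected to reach at different ratios within the 1:1 to 1:15 range mentioned in the abstract above.

```python
def precision_at_ratio(sensitivity, specificity, negatives_per_positive):
    """Expected precision when each positive pair comes with N negative pairs."""
    true_pos = sensitivity                                   # per positive pair
    false_pos = (1.0 - specificity) * negatives_per_positive
    return true_pos / (true_pos + false_pos)

for n in (1, 5, 15):
    print(f"1:{n}  precision = {precision_at_ratio(0.9, 0.9, n):.3f}")
```

The same predictor that looks 90% precise on a balanced dataset drops to 37.5% precision at 1:15, because false positives scale with the number of negatives while true positives do not.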
14.
BMC Bioinformatics ; 11 Suppl 1: S52, 2010 Jan 18.
Article in English | MEDLINE | ID: mdl-20122227

ABSTRACT

BACKGROUND: MicroRNAs (miRNAs) are short non-coding RNA molecules, which play an important role in post-transcriptional regulation of gene expression. There have been many efforts to discover miRNA precursors (pre-miRNAs) over the years. Recently, ab initio approaches have attracted more attention because they do not depend on homology information and provide broader applications than comparative approaches. Kernel based classifiers such as support vector machine (SVM) are extensively adopted in these ab initio approaches due to the prediction performance they achieved. On the other hand, logic based classifiers such as decision tree, of which the constructed model is interpretable, have attracted less attention. RESULTS: This article reports the design of a predictor of pre-miRNAs with a novel kernel based classifier named the generalized Gaussian density estimator (G2DE) based classifier. The G2DE is a kernel based algorithm designed to provide interpretability by utilizing a few but representative kernels for constructing the classification model. The performance of the proposed predictor has been evaluated with 692 human pre-miRNAs and has been compared with two kernel based and two logic based classifiers. The experimental results show that the proposed predictor is capable of achieving prediction performance comparable to those delivered by the prevailing kernel based classification algorithms, while providing the user with an overall picture of the distribution of the data set. CONCLUSION: Software predictors that identify pre-miRNAs in genomic sequences have been exploited by biologists to facilitate molecular biology research in recent years. The G2DE employed in this study can deliver prediction accuracy comparable with the state-of-the-art kernel based machine learning algorithms. Furthermore, biologists can obtain valuable insights about the different characteristics of the sequences of pre-miRNAs with the models generated by the G2DE based predictor.


Subject(s)
Algorithms , Genomics/methods , MicroRNAs/chemistry , Base Sequence , Genome , Humans , MicroRNAs/metabolism , Sequence Analysis, RNA
15.
Nucleic Acids Res ; 36(Web Server issue): W291-6, 2008 Jul 01.
Article in English | MEDLINE | ID: mdl-18524800

ABSTRACT

Large-scale automatic annotation of protein sequences remains challenging in the postgenomic era. E1DS is designed for annotating enzyme sequences based on a repository of 1D signatures. The employed sequence signatures are derived using a novel pattern mining approach that discovers long motifs consisting of several sequential blocks (conserved segments). Each of the sequential blocks is considerably conserved among the protein members of an EC group. Moreover, a signature includes at least three sequential blocks that are concurrently conserved, i.e. frequently observed together in sequences. In other words, a sequence signature consists of residues from multiple regions of the protein sequence, which echoes the observation that an enzyme catalytic site is usually composed of residues that are largely separated in the sequence. E1DS currently contains 5421 sequence signatures that in total cover 932 4-digit EC numbers. E1DS is evaluated based on a collection of enzymes with catalytic sites annotated in Catalytic Site Atlas. When compared to the well-known pattern database PROSITE, predictions based on E1DS signatures are considered more sensitive in identifying catalytic sites and the involved residues. E1DS is available at http://e1ds.ee.ncku.edu.tw/ and a mirror site can be found at http://e1ds.csbb.ntu.edu.tw/.


Subject(s)
Catalytic Domain , Enzymes/chemistry , Software , Amino Acid Sequence , Conserved Sequence , Internet , Sequence Analysis, Protein , User-Computer Interface
16.
BMC Bioinformatics ; 9 Suppl 12: S2, 2008 Dec 12.
Article in English | MEDLINE | ID: mdl-19091019

ABSTRACT

BACKGROUND: MicroRNAs (miRNAs) are short non-coding RNA molecules participating in post-transcriptional regulation of gene expression. There have been many efforts to discover miRNA precursors (pre-miRNAs) over the years. Recently, ab initio approaches have attracted more attention because they can discover species-specific pre-miRNAs. Most ab initio approaches proposed novel features to characterize RNA molecules. However, there has been less discussion of the associated classification mechanism in a miRNA predictor. RESULTS: This study focuses on the classification algorithm for miRNA prediction. We develop a novel ab initio method, miR-KDE, in which most of the features are collected from previous works. The classification mechanism in miR-KDE is the relaxed variable kernel density estimator (RVKDE) that we have recently proposed. When compared to the widely used support vector machine (SVM), RVKDE exploits more local information of the training dataset. MiR-KDE is evaluated using a training set consisting of only human pre-miRNAs to predict a benchmark collected from 40 species. The experimental results show that miR-KDE delivers favorable performance in predicting human pre-miRNAs and has advantages for pre-miRNAs from genera taxonomically distant from humans. CONCLUSION: We use a novel classifier whose characteristic of exploiting local information is particularly suitable for predicting species-specific pre-miRNAs. This study also provides a comprehensive analysis from the view of classification mechanism. The good performance of miR-KDE encourages more efforts on the classification methodology as well as the feature extraction in miRNA prediction.


Subject(s)
Computational Biology/methods , MicroRNAs/metabolism , Algorithms , Artificial Intelligence , Gene Expression Regulation , Humans , Models, Statistical , Pattern Recognition, Automated , RNA/chemistry , Reproducibility of Results , Sequence Alignment/methods , Software , Species Specificity
17.
BMC Bioinformatics ; 9 Suppl 12: S12, 2008 Dec 12.
Article in English | MEDLINE | ID: mdl-19091011

ABSTRACT

BACKGROUND: Prediction of protein solvent accessibility, also called accessible surface area (ASA) prediction, is an important step in predicting tertiary structure directly from one-dimensional sequences. Traditionally, predicting solvent accessibility is regarded as either a two-state (exposed or buried) or three-state (exposed, intermediate or buried) classification problem. However, the states of solvent accessibility are not well defined in real protein structures. Thus, a number of methods have been developed to directly predict the real-valued ASA based on evolutionary information such as the position-specific scoring matrix (PSSM). RESULTS: This study enhances the PSSM-based features for real-valued ASA prediction by considering the physicochemical properties and solvent propensities of amino acid types. We propose a systematic method for identifying residue groups with respect to protein solvent accessibility. The amino acid columns in the PSSM profile that belong to a given residue group are merged to generate novel features. Finally, support vector regression (SVR) is adopted to construct a real-valued ASA predictor. Experimental results demonstrate that the features produced by the proposed selection process are informative for ASA prediction. CONCLUSION: Experimental results based on a widely used benchmark reveal that the proposed method performs best among several existing packages for ASA prediction. Furthermore, the feature selection mechanism incorporated in this study can be applied to other regression problems that use the PSSM. The program and data are available from the authors upon request.
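
The column-merging step can be sketched as follows. The abstract does not say how columns are combined, so summation is an assumption here, and the residue groups, the function name `merge_pssm_columns`, and the one-row PSSM are hypothetical illustrations rather than the groups identified in the study.

```python
from typing import Dict, List


def merge_pssm_columns(
    pssm: List[Dict[str, float]],
    groups: Dict[str, str],
) -> List[List[float]]:
    """Build one feature per residue group for every sequence position.

    pssm   : one dict per position, mapping amino acid letter to score.
    groups : group name -> string of amino acid letters in that group.
    Assumption: the columns of a group are merged by summing their scores.
    Features are emitted in alphabetical order of group name.
    """
    names = sorted(groups)
    return [
        [sum(row[aa] for aa in groups[name]) for name in names]
        for row in pssm
    ]


# Hypothetical groups and a single-position PSSM restricted to four residues.
groups = {"hydrophobic": "AV", "basic": "KR"}
pssm = [{"A": 1.0, "V": 2.0, "K": -1.0, "R": 0.5}]
features = merge_pssm_columns(pssm, groups)
```

The merged features could then be passed, alone or alongside the original PSSM columns, to any regression model; the study uses support vector regression for that final step.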


Subject(s)
Computational Biology/methods , Databases, Protein , Proteins/chemistry , Solvents/chemistry , Algorithms , Amino Acids/chemistry , Chemistry, Physical/methods , Models, Statistical , Protein Conformation , Regression Analysis , Reproducibility of Results , Sequence Analysis, Protein , Software , Surface Properties
18.
Nucleic Acids Res ; 34(Web Server issue): W303-9, 2006 Jul 01.
Article in English | MEDLINE | ID: mdl-16845015

ABSTRACT

UNLABELLED: Geometrical analysis of protein tertiary substructures has been an effective approach for predicting protein binding sites. This article presents the Protemot web server, which predicts protein binding sites based on structural templates automatically extracted from the crystal structures of protein-ligand complexes in the PDB (Protein Data Bank). The automatic extraction mechanism is essential for creating and maintaining a comprehensive template library that promptly accommodates each new PDB release as the number of entries continues to grow rapidly. The design of Protemot is also distinctive in the mechanism employed to expedite the analysis process that matches the tertiary substructures on the contour of the query protein against the templates in the library. This expediting mechanism is essential for providing a reasonable response time to the user, because the template library grows along with the PDB. This article also reports the experiments conducted to evaluate the prediction power of the Protemot web server. Experimental results show that Protemot delivers greater prediction power than a web server based on a manually curated template library with an insufficient number of entries. AVAILABILITY: http://protemot.csie.ntu.edu.tw/step1.cgi http://bioinfo.mc.ntu.edu.tw/protemot/step1.cgi.


Subject(s)
Protein Structure, Tertiary , Software , Structural Homology, Protein , Binding Sites , Crystallography, X-Ray , Databases, Protein , Internet , Ligands , Models, Molecular , Proteins/chemistry , Proteins/metabolism , User-Computer Interface
19.
Article in English | MEDLINE | ID: mdl-27016699

ABSTRACT

In many biological processes, proteins have important interactions with various molecules such as proteins, ions or ligands. Many proteins undergo conformational changes upon these interactions, and the regions with large conformational changes are critical to the interactions. This work presents the CCProf platform, which provides the conformational changes of entire proteins, here called conformational change profiles (CCPs). CCProf aims to be a platform where users can study potential causes of novel conformational changes. It provides 10 biological features: conformational change, potential binding target site, secondary structure, conservation, disorder propensity, hydropathy propensity, sequence domain, structural domain, phosphorylation site and catalytic site. All of this information is integrated into a well-aligned view, so that researchers can see relationships between different biological features visually. CCProf contains 986,187 protein structure pairs for 3123 proteins. In addition, CCProf provides a 3D view in which users can see the protein structures before and after conformational changes as well as the binding targets that induce those changes. All information shown in CCProf (e.g. CCPs, binding targets and protein structures), including intermediate data, is available for download to expedite further analyses. Database URL: http://zoro.ee.ncku.edu.tw/ccprof/.


Subject(s)
Databases, Protein , Proteins/chemistry , Binding Sites , Conserved Sequence , Ligands , Protein Conformation , Search Engine , User-Computer Interface
20.
Article in English | MEDLINE | ID: mdl-27242036

ABSTRACT

In eukaryotic cells, transcriptional regulation of gene expression is usually accomplished by cooperative Transcription Factors (TFs). Therefore, knowing which TFs cooperate is helpful for uncovering the mechanisms of transcriptional regulation. In yeast, many cooperative TF pairs have been predicted by various algorithms in the literature. However, there has so far been no database that collects the yeast cooperative TFs predicted by existing algorithms. This prompted us to construct the Cooperative Transcription Factors Database (CoopTFD), which has a comprehensive collection of 2622 predicted cooperative TF pairs (PCTFPs) in yeast from 17 existing algorithms. For each PCTFP, our database also provides five types of validation information: (i) the algorithms that predict this PCTFP, (ii) the publications that experimentally show that this PCTFP has physical or genetic interactions, (iii) the publications that experimentally study the biological roles of both TFs of this PCTFP, (iv) the common Gene Ontology (GO) terms of this PCTFP and (v) the common target genes of this PCTFP. Based on the provided validation information, users can judge the biological plausibility of a PCTFP of interest. We believe that CoopTFD will be a valuable resource for yeast biologists to study the combinatorial regulation of gene expression controlled by cooperative TFs. Database URL: http://cosbi.ee.ncku.edu.tw/CoopTFD/ or http://cosbi2.ee.ncku.edu.tw/CoopTFD/.


Subject(s)
Computational Biology/methods , Databases, Genetic , Gene Expression Regulation, Fungal/genetics , Saccharomyces cerevisiae Proteins/genetics , Transcription Factors/genetics , Algorithms