Results 1 - 20 of 49
1.
J Proteome Res ; 19(11): 4624-4636, 2020 11 06.
Article in English | MEDLINE | ID: mdl-32654489

ABSTRACT

There have been more than 2.2 million confirmed cases and over 120 000 deaths from the human coronavirus disease 2019 (COVID-19) pandemic, caused by the novel severe acute respiratory syndrome coronavirus (SARS-CoV-2), in the United States alone. However, there is currently a lack of proven effective medications against COVID-19. Drug repurposing offers a promising route for the development of prevention and treatment strategies for COVID-19. This study reports an integrative, network-based deep-learning methodology to identify repurposable drugs for COVID-19 (termed CoV-KGE). Specifically, we built a comprehensive knowledge graph that includes 15 million edges across 39 types of relationships connecting drugs, diseases, proteins/genes, pathways, and expression from a large scientific corpus of 24 million PubMed publications. Using Amazon's AWS computing resources and a network-based, deep-learning framework, we identified 41 repurposable drugs (including dexamethasone, indomethacin, niclosamide, and toremifene) whose therapeutic associations with COVID-19 were validated by transcriptomic and proteomics data in SARS-CoV-2-infected human cells and data from ongoing clinical trials. Whereas this study by no means recommends specific drugs, it demonstrates a powerful deep-learning methodology to prioritize existing drugs for further investigation, which holds the potential to accelerate therapeutic development for COVID-19.


Subject(s)
Betacoronavirus , Coronavirus Infections , Deep Learning , Drug Repositioning/methods , Pandemics , Pneumonia, Viral , Antiviral Agents , COVID-19 , Coronavirus Infections/drug therapy , Coronavirus Infections/virology , Humans , Pneumonia, Viral/drug therapy , Pneumonia, Viral/virology , Proteome , SARS-CoV-2 , Transcriptome
2.
Biotechnol Bioeng ; 111(4): 770-81, 2014 Apr.
Article in English | MEDLINE | ID: mdl-24249083

ABSTRACT

Baby Hamster Kidney (BHK) cell lines are used in the production of veterinary vaccines and recombinant proteins. To facilitate transcriptome analysis of BHK cell lines, we embarked on an effort to sequence, assemble, and annotate transcript sequences from a recombinant BHK cell line and Syrian hamster liver and brain. RNA-seq data were supplemented with 6,170 Sanger ESTs from parental and recombinant BHK lines to generate 221,583 contigs. Annotation by homology to other species, primarily mouse, yielded more than 15,000 unique Ensembl mouse gene IDs with high coverage of KEGG canonical pathways. High coverage of enzymes and isoforms was seen for cell metabolism and N-glycosylation pathways, areas of highest interest for biopharmaceutical production. With the high sequencing depth in RNA-seq data, we set out to identify single-nucleotide variants in the transcripts. A majority of the high-confidence variants detected in both hamster tissue libraries occurred at a frequency of 50%, indicating their origin as heterozygous germline variants. In contrast, the cell line libraries' variants showed a wide range of occurrence frequency, indicating the presence of a heterogeneous population in cultured cells. The extremely high coverage of transcripts of highly abundant genes in RNA-seq enabled us to identify low-frequency variants. Experimental verification through Sanger sequencing confirmed the presence of two variants in the cDNA of a highly expressed gene in the BHK cell line. Furthermore, we detected seven potential missense mutations in the genes of the growth signaling pathways that may have arisen during the cell line derivation process. The development and characterization of a BHK reference transcriptome will facilitate future efforts to understand, monitor, and manipulate BHK cells. Our study on sequencing variants is crucial for improving the understanding of the errors inherent in high-throughput sequencing and for increasing the accuracy of variant calling in BHK or other systems.
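
The link between variant frequency and zygosity described in this abstract can be illustrated with a minimal sketch. It assumes per-site reference and alternate read counts are already available; the function names, the example read counts, and the 0.4-0.6 window are illustrative assumptions and are not taken from the study.

```python
def allele_frequency(ref_reads: int, alt_reads: int) -> float:
    """Fraction of reads at a site that carry the alternate allele."""
    total = ref_reads + alt_reads
    if total == 0:
        raise ValueError("site has no read coverage")
    return alt_reads / total


def looks_heterozygous(freq: float, low: float = 0.4, high: float = 0.6) -> bool:
    """True when the frequency falls inside an assumed window around 50%."""
    return low <= freq <= high


# Hypothetical (reference, alternate) read counts for three sites.
sites = {"site_a": (52, 48), "site_b": (970, 30), "site_c": (10, 90)}

frequencies = {name: allele_frequency(r, a) for name, (r, a) in sites.items()}
heterozygous_like = [n for n, f in frequencies.items() if looks_heterozygous(f)]
```

In this toy input only `site_a` sits near 50%; a site such as `site_b` would only be detectable as a low-frequency variant when coverage is deep, which is the situation the abstract describes for highly expressed genes.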


Subject(s)
Gene Expression Profiling/methods , High-Throughput Nucleotide Sequencing/methods , Transcriptome/genetics , Animals , Brain/metabolism , Brain Chemistry , Cell Line , Cricetinae , Female , Glycolysis , Liver/chemistry , Liver/metabolism , Mesocricetus , Organ Specificity , Polysaccharides , RNA, Messenger/analysis , RNA, Messenger/genetics , RNA, Messenger/metabolism , Sequence Analysis, RNA
3.
Proteins ; 81(5): 754-73, 2013 May.
Article in English | MEDLINE | ID: mdl-23184763

ABSTRACT

Coarse-grained models for protein structure are increasingly used in simulations and structural bioinformatics. In this study, we evaluated the effectiveness of three granularities of protein representation based on their ability to discriminate between correctly folded native structures and incorrectly folded decoy structures. The three levels of representation used one bead per amino acid (coarse), two beads per amino acid (medium), and all atoms (fine). Multiple structure features were compared at each representation level including two-body interactions, three-body interactions, solvent exposure, contact numbers, and angle bending. In most cases, the all-atom level was most successful at discriminating decoys, but the two-bead level provided a good compromise between the number of model parameters which must be estimated and the accuracy achieved. The most effective feature type appeared to be two-body interactions. Considering three-body interactions increased accuracy only marginally when all atoms were used and not at all in medium and coarse representations. Though two-body interactions were most effective for the coarse representations, the accuracy loss for using only solvent exposure or contact number was proportionally less at these levels than in the all-atom representation. We propose an optimization method capable of selecting bead types of different granularities to create a mixed representation of the protein. We illustrate its behavior on decoy discrimination and discuss implications for data-driven protein model selection.


Subject(s)
Artificial Intelligence , Proteins/chemistry , Models, Molecular , Protein Conformation , Solvents
4.
J Chem Inf Model ; 52(1): 38-50, 2012 Jan 23.
Article in English | MEDLINE | ID: mdl-22107358

ABSTRACT

The identification of small potent compounds that selectively bind to the target under consideration with high affinities is a critical step toward successful drug discovery. However, there is still a lack of efficient and accurate computational methods to predict compound selectivity properties. In this paper, we propose a set of machine learning methods to do compound selectivity prediction. In particular, we propose a novel cascaded learning method and a multitask learning method. The cascaded method decomposes the selectivity prediction into two steps, one model for each step, so as to effectively filter out nonselective compounds. The multitask method incorporates both activity and selectivity models into one multitask model so as to better differentiate compound selectivity properties. We conducted a comprehensive set of experiments and compared the results with those of other conventional selectivity prediction methods, and our results demonstrated that the cascaded and multitask methods significantly improve the selectivity prediction performance.
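
The two-step idea behind the cascaded method can be sketched as follows. This is only an illustration under stated assumptions: the abstract does not say what each step predicts, so the sketch assumes the first model screens on activity and the second on selectivity; `cascade_select`, the toy scores, and the 0.5 thresholds are all hypothetical.

```python
from typing import Callable, Iterable, List


def cascade_select(
    compounds: Iterable[str],
    first_model: Callable[[str], bool],
    second_model: Callable[[str], bool],
) -> List[str]:
    """Apply the second model only to compounds that pass the first model."""
    survivors = [c for c in compounds if first_model(c)]
    return [c for c in survivors if second_model(c)]


# Hypothetical (activity score, selectivity score) per compound.
toy_scores = {"c1": (0.9, 0.8), "c2": (0.9, 0.2), "c3": (0.1, 0.9)}


def step_one(compound: str) -> bool:
    """Assumed first step: keep compounds predicted to be active."""
    return toy_scores[compound][0] >= 0.5


def step_two(compound: str) -> bool:
    """Assumed second step: keep compounds predicted to be selective."""
    return toy_scores[compound][1] >= 0.5


selected = cascade_select(toy_scores, step_one, step_two)
```

Here `c3` never reaches the second model because it fails the first step, which is the filtering behavior a cascade is meant to provide.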


Subject(s)
Computational Biology/methods , Drug Discovery/methods , Small Molecule Libraries/chemistry , Software , Algorithms , Artificial Intelligence , Models, Biological , Neural Networks, Computer , Predictive Value of Tests
5.
Mol Inform ; 41(8): e2100321, 2022 08.
Article in English | MEDLINE | ID: mdl-35156325

ABSTRACT

In this work, we benchmark a variety of single- and multi-task graph neural network (GNN) models against lower-bar and higher-bar traditional machine learning approaches employing human engineered molecular features. We consider four GNN variants - Graph Convolutional Network (GCN), Graph Attention Network (GAT), Message Passing Neural Network (MPNN), and Attentive Fingerprint (AttentiveFP). So far, deep learning models have been primarily benchmarked using lower-bar traditional models solely based on fingerprints, while more realistic benchmarks employing fingerprints, whole-molecule descriptors and predictions from other related endpoints (e.g., LogD7.4) appear to be scarce for industrial ADME datasets. In addition to time-split test sets based on Genentech data, this study benefits from the availability of measurements from an external chemical space (Roche data). We identify GAT as a promising approach to implementing deep learning models. While all the deep learning models significantly outperform lower-bar benchmark traditional models solely based on fingerprints, only GATs seem to offer a small but consistent improvement over higher-bar benchmark traditional models. Finally, the accuracy of in vitro assays from different laboratories predicting the same experimental endpoints appears to be comparable with the accuracy of GAT single-task models, suggesting that most of the observed error from the models is a function of the experimental error propagation.


Subject(s)
Benchmarking , Neural Networks, Computer , Humans , Machine Learning
6.
Sci Rep ; 12(1): 4724, 2022 03 18.
Article in English | MEDLINE | ID: mdl-35304504

ABSTRACT

Effective and successful clinical trials are essential in developing new drugs and advancing new treatments. However, clinical trials are very expensive and prone to failure. The high cost and low success rate of clinical trials motivate research on inferring knowledge from existing clinical trials in innovative ways for designing future clinical trials. In this manuscript, we present our efforts on constructing the first publicly available Clinical Trials Knowledge Graph, denoted as [Formula: see text]. [Formula: see text] includes nodes representing medical entities in clinical trials (e.g., studies, drugs and conditions), and edges representing the relations among these entities (e.g., drugs used in studies). Our embedding analysis demonstrates the potential utilities of [Formula: see text] in various applications such as drug repurposing and similarity search, among others.


Subject(s)
Pattern Recognition, Automated
7.
IEEE Trans Neural Netw Learn Syst ; 33(6): 2378-2392, 2022 Jun.
Article in English | MEDLINE | ID: mdl-33819161

ABSTRACT

Anomaly detection on attributed networks attracts considerable research interest due to the wide applications of attributed networks in modeling a wide range of complex systems. Recently, deep learning-based anomaly detection methods have shown promising results over shallow approaches, especially on networks with high-dimensional attributes and complex structures. However, existing approaches, which employ a graph autoencoder as their backbone, do not fully exploit the rich information of the network, resulting in suboptimal performance. Furthermore, these methods do not directly target anomaly detection in their learning objective and fail to scale to large networks due to the full graph training mechanism. To overcome these limitations, in this article, we present a novel Contrastive self-supervised Learning framework for Anomaly detection on attributed networks (CoLA for abbreviation). Our framework fully exploits the local information from network data by sampling a novel type of contrastive instance pair, which can capture the relationship between each node and its neighboring substructure in an unsupervised way. Meanwhile, a well-designed graph neural network (GNN)-based contrastive learning model is proposed to learn informative embedding from high-dimensional attributes and local structure and measure the agreement of each instance pair with its outputted scores. The multiround predicted scores by the contrastive learning model are further used to evaluate the abnormality of each node with statistical estimation. In this way, the learning model is trained by a specific anomaly detection-aware target. Furthermore, since the input of the GNN module is batches of instance pairs instead of the full network, our framework can adapt to large networks flexibly. Experimental results show that our proposed framework outperforms the state-of-the-art baseline methods on all seven benchmark data sets.

8.
Article in English | MEDLINE | ID: mdl-35834453

ABSTRACT

This article aims to unify spatial dependency and temporal dependency in a non-Euclidean space while capturing the inner spatial-temporal dependencies for traffic data. For spatial-temporal attribute entities with topological structure, the space-time is consecutive and unified while each node's current status is influenced by its neighbors' past states over variant periods of each neighbor. Most spatial-temporal neural networks for traffic forecasting study spatial dependency and temporal correlation separately in processing, gravely impairing the spatial-temporal integrity, and ignore the fact that the neighbors' temporal dependency period for a node can be delayed and dynamic. To model this actual condition, we propose TraverseNet, a novel spatial-temporal graph neural network, viewing space and time as an inseparable whole, to mine spatial-temporal graphs while exploiting the evolving spatial-temporal dependencies for each node via message traverse mechanisms. Experiments with ablation and parameter studies have validated the effectiveness of the proposed TraverseNet, and the detailed implementation can be found at https://github.com/nnzhan/TraverseNet.

9.
ACS Omega ; 6(41): 27233-27238, 2021 Oct 19.
Article in English | MEDLINE | ID: mdl-34693143

ABSTRACT

Graph neural networks (GNNs) constitute a class of deep learning methods for graph data. They have wide applications in chemistry and biology, such as molecular property prediction, reaction prediction, and drug-target interaction prediction. Despite the interest, GNN-based modeling is challenging as it requires graph data preprocessing and modeling in addition to programming and deep learning. Here, we present Deep Graph Library (DGL)-LifeSci, an open-source package for deep learning on graphs in life science. Deep Graph Library (DGL)-LifeSci is a Python toolkit based on RDKit, PyTorch, and Deep Graph Library (DGL). DGL-LifeSci allows GNN-based modeling on custom datasets for molecular property prediction, reaction prediction, and molecule generation. With its command-line interfaces, users can perform modeling without any background in programming and deep learning. We test the command-line interfaces using standard benchmarks MoleculeNet, USPTO, and ZINC. Compared with previous implementations, DGL-LifeSci achieves a speedup of up to 6×. For modeling flexibility, DGL-LifeSci provides well-optimized modules for various stages of the modeling pipeline. In addition, DGL-LifeSci provides pretrained models for reproducing the test experiment results and applying models without training. The code is distributed under an Apache-2.0 License and is freely accessible at https://github.com/awslabs/dgl-lifesci.

10.
BMC Genomics ; 11: 578, 2010 Oct 18.
Article in English | MEDLINE | ID: mdl-20955611

ABSTRACT

BACKGROUND: The onset of antibiotics production in Streptomyces species is co-ordinated with differentiation events. An understanding of the genetic circuits that regulate these coupled biological phenomena is essential to discover and engineer the pharmacologically important natural products made by these species. The availability of genomic tools and access to a large warehouse of transcriptome data for the model organism, Streptomyces coelicolor, provides incentive to decipher the intricacies of the regulatory cascades and develop biologically meaningful hypotheses. RESULTS: In this study, more than 500 samples of genome-wide temporal transcriptome data, comprising wild-type and more than 25 regulatory gene mutants of Streptomyces coelicolor probed across multiple stress and medium conditions, were investigated. Information based on transcript and functional similarity was used to update a previously-predicted whole-genome operon map and further applied to predict transcriptional networks constituting modules enriched in diverse functions such as secondary metabolism, and sigma factor. The predicted network displays a scale-free architecture with a small-world property observed in many biological networks. The networks were further investigated to identify functionally-relevant modules that exhibit functional coherence and a consensus motif in the promoter elements indicative of DNA-binding elements. CONCLUSIONS: Despite the enormous experimental as well as computational challenges, a systems approach for integrating diverse genome-scale datasets to elucidate complex regulatory networks is beginning to emerge. We present an integrated analysis of transcriptome data and genomic features to refine a whole-genome operon map and to construct regulatory networks at the cistron level in Streptomyces coelicolor. The functionally-relevant modules identified in this study pose as potential targets for further studies and verification.


Subject(s)
Gene Regulatory Networks/genetics , Genome, Bacterial/genetics , Streptomyces coelicolor/genetics , Algorithms , Area Under Curve , Arginine/metabolism , Consensus Sequence/genetics , Operon/genetics , ROC Curve , Reproducibility of Results , Transcription, Genetic
11.
Bioinformatics ; 25(23): 3099-107, 2009 Dec 01.
Article in English | MEDLINE | ID: mdl-19786483

ABSTRACT

MOTIVATION: Identifying residues that interact with ligands is useful as a first step to understanding protein function and as an aid to designing small molecules that target the protein for interaction. Several studies have shown that sequence features are very informative for this type of prediction, while structure features have also been useful when structure is available. We develop a sequence-based method, called LIBRUS, that combines homology-based transfer and direct prediction using machine learning and compare it to previous sequence-based work and current structure-based methods. RESULTS: Our analysis shows that homology-based transfer is slightly more discriminating than a support vector machine learner using profiles and predicted secondary structure. We combine these two approaches in a method called LIBRUS. On a benchmark of 885 sequence-independent proteins, it achieves an area under the ROC curve (ROC) of 0.83 with 45% precision at 50% recall, a significant improvement over previous sequence-based efforts. On an independent benchmark set, a current method, FINDSITE, based on structure features achieves an ROC of 0.81 with 54% precision at 50% recall, while LIBRUS achieves an ROC of 0.82 with 39% precision at 50% recall at a smaller computational cost. When LIBRUS and FINDSITE predictions are combined, performance is increased beyond either reaching an ROC of 0.86 and 59% precision at 50% recall. AVAILABILITY: Software developed for this study is available at http://bioinfo.cs.umn.edu/supplements/binf2009 along with Supplementary data on the study.
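
The two metrics quoted in this abstract, area under the ROC curve and precision at 50% recall, can be made concrete with a minimal sketch. It is not the authors' evaluation code: the functions, the pairwise definition of the area under the curve, the rank-cutoff rule for precision, and the toy scores are assumptions chosen for illustration.

```python
def roc_auc(scores, labels):
    """Chance that a random positive outscores a random negative (ties count 0.5)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def precision_at_recall(scores, labels, target_recall=0.5):
    """Precision at the first rank cutoff whose recall reaches target_recall."""
    total_pos = sum(labels)
    if total_pos == 0:
        raise ValueError("need at least one positive example")
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    true_pos = 0
    for rank, (_, label) in enumerate(ranked, start=1):
        true_pos += label
        if true_pos / total_pos >= target_recall:
            return true_pos / rank
    raise ValueError("target recall was never reached")


# Hypothetical per-residue scores and binding labels (1 = ligand-binding).
toy_scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]
toy_labels = [1, 0, 1, 0, 1, 0]

auc = roc_auc(toy_scores, toy_labels)
precision = precision_at_recall(toy_scores, toy_labels, target_recall=0.5)
```

The pairwise form of the area under the curve is quadratic in the number of residues, so it suits small illustrations rather than large benchmarks.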


Subject(s)
Artificial Intelligence , Computational Biology/methods , Proteins/chemistry , Sequence Analysis, Protein/methods , Software , Binding Sites , Databases, Protein , Ligands , Proteins/metabolism , Sequence Homology, Amino Acid
12.
J Chem Inf Model ; 50(6): 979-91, 2010 Jun 28.
Article in English | MEDLINE | ID: mdl-20536191

ABSTRACT

With de novo rational drug design, scientists can rapidly generate a very large number of potentially biologically active probes. However, many of them may be synthetically infeasible and, therefore, of limited value to drug developers. On the other hand, most of the tools for synthetic accessibility evaluation are very slow and can process only a few molecules per minute. In this study, we present two approaches to quickly predict the synthetic accessibility of chemical compounds by utilizing support vector machines operating on molecular descriptors. The first approach, RSsvm, is designed to identify the compounds that can be synthesized using a specific set of reactions and starting materials and builds its model by training on the compounds identified as synthetically accessible or not by retrosynthetic analysis. The second approach, DRsvm, is designed to provide a more general assessment of synthetic accessibility that is not tied to any set of reactions or starting materials. The training set compounds for this approach are selected from a diverse library based on the number of other similar compounds within the same library. Both approaches have been shown to perform very well in their corresponding areas of applicability with the RSsvm achieving a receiver operator characteristic score of 0.952 in cross-validation experiments and the DRsvm achieving a score of 0.888 on an independent set of compounds. Our implementations can successfully process thousands of compounds per minute.


Subject(s)
Artificial Intelligence , Drug Design , ROC Curve , Reproducibility of Results , Small Molecule Libraries/chemical synthesis , Small Molecule Libraries/chemistry
13.
BMC Bioinformatics ; 10: 439, 2009 Dec 22.
Article in English | MEDLINE | ID: mdl-20028521

ABSTRACT

BACKGROUND: Over the last decade several prediction methods have been developed for determining the structural and functional properties of individual protein residues using sequence and sequence-derived information. Most of these methods are based on support vector machines as they provide accurate and generalizable prediction models. RESULTS: We present a general purpose protein residue annotation toolkit (svmPRAT) to allow biologists to formulate residue-wise prediction problems. svmPRAT formulates the annotation problem as a classification or regression problem using support vector machines. One of the key features of svmPRAT is its ease of use in incorporating any user-provided information in the form of feature matrices. For every residue, svmPRAT captures local information around the residue to create fixed-length feature vectors. svmPRAT implements accurate and fast kernel functions, and also introduces a flexible window-based encoding scheme that accurately captures signals and patterns for training effective predictive models. CONCLUSIONS: In this work we evaluate svmPRAT on several classification and regression problems including disorder prediction, residue-wise contact order estimation, DNA-binding site prediction, and local structure alphabet prediction. svmPRAT has also been used for the development of a state-of-the-art transmembrane helix prediction method called TOPTMH and a secondary structure prediction method called YASSPP. The toolkit provides practitioners with an efficient and easy-to-use tool for a wide variety of annotation problems. AVAILABILITY: http://www.cs.gmu.edu/~mlbio/svmprat.


Subject(s)
Proteins/chemistry , Sequence Analysis, Protein/methods , Software , Artificial Intelligence , Binding Sites , Databases, Protein , Pattern Recognition, Automated , Protein Folding , Protein Structure, Secondary
14.
BMC Struct Biol ; 9: 41, 2009 Jun 30.
Article in English | MEDLINE | ID: mdl-19566958

ABSTRACT

BACKGROUND: Methods that can automatically assess the quality of computationally predicted protein structures are important, as they enable the selection of the most accurate structure from an ensemble of predictions. Assessment methods that determine the quality of a predicted structure by comparing it against the various structures predicted by different servers have been shown to outperform approaches that rely on the intrinsic characteristics of the structure itself. RESULTS: We examined techniques to estimate the quality of a predicted protein structure based on prediction consensus. LGA is used to align the structure in question to the structures for the same protein predicted by different servers. We examine both static (e.g. averaging) and dynamic (e.g. support vector machine) methods for aggregating these distances on two datasets. CONCLUSION: We find that a constrained regression approach shows consistently good performance. Although it is not always the absolute best performing scheme, it always performs on par with the best schemes across multiple datasets. The work presented here provides the basis for the construction of a regression model trained on data from existing structure prediction servers.
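
The static (averaging) form of consensus aggregation mentioned in this abstract can be sketched as follows. This is an illustration only: it assumes pairwise distances between server predictions have already been computed (in the study, after alignment with LGA), and the server names, distance values, and `consensus_scores` function are hypothetical.

```python
from statistics import mean


def consensus_scores(distances):
    """Mean distance from each model to every other model (lower = nearer consensus)."""
    names = sorted(distances)
    return {a: mean(distances[a][b] for b in names if b != a) for a in names}


# Hypothetical symmetric pairwise distances between three server predictions.
pairwise = {
    "server_1": {"server_2": 2.0, "server_3": 4.0},
    "server_2": {"server_1": 2.0, "server_3": 3.0},
    "server_3": {"server_1": 4.0, "server_2": 3.0},
}

scores = consensus_scores(pairwise)
closest_to_consensus = min(scores, key=scores.get)
```

A learned aggregation, such as the regression approach the abstract favors, would replace the plain mean with weights fitted on data from existing servers.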


Subject(s)
Protein Conformation , Algorithms , Computational Biology , Databases, Protein , Proteins/chemistry , Regression Analysis
15.
Biotechnol Bioeng ; 102(6): 1654-69, 2009 Apr 15.
Article in English | MEDLINE | ID: mdl-19132744

ABSTRACT

In the past decade we have witnessed a drastic increase in the productivity of mammalian cell culture-based processes. High-producing cell lines that synthesize and secrete these therapeutics have contributed largely to the advances in process development. To elucidate the productivity trait in the context of physiological functions, the transcriptomes of several NS0 cell lines with a wide range of antibody productivity were compared. Gene set testing (GST) analysis was used to identify pathways and biological functions that are altered in high producers. Three complementary tools for GST, namely gene set enrichment analysis (GSEA), gene set analysis (GSA), and MAPPFinder, were used to identify groups of functionally coherent genes that are up- or downregulated in high producers. Major functional classes identified include those involved in protein processing and transport, such as protein modification, vesicle trafficking, and protein turnover. A significant proportion of genes involved in mitochondrial ribosomal function, cell cycle regulation, and cytoskeleton-related elements are also differentially altered in high producers. The observed correlation of these functional classes with productivity suggests that simultaneous modulation of several physiological functions is a potential route to high productivity.


Subject(s)
Gene Expression Profiling/methods , Gene Expression Regulation , Genes/physiology , Animals , Antibodies/metabolism , Cell Cycle Proteins/genetics , Cell Cycle Proteins/metabolism , Cell Line , Chromatin/genetics , Chromatin/metabolism , Cytoskeletal Proteins/genetics , Cytoskeletal Proteins/metabolism , Down-Regulation , Gene Regulatory Networks/physiology , Golgi Apparatus , Ligases/genetics , Ligases/metabolism , Mice , Mitochondrial Proteins/genetics , Mitochondrial Proteins/metabolism , Models, Biological , Oligonucleotide Array Sequence Analysis , Recombinant Proteins/metabolism , Ribosomal Proteins/genetics , Ribosomal Proteins/metabolism , Statistics, Nonparametric , Transport Vesicles/genetics , Transport Vesicles/metabolism , Up-Regulation
16.
J Chem Inf Model ; 49(10): 2190-201, 2009 Oct.
Article in English | MEDLINE | ID: mdl-19764745

ABSTRACT

In recent years, the development of computational techniques that identify all the likely targets for a given chemical compound, also termed the problem of Target Fishing, has been an active area of research. Identification of likely targets of a chemical compound in the early stages of drug discovery helps to understand issues such as selectivity, off-target pharmacology, and toxicity. In this paper, we present a set of techniques whose goal is to rank or prioritize targets in the context of a given chemical compound so that most targets against which this compound may show activity appear higher in the ranked list. These methods are based on our extensions to the SVM and ranking perceptron algorithms for this problem. Our extensive experimental study shows that the methods developed in this work outperform previous approaches by 2% to 60% under different evaluation criteria.


Subject(s)
Drug Discovery/methods , Animals , Artificial Intelligence , Bayes Theorem , Humans , Ligands , Models, Theoretical
17.
J Chem Inf Model ; 49(11): 2444-56, 2009 Nov.
Article in English | MEDLINE | ID: mdl-19842624

ABSTRACT

Structure-activity relationship (SAR) models are used to inform and to guide the iterative optimization of chemical leads, and they play a fundamental role in modern drug discovery. In this paper, we present a new class of methods for building SAR models, referred to as multi-assay based, that utilize activity information from different targets. These methods first identify a set of targets that are related to the target under consideration, and then they employ various machine learning techniques that utilize activity information from these targets in order to build the desired SAR model. We developed different methods for identifying the set of related targets, which take into account the primary sequence of the targets or the structure of their ligands, and we also developed different machine learning techniques that were derived by using principles of semi-supervised learning, multi-task learning, and classifier ensembles. The comprehensive evaluation of these methods shows that they lead to considerable improvements over the standard SAR models that are based only on the ligands of the target under consideration. On a set of 117 protein targets, obtained from PubChem, these multi-assay-based methods achieve a receiver-operating characteristic score that is, on average, 7.0-7.2% higher than that achieved by the standard SAR models. Moreover, on a set of targets belonging to six protein families, the multi-assay-based methods outperform chemogenomics-based approaches by 4.33%.


Subject(s)
Models, Chemical , Structure-Activity Relationship
18.
Nucleic Acids Res ; 35(21): 7222-36, 2007.
Article in English | MEDLINE | ID: mdl-17959654

ABSTRACT

Streptomyces spp. produce a variety of valuable secondary metabolites, which are regulated in a spatio-temporal manner by a complex network of inter-connected gene products. Using a compilation of genome-scale temporal transcriptome data for the model organism, Streptomyces coelicolor, under different environmental and genetic perturbations, we have developed a supervised machine-learning method for operon prediction in this microorganism. We demonstrate that, using features dependent on transcriptome dynamics and genome sequence, a support vector machines (SVM)-based classification algorithm can accurately classify >90% of gene pairs in a set of known operons. Based on model predictions for the entire genome, we verified the co-transcription of more than 250 gene pairs by RT-PCR. These results vastly increase the database of known operons in S. coelicolor and provide valuable information for exploring gene function and regulation to harness the potential of this differentiating microorganism for synthesis of natural products.


Subject(s)
Gene Expression Regulation, Bacterial , Operon , Streptomyces coelicolor/genetics , Transcription, Genetic , Artificial Intelligence , DNA, Intergenic/analysis , Gene Expression Profiling , Genome, Bacterial , Reproducibility of Results , Reverse Transcriptase Polymerase Chain Reaction , Terminator Regions, Genetic
19.
Trends Biotechnol ; 26(12): 690-9, 2008 Dec.
Article in English | MEDLINE | ID: mdl-18977046

ABSTRACT

Modern biotechnology production plants are equipped with sophisticated control, data logging and archiving systems. These data hold a wealth of information that might shed light on the cause of process outcome fluctuations, whether the outcome of concern is productivity or product quality. These data might also provide clues on means to further improve process outcome. Data-driven knowledge discovery approaches can potentially unveil hidden information, predict process outcome, and provide insights on implementing robust processes. Here we describe the steps involved in process data mining with an emphasis on recent advances in data mining methods pertinent to the unique characteristics of biological process data.


Subject(s)
Cell Physiological Phenomena , Computational Biology/methods , Database Management Systems , Databases, Factual , Information Storage and Retrieval/methods , Monitoring, Physiologic/methods
20.
Proteins ; 72(3): 1005-18, 2008 Aug 15.
Article in English | MEDLINE | ID: mdl-18300251

ABSTRACT

The effectiveness of comparative modeling approaches for protein structure prediction can be substantially improved by incorporating predicted structural information in the initial sequence-structure alignment. Motivated by the approaches used to align protein structures, this article focuses on developing machine learning approaches for estimating the RMSD value of a pair of protein fragments. These estimated fragment-level RMSD values can be used to construct the alignment, assess the quality of an alignment, and identify high-quality alignment segments. We present algorithms to solve this fragment-level RMSD prediction problem using a supervised learning framework based on support vector regression and classification that incorporates protein profiles, predicted secondary structure, effective information encoding schemes, and novel second-order pairwise exponential kernel functions. Our comprehensive empirical study shows superior results compared with the profile-to-profile scoring schemes. We also show that for protein pairs with low sequence similarity (less than 12% sequence identity) these new local structural features alone or in conjunction with profile-based information lead to alignments that are considerably more accurate than those obtained by schemes that use only profile and/or predicted secondary structure information.
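
The quantity these models are trained to estimate, the RMSD of a pair of fragments, can be computed directly when coordinates are known. The sketch below is a simplification under stated assumptions: it takes two equal-length lists of 3D coordinates that are assumed to be already superposed, and it omits the optimal-superposition step that structural comparisons normally apply first. The function name and the toy coordinates are hypothetical.

```python
import math


def rmsd(coords_a, coords_b):
    """Root-mean-square deviation between two equal-length lists of (x, y, z) points."""
    if len(coords_a) != len(coords_b) or not coords_a:
        raise ValueError("fragments must be non-empty and of equal length")
    squared = sum(
        (xa - xb) ** 2 + (ya - yb) ** 2 + (za - zb) ** 2
        for (xa, ya, za), (xb, yb, zb) in zip(coords_a, coords_b)
    )
    return math.sqrt(squared / len(coords_a))


# Hypothetical three-residue fragments, offset by one unit along z.
fragment_a = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (2.0, 0.0, 0.0)]
fragment_b = [(0.0, 0.0, 1.0), (1.0, 0.0, 1.0), (2.0, 0.0, 1.0)]

fragment_rmsd = rmsd(fragment_a, fragment_b)
```

In the setting the abstract describes, this value is the regression target; the models predict it from sequence-derived features when the coordinates are not available.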


Subject(s)
Algorithms , Proteins/chemistry , Sequence Analysis, Protein , Databases, Protein , Sequence Alignment