1.
Bioinformatics ; 36(Suppl_1): i39-i47, 2020 07 01.
Article in English | MEDLINE | ID: mdl-32657370

ABSTRACT

MOTIVATION: The human body hosts more microbial organisms than human cells. Analysis of this microbial diversity provides key insight into the role played by these microorganisms on human health. Metagenomics is the collective DNA sequencing of coexisting microbial organisms in an environmental sample or a host. This has several applications in precision medicine, agriculture, environmental science and forensics. State-of-the-art predictive models for phenotype predictions from metagenomic data rely on alignments, assembly, extensive pruning, taxonomic profiling and reference sequence databases. These processes are time consuming and they do not consider novel microbial sequences when aligned with the reference genome, limiting the potential of whole metagenomics. We formulate the problem of predicting human disease from whole-metagenomic data using Multiple Instance Learning (MIL), a popular supervised learning paradigm. Our proposed alignment-free approach provides higher accuracy in prediction by harnessing the capability of deep convolutional neural network (CNN) within a MIL framework and provides interpretability via neural attention mechanism. RESULTS: The MIL formulation combined with the hierarchical feature extraction capability of deep-CNN provides significantly better predictive performance compared to popular existing approaches. The attention mechanism allows for the identification of groups of sequences that are likely to be correlated to diseases providing the much-needed interpretation. Our proposed approach does not rely on alignment, assembly and reference sequence databases; making it fast and scalable for large-scale metagenomic data. We evaluate our method on well-known large-scale metagenomic studies and show that our proposed approach outperforms comparative state-of-the-art methods for disease prediction. AVAILABILITY AND IMPLEMENTATION: https://github.com/mrahma23/IDMIL.


Subjects
Metagenome; Metagenomics; Algorithms; Databases, Nucleic Acid; Humans; Neural Networks, Computer; Sequence Analysis, DNA
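Illustrative sketch (not taken from the paper): the abstract above describes pooling many sequence reads (instances) from one sample (a bag) and using attention to show which reads matter. The abstract does not specify the attention form, so the snippet below assumes plain softmax attention; the scoring vector `w` and the function name `attention_pool` are hypothetical, and in a real model `w` would be learned and the instance features would come from the CNN.

```python
import math


def attention_pool(instances, w):
    """Pool a bag of instance vectors into one bag vector with softmax attention.

    instances: list of equal-length feature vectors, one per sequence read.
    w: hypothetical scoring vector (would be learned in a trained model).
    Returns (bag_vector, attention_weights).
    """
    # Score each instance with a dot product against the scoring vector.
    scores = [sum(x * wi for x, wi in zip(inst, w)) for inst in instances]
    # Numerically stable softmax over the instance scores.
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    # The bag representation is the attention-weighted sum of the instances.
    dim = len(instances[0])
    bag = [sum(a * inst[d] for a, inst in zip(weights, instances)) for d in range(dim)]
    return bag, weights
```

The returned weights sum to one, so they can be read as each read's relative contribution to the sample-level prediction, which is the kind of interpretability the abstract refers to.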
2.
ACS Omega ; 9(7): 7471-7479, 2024 Feb 20.
Article in English | MEDLINE | ID: mdl-38405499

ABSTRACT

Computational prediction of molecule-protein interactions has been key for developing new molecules to interact with a target protein for therapeutics development. Previous work includes two independent streams of approaches: (1) predicting protein-protein interactions (PPIs) between naturally occurring proteins and (2) predicting binding affinities between proteins and small-molecule ligands [also known as drug-target interaction (DTI)]. Studying the two problems in isolation has limited the ability of these computational models to generalize across the PPI and DTI tasks, both of which ultimately involve noncovalent interactions with a protein target. In this work, we developed Equivariant Graph of Graphs neural Network (EGGNet), a geometric deep learning (GDL) framework, for molecule-protein binding predictions that can handle three types of molecules for interacting with a target protein: (1) small molecules, (2) synthetic peptides, and (3) natural proteins. EGGNet leverages a graph of graphs (GoG) representation constructed from the molecular structures at atomic resolution and utilizes a multiresolution equivariant graph neural network to learn from such representations. In addition, EGGNet leverages the underlying biophysics and makes use of both atom- and residue-level interactions, which improve EGGNet's ability to rank candidate poses from blind docking. EGGNet achieves competitive performance on both a public protein-small-molecule binding affinity prediction task (80.2% top 1 success rate on CASF-2016) and a synthetic protein interface prediction task (88.4% area under the precision-recall curve). We envision that the proposed GDL framework can generalize to many other protein interaction prediction problems, such as binding site prediction and molecular docking, helping accelerate protein engineering and structure-based drug development.

3.
Am J Physiol Gastrointest Liver Physiol ; 302(9): G966-78, 2012 May 01.
Article in English | MEDLINE | ID: mdl-22241860

ABSTRACT

Several studies indicate the importance of colonic microbiota in metabolic and inflammatory disorders and the importance of diet for microbiota composition. The effects of alcohol, one of the prominent components of diet, on colonic bacterial composition are largely unknown. Mounting evidence suggests that gut-derived bacterial endotoxins are cofactors for alcohol-induced tissue injury and organ failure, such as alcoholic liver disease (ALD), that occur in only a subset of alcoholics. We hypothesized that chronic alcohol consumption results in alterations of the gut microbiome in a subgroup of alcoholics, and that this may be responsible for the observed inflammatory state and endotoxemia in alcoholics. We therefore interrogated the mucosa-associated colonic microbiome in 48 alcoholics with and without ALD as well as 18 healthy subjects. Colonic biopsy samples from subjects were analyzed for microbiota composition using length heterogeneity PCR fingerprinting and multitag pyrosequencing. A subgroup of alcoholics has an altered colonic microbiome (dysbiosis). The alcoholics with dysbiosis had lower median abundances of Bacteroidetes and higher median abundances of Proteobacteria. The observed alterations appear to correlate with high levels of serum endotoxin in a subset of the samples. Network topology analysis indicated that alcohol use is correlated with decreased connectivity of the microbial network, and this alteration is seen even after an extended period of sobriety. We show that the colonic mucosa-associated bacterial microbiome is altered in a subset of alcoholics. The altered microbiota composition is persistent and correlates with endotoxemia in a subgroup of alcoholics.


Subjects
Alcoholism/microbiology; Colon/microbiology; Liver Diseases, Alcoholic/microbiology; Metagenome; Adult; Aged; Female; Humans; Middle Aged
4.
PLoS One ; 17(12): e0269509, 2022.
Article in English | MEDLINE | ID: mdl-36584000

ABSTRACT

Opioid overdoses within the United States continue to rise and have been negatively impacting the social and economic status of the country. In order to effectively allocate resources and identify policy solutions to reduce the number of overdoses, it is important to understand the geographical differences in opioid overdose rates and their causes. In this study, we utilized data on emergency department opioid overdose (EDOOD) visits to explore the county-level spatio-temporal distribution of opioid overdose rates within the state of Virginia and their association with aggregate socio-ecological factors. The analyses were performed using a combination of techniques including Moran's I and multilevel modeling. Using data from 2016-2021, we found that Virginia counties had notable differences in their EDOOD visit rates with significant neighborhood-level associations: many counties in the southwestern region were consistently identified as the hotspots (areas with a higher concentration of EDOOD visits) whereas many counties in the northern region were consistently identified as the coldspots (areas with a lower concentration of EDOOD visits). In most Virginia counties, EDOOD visit rates declined from 2017 to 2018. In more recent years (since 2019), the visit rates showed an increasing trend. The multilevel modeling revealed that the change in clinical care factors (i.e., access to care and quality of care) and socio-economic factors (i.e., levels of education, employment, income, family and social support, and community safety) were significantly associated with the change in the EDOOD visit rates. The findings from this study have the potential to assist policymakers in proper resource planning thereby improving health outcomes.


Subjects
Drug Overdose; Opiate Overdose; Humans; United States; Analgesics, Opioid; Emergency Service, Hospital; Drug Overdose/epidemiology; Virginia/epidemiology
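Illustrative sketch (not taken from the paper): the abstract above names Moran's I as one of the techniques used to detect spatial clustering of visit rates. A minimal version of the global statistic, I = (n / W) * sum_ij w_ij (x_i - mean)(x_j - mean) / sum_i (x_i - mean)^2, is shown below. The function name, the toy adjacency matrix, and the example rates are all hypothetical; the study's actual weighting scheme is not stated in the abstract.

```python
def morans_i(values, weights):
    """Global Moran's I for one value per region and a spatial weight matrix.

    values: list of n observations (e.g. a visit rate per county).
    weights: n x n matrix; weights[i][j] > 0 when regions i and j are neighbours.
    """
    n = len(values)
    mean = sum(values) / n
    dev = [v - mean for v in values]
    # Total of all spatial weights (W in the formula).
    w_sum = sum(sum(row) for row in weights)
    # Cross-products of deviations for neighbouring pairs.
    num = sum(weights[i][j] * dev[i] * dev[j] for i in range(n) for j in range(n))
    # Sum of squared deviations.
    den = sum(d * d for d in dev)
    return (n / w_sum) * (num / den)


# Four hypothetical regions arranged in a line: 0 - 1 - 2 - 3.
chain = [
    [0, 1, 0, 0],
    [1, 0, 1, 0],
    [0, 1, 0, 1],
    [0, 0, 1, 0],
]
clustered = morans_i([1, 1, 5, 5], chain)    # similar neighbours, positive I
alternating = morans_i([1, 5, 1, 5], chain)  # dissimilar neighbours, negative I
```

Positive values indicate that neighbouring regions tend to have similar rates (the hotspot and coldspot pattern described above), while negative values indicate that neighbours tend to differ.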
5.
BMC Genomics ; 12 Suppl 2: S8, 2011.
Article in English | MEDLINE | ID: mdl-21989307

ABSTRACT

BACKGROUND: Metagenomic assembly is a challenging problem due to the presence of genetic material from multiple organisms. The problem becomes even more difficult when short reads produced by next-generation sequencing technologies are used. Although whole-genome assemblers are not designed to assemble metagenomic samples, they are being used for metagenomics due to the lack of assemblers capable of dealing with metagenomic samples. We present an evaluation of the assembly of simulated short-read metagenomic samples using a state-of-the-art de Bruijn graph-based assembler. RESULTS: We assembled simulated metagenomic reads from datasets of various complexities using a state-of-the-art de Bruijn graph-based parallel assembler. We also studied the effect of the k-mer size used in the de Bruijn graph on metagenomic assembly and developed a clustering solution to pool the contigs obtained from different assembly runs, which allowed us to obtain longer contigs. We also assessed the degree of chimericity of the assembled contigs using an entropy/impurity metric and compared the metagenomic assemblies to assemblies of isolated individual source genomes. CONCLUSIONS: Our results show that the accuracy of the assembled contigs was better than expected for the metagenomic samples with a few dominant organisms and was especially poor in samples containing many closely related strains. Clustering contigs from different k-mer parameters of the de Bruijn graph allowed us to obtain longer contigs; however, the clustering resulted in an accumulation of erroneous contigs, thus increasing the error rate in the clustered contigs.


Subjects
Contig Mapping; Genome, Bacterial; Metagenome; Sequence Analysis, DNA/methods; Software; Algorithms; Computational Biology; Computer Graphics/instrumentation; Computer Simulation; Databases, Nucleic Acid; Entropy; Escherichia coli/genetics; Phylogeny; Sequence Alignment
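Illustrative sketch (not taken from the paper): the abstract above studies how the k-mer size affects de Bruijn graph assembly. The snippet below shows only the graph-construction step in its textbook form, where each k-mer contributes an edge from its (k-1)-length prefix to its (k-1)-length suffix. The function name and the toy read are hypothetical, and a production assembler adds error correction, coverage tracking, and graph simplification that are omitted here.

```python
from collections import defaultdict


def de_bruijn_edges(reads, k):
    """Build a de Bruijn graph as {prefix (k-1)-mer: set of suffix (k-1)-mers}."""
    graph = defaultdict(set)
    for read in reads:
        # Slide a window of length k across the read.
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            # Each k-mer is an edge from its prefix to its suffix.
            graph[kmer[:-1]].add(kmer[1:])
    return dict(graph)


# Hypothetical toy read; k = 3 yields the k-mers ACG and CGT.
graph = de_bruijn_edges(["ACGT"], 3)
```

A smaller k gives more overlap between reads but collapses repeats shared across organisms, while a larger k separates them at the cost of connectivity, which is the trade-off the abstract evaluates.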
6.
Chem Biodivers ; 7(5): 1040-50, 2010 May.
Article in English | MEDLINE | ID: mdl-20491063

ABSTRACT

In this article, we used a network-based approach to characterize the microflora abundance in colonic mucosal samples and to correlate potential interactions between the identified species with respect to the healthy and diseased states. We analyzed the modelled network by computing several local and global network statistics, identified recurring patterns or motifs, and fit the network models to a family of well-studied graph models. This study has demonstrated, for the first time, an approach that differentiates the gut microbiota of alcoholic subjects and healthy subjects using topological network analysis of the gut microbiome.


Subjects
Colon/microbiology; Metagenome; Models, Biological; Humans; Metabolic Networks and Pathways
7.
Chem Biodivers ; 7(5): 1076-85, 2010 May.
Article in English | MEDLINE | ID: mdl-20491067

ABSTRACT

We have been using the Roche GS-FLX sequencing platform to produce tens of thousands of sequencing reads from samples of both bacterial communities (microbiome) and fungal communities (mycobiome) of stool, gut mucosa, vaginal washes, and oral washes from a large number of subjects. This vast volume of data from diverse sources has necessitated the development of an analysis pipeline in order to systematically and rapidly identify the taxa within the samples and to correlate the sample data with clinical and environmental features. Specifically, we have developed automated analytical tools for data tracking, taxonomical analysis, and feature clustering of bacteria in the human microbiome, and we demonstrate the pipeline using Cervical Vaginal Lavage (CVL) samples. This analysis pipeline will not only provide insight into our specific CVL dataset, but is also applicable to other microbiome samples and will ultimately broaden our understanding of how the microbiome influences human health.


Subjects
Bacteria/classification; Metagenome; Sequence Analysis, DNA; Vaginosis, Bacterial/microbiology; Bacteria/genetics; Bacteria/isolation & purification; Cervix Uteri; Cluster Analysis; Female; HIV Infections/complications; Humans; RNA, Ribosomal, 16S/genetics; Therapeutic Irrigation; Vaginosis, Bacterial/complications
8.
Article in English | MEDLINE | ID: mdl-28981422

ABSTRACT

The recent advent of Metagenome Wide Association Studies (MGWAS) provides insight into the role of microbes on human health and disease. However, the studies present several computational challenges. In this paper, we demonstrate a novel, efficient, and effective Multiple Instance Learning (MIL) based computational pipeline to predict patient phenotype from metagenomic data. MIL methods have the advantage that besides predicting the clinical phenotype, we can infer the instance level label or role of microbial sequence reads in the specific disease. Specifically, we use a Bag of Words method, which has been shown to be one of the most effective and efficient MIL methods. This involves assembly of the metagenomic sequence data, clustering of the assembled contigs, extracting features from the contigs, and using an SVM classifier to predict patient labels and identify the most relevant sequence clusters. With the exception of the given labels for the patients, this entire process is de novo (unsupervised). We call our pipeline "CAMIL", which stands for Clustering and Assembly with Multiple Instance Learning. We use multiple state-of-the-art clustering methods for feature extraction, evaluation, and comparison of the performance of our proposed approach for each of these clustering methods. We also present a fast and scalable pre-clustering algorithm as a preprocessing step for our proposed pipeline. Our approach achieves efficiency by partitioning the large number of sequence reads into groups (called canopies) using locality sensitive hashing (LSH). These canopies are then refined by using state-of-the-art sequence clustering algorithms. We use data from a well-known MGWAS study of patients with Type-2 Diabetes and show that our pipeline significantly outperforms the classifier used in that paper, as well as other common MIL methods.


Subjects
Machine Learning; Metagenome/genetics; Metagenomics/methods; Phenotype; Cluster Analysis; Humans
9.
BMC Bioinformatics ; 10: 439, 2009 Dec 22.
Article in English | MEDLINE | ID: mdl-20028521

ABSTRACT

BACKGROUND: Over the last decade several prediction methods have been developed for determining the structural and functional properties of individual protein residues using sequence and sequence-derived information. Most of these methods are based on support vector machines as they provide accurate and generalizable prediction models. RESULTS: We present a general purpose protein residue annotation toolkit (svmPRAT) to allow biologists to formulate residue-wise prediction problems. svmPRAT formulates the annotation problem as a classification or regression problem using support vector machines. One of the key features of svmPRAT is its ease of use in incorporating any user-provided information in the form of feature matrices. For every residue, svmPRAT captures local information around the residue to create fixed-length feature vectors. svmPRAT implements accurate and fast kernel functions, and also introduces a flexible window-based encoding scheme that accurately captures signals and patterns for training effective predictive models. CONCLUSIONS: In this work we evaluate svmPRAT on several classification and regression problems including disorder prediction, residue-wise contact order estimation, DNA-binding site prediction, and local structure alphabet prediction. svmPRAT has also been used for the development of a state-of-the-art transmembrane helix prediction method called TOPTMH and a secondary structure prediction method called YASSPP. The toolkit provides practitioners with an efficient and easy-to-use tool for a wide variety of annotation problems. AVAILABILITY: http://www.cs.gmu.edu/~mlbio/svmprat.


Subjects
Proteins/chemistry; Sequence Analysis, Protein/methods; Software; Artificial Intelligence; Binding Sites; Databases, Protein; Pattern Recognition, Automated; Protein Folding; Protein Structure, Secondary
10.
J Chem Inf Model ; 49(11): 2444-56, 2009 Nov.
Article in English | MEDLINE | ID: mdl-19842624

ABSTRACT

Structure-activity relationship (SAR) models are used to inform and to guide the iterative optimization of chemical leads, and they play a fundamental role in modern drug discovery. In this paper, we present a new class of methods for building SAR models, referred to as multi-assay based, that utilize activity information from different targets. These methods first identify a set of targets that are related to the target under consideration, and then they employ various machine learning techniques that utilize activity information from these targets in order to build the desired SAR model. We developed different methods for identifying the set of related targets, which take into account the primary sequence of the targets or the structure of their ligands, and we also developed different machine learning techniques that were derived by using principles of semi-supervised learning, multi-task learning, and classifier ensembles. The comprehensive evaluation of these methods shows that they lead to considerable improvements over the standard SAR models that are based only on the ligands of the target under consideration. On a set of 117 protein targets, obtained from PubChem, these multi-assay-based methods achieve a receiver-operating characteristic score that is, on average, 7.0-7.2% higher than that achieved by the standard SAR models. Moreover, on a set of targets belonging to six protein families, the multi-assay-based methods outperform chemogenomics-based approaches by 4.33%.


Subjects
Models, Chemical; Structure-Activity Relationship
11.
Proteins ; 72(3): 1005-18, 2008 Aug 15.
Article in English | MEDLINE | ID: mdl-18300251

ABSTRACT

The effectiveness of comparative modeling approaches for protein structure prediction can be substantially improved by incorporating predicted structural information in the initial sequence-structure alignment. Motivated by the approaches used to align protein structures, this article focuses on developing machine learning approaches for estimating the RMSD value of a pair of protein fragments. These estimated fragment-level RMSD values can be used to construct the alignment, assess the quality of an alignment, and identify high-quality alignment segments. We present algorithms to solve this fragment-level RMSD prediction problem using a supervised learning framework based on support vector regression and classification that incorporates protein profiles, predicted secondary structure, effective information encoding schemes, and novel second-order pairwise exponential kernel functions. Our comprehensive empirical study shows superior results compared with the profile-to-profile scoring schemes. We also show that for protein pairs with low sequence similarity (less than 12% sequence identity) these new local structural features, alone or in conjunction with profile-based information, lead to alignments that are considerably more accurate than those obtained by schemes that use only profile and/or predicted secondary structure information.


Subjects
Algorithms; Proteins/chemistry; Sequence Analysis, Protein; Databases, Protein; Sequence Alignment
12.
Bioinformatics ; 23(2): e17-23, 2007 Jan 15.
Article in English | MEDLINE | ID: mdl-17237087

ABSTRACT

MOTIVATION: Protein sequence alignment plays a critical role in computational biology as it is an integral part in many analysis tasks designed to solve problems in comparative genomics, structure and function prediction, and homology modeling. METHODS: We have developed novel sequence alignment algorithms that compute the alignment between a pair of sequences based on short fixed- or variable-length high-scoring subsequences. Our algorithms build the alignments by repeatedly selecting the highest scoring pairs of subsequences and using them to construct small portions of the final alignment. We utilize PSI-BLAST generated sequence profiles and employ a profile-to-profile scoring scheme derived from PICASSO. RESULTS: We evaluated the performance of the computed alignments on two recently published benchmark datasets and compared them against the alignments computed by existing state-of-the-art dynamic programming-based profile-to-profile local and global sequence alignment algorithms. Our results show that the new algorithms achieve alignments that are comparable with or better than those achieved by existing algorithms. Moreover, our results also showed that these algorithms can be used to provide better information as to which of the aligned positions are more reliable--a critical piece of information for comparative modeling applications.


Subjects
Algorithms; Proteins/chemistry; Sequence Alignment/methods; Sequence Analysis, Protein/methods; Amino Acid Sequence; Molecular Sequence Data
13.
J Bioinform Comput Biol ; 15(6): 1740006, 2017 Dec.
Article in English | MEDLINE | ID: mdl-29113561

ABSTRACT

Metagenomics is the collective sequencing of co-existing microbial communities, which are ubiquitous across various clinical and ecological environments. Due to the large volume and random short sequences (reads) obtained from community sequences, analysis of the diversity, abundance and functions of different organisms within these communities is a challenging task. We present a fast and scalable clustering algorithm for analyzing large-scale metagenome sequence data. Our approach achieves efficiency by partitioning the large number of sequence reads into groups (called canopies) using hashing. These canopies are then refined by using state-of-the-art sequence clustering algorithms. This canopy-clustering (CC) algorithm can be used as a pre-processing phase for computationally expensive clustering algorithms. We use and compare three hashing schemes for canopy construction with five popular and state-of-the-art sequence clustering methods. We evaluate our clustering algorithm on synthetic and real-world 16S and whole metagenome benchmarks. We demonstrate the ability of our proposed approach to determine meaningful Operational Taxonomic Units (OTU) and observe significant speedup with regard to run time when compared to different clustering algorithms. We also make our source code publicly available on GitHub.


Subjects
Algorithms; Biodiversity; Metagenome; Metagenomics/methods; Cluster Analysis; Databases, Factual; Gastrointestinal Microbiome/genetics; Humans; Liver Cirrhosis/microbiology; Microbiota; Phylogeny; RNA, Ribosomal, 16S; RNA, Ribosomal, 18S; Sequence Analysis, RNA/methods; Soil Microbiology
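Illustrative sketch (not taken from the paper): the abstract above describes partitioning reads into coarse groups (canopies) by hashing before running an expensive clustering algorithm inside each group. The abstract does not say which three hashing schemes were compared, so the snippet below assumes a single min-hash over k-mers as the bucket key. The function names, the default `k`, and the use of MD5 as the hash are all assumptions made for illustration, and reads are assumed to be at least `k` characters long.

```python
import hashlib
from collections import defaultdict


def min_kmer_hash(read, k):
    """Return the smallest hash value over all k-mers of a read (a one-hash MinHash)."""
    return min(
        int(hashlib.md5(read[i:i + k].encode()).hexdigest(), 16)
        for i in range(len(read) - k + 1)
    )


def build_canopies(reads, k=4):
    """Group reads whose k-mer sets share the same minimum hash value."""
    canopies = defaultdict(list)
    for read in reads:
        canopies[min_kmer_hash(read, k)].append(read)
    # Each canopy can then be refined by a slower, more accurate clustering method.
    return list(canopies.values())
```

Reads with heavily overlapping k-mer sets are likely to share a minimum hash and so land in the same canopy, which keeps the later pairwise comparisons confined to small groups.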
14.
BMC Bioinformatics ; 7: 455, 2006 Oct 16.
Article in English | MEDLINE | ID: mdl-17042943

ABSTRACT

BACKGROUND: Protein remote homology detection and fold recognition are central problems in computational biology. Supervised learning algorithms based on support vector machines are currently one of the most effective methods for solving these problems. These methods are primarily used to solve binary classification problems and they have not been extensively used to solve the more general multiclass remote homology prediction and fold recognition problems. RESULTS: We present a comprehensive evaluation of a number of methods for building SVM-based multiclass classification schemes in the context of the SCOP protein classification. These methods include schemes that directly build an SVM-based multiclass model, schemes that employ a second-level learning approach to combine the predictions generated by a set of binary SVM-based classifiers, and schemes that build and combine binary classifiers for various levels of the SCOP hierarchy beyond those defining the target classes. CONCLUSION: Analyzing the performance achieved by the different approaches on four different datasets we show that most of the proposed multiclass SVM-based classification approaches are quite effective in solving the remote homology prediction and fold recognition problems and that the schemes that use predictions from binary models constructed for ancestral categories within the SCOP hierarchy tend to not only lead to lower error rates but also reduce the number of errors in which a superfamily is assigned to an entirely different fold and a fold is predicted as being from a different SCOP class. Our results also show that the limited size of the training data makes it hard to learn complex second-level models, and that models of moderate complexity lead to consistently better results.


Subjects
Algorithms; Protein Folding; Proteins/chemistry; Proteins/classification; Sequence Analysis, Protein/classification; Sequence Homology, Amino Acid; Protein Structure, Secondary
15.
Annu Int Conf IEEE Eng Med Biol Soc ; 2016: 3219-3222, 2016 Aug.
Article in English | MEDLINE | ID: mdl-28268993

ABSTRACT

Advancements in multiarticulate upper-limb prosthetics have outpaced the development of intuitive, non-invasive control mechanisms for implementing them. Surface electromyography is currently the most popular non-invasive control method, but presents a number of drawbacks including poor deep-muscle specificity. Previous research established the viability of ultrasound imaging as an alternative means of decoding movement intent, and demonstrated the ability to distinguish between complex grasps in able-bodied subjects via imaging of the anterior forearm musculature. In order to translate this work to clinical viability, able-bodied testing is insufficient. Amputation-induced changes in muscular geometry, dynamics, and imaging characteristics are all likely to influence the effectiveness of our existing techniques. In this work, we conducted preliminary trials with a transradial amputee participant to assess these effects, and potentially elucidate necessary refinements to our approach. Two trials were performed, the first using a set of three motion types, and the second using four. After a brief training period in each trial, the participant was able to control a virtual prosthetic hand in real-time; attempted grasps were successfully classified with a rate of 77% in trial 1, and 71% in trial 2. While the results are sub-optimal compared to our previous able-bodied testing, they are a promising step forward. More importantly, the data collected during these trials can provide valuable information for refining our image processing methods, especially via comparison to previously acquired data from able-bodied individuals. Ultimately, further work with amputees is a necessity for translation towards clinical application.


Subjects
Amputees; Artificial Limbs; Computer Systems; Ultrasonography/methods; Electromyography; Humans; Image Processing, Computer-Assisted; Movement
16.
IEEE Trans Biomed Eng ; 63(8): 1687-98, 2016 08.
Article in English | MEDLINE | ID: mdl-26560865

ABSTRACT

Surface electromyography (sEMG) has been the predominant method for sensing electrical activity for a number of applications involving muscle-computer interfaces, including myoelectric control of prostheses and rehabilitation robots. Ultrasound imaging for sensing mechanical deformation of functional muscle compartments can overcome several limitations of sEMG, including the inability to differentiate between deep contiguous muscle compartments, low signal-to-noise ratio, and lack of a robust graded signal. The objective of this study was to evaluate the feasibility of real-time graded control using a computationally efficient method to differentiate between complex hand motions based on ultrasound imaging of forearm muscles. Dynamic ultrasound images of the forearm muscles were obtained from six able-bodied volunteers and analyzed to map muscle activity based on the deformation of the contracting muscles during different hand motions. Each participant performed 15 different hand motions, including digit flexion, different grips (i.e., power grasp and pinch grip), and grips in combination with wrist pronation. During the training phase, we generated a database of activity patterns corresponding to different hand motions for each participant. During the testing phase, novel activity patterns were classified using a nearest neighbor classification algorithm based on that database. The average classification accuracy was 91%. Real-time image-based control of a virtual hand showed an average classification accuracy of 92%. Our results demonstrate the feasibility of using ultrasound imaging as a robust muscle-computer interface. Potential clinical applications include control of multiarticulated prosthetic hands, stroke rehabilitation, and fundamental investigations of motor control and biomechanics.


Subjects
Forearm/physiology; Hand/physiology; Image Processing, Computer-Assisted/methods; Muscle, Skeletal/physiology; Ultrasonography/methods; Algorithms; Female; Hand Strength/physiology; Humans; Male; Movement/physiology
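Illustrative sketch (not taken from the paper): the abstract above describes a training phase that stores activity patterns per hand motion and a testing phase that labels new patterns with a nearest neighbor classifier. The snippet below shows that idea in its simplest form. The Euclidean distance, the function name, and the two-element toy patterns are assumptions; the abstract does not state which distance measure or feature representation was used.

```python
import math


def nearest_neighbor(database, query):
    """Return the label of the stored pattern closest to the query.

    database: list of (label, pattern) pairs recorded during training.
    query: a new activity pattern with the same length as the stored ones.
    """
    best_label, best_dist = None, math.inf
    for label, pattern in database:
        # Euclidean distance is an assumption made for this sketch.
        dist = math.dist(pattern, query)
        if dist < best_dist:
            best_label, best_dist = label, dist
    return best_label


# Hypothetical toy database with one stored pattern per motion.
database = [
    ("power grasp", [1.0, 0.0]),
    ("pinch grip", [0.0, 1.0]),
]
predicted = nearest_neighbor(database, [0.9, 0.2])
```

Because classification only requires distance computations against a stored database, this style of classifier is cheap enough for the real-time control setting the abstract reports.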
17.
Article in English | MEDLINE | ID: mdl-26357091

ABSTRACT

High-throughput experimental techniques provide a wide variety of heterogeneous proteomic data sources. To exploit the information spread across multiple sources for protein function prediction, these data sources are transformed into kernels and then integrated into a composite kernel. Several methods first optimize the weights on these kernels to produce a composite kernel, and then train a classifier on the composite kernel. As such, these approaches result in an optimal composite kernel, but not necessarily in an optimal classifier. On the other hand, some approaches optimize the loss of binary classifiers and learn weights for the different kernels iteratively. For multi-class or multi-label data, these methods have to solve the problem of optimizing weights on these kernels for each of the labels, which is computationally expensive and ignores the correlation among labels. In this paper, we propose a method called Predicting Protein Function using Multiple Kernels (ProMK). ProMK iteratively optimizes the two phases, learning optimal weights and reducing the empirical loss of a multi-label classifier, for all of the labels simultaneously. ProMK can integrate kernels selectively and downgrade the weights on noisy kernels. We investigate the performance of ProMK on several publicly available protein function prediction benchmarks and synthetic datasets. We show that the proposed approach performs better than previously proposed protein function prediction approaches that integrate multiple data sources and multi-label multiple kernel learning methods. The code for our proposed method is available at https://sites.google.com/site/guoxian85/promk.


Subjects
Algorithms; Proteins/chemistry; Proteins/classification; Proteomics/methods; Animals; Humans; Mice; Protein Interaction Maps; Sequence Analysis, Protein; Yeasts/genetics
18.
Int J Bioinform Res Appl ; 11(2): 111-29, 2015.
Article in English | MEDLINE | ID: mdl-25786791

ABSTRACT

The human gut is one of the most densely populated microbial communities in the world. The interaction of microbes with human host cells is responsible for several disease conditions and is critical to human health. It is imperative to understand the relationships between these microbial communities within the human gut and their roles in disease. In this study we analyse the microbial communities within the human gut and their role in Inflammatory Bowel Disease (IBD). The bacterial communities were interrogated using Length Heterogeneity PCR (LH-PCR) fingerprinting of mucosal and luminal associated microbial communities for a class of healthy and diseased patients.


Subjects
Bacteria/genetics; Bacteria/isolation & purification; Inflammatory Bowel Diseases/microbiology; Intestinal Mucosa/microbiology; Pattern Recognition, Automated/methods; Polymerase Chain Reaction/methods; Algorithms; Humans; Microbiota/genetics; Reproducibility of Results; Sensitivity and Specificity; Support Vector Machine
19.
Article in English | MEDLINE | ID: mdl-26357046

ABSTRACT

Classification problems in which several learning tasks are organized hierarchically pose a special challenge because the hierarchical structure of the problems needs to be considered. Multi-task learning (MTL) provides a framework for dealing with such interrelated learning tasks. When two different hierarchical sources organize similar information, in principle, this combined knowledge can be exploited to further improve classification performance. We have studied this problem in the context of protein structure classification by integrating the learning process for two hierarchical protein structure classification databases, SCOP and CATH. Our goal is to accurately predict whether a given protein belongs to a particular class in these hierarchies using only the amino acid sequences. We have utilized recent developments in multi-task learning to solve the interrelated classification problems. We have also evaluated how the various relationships between tasks affect the classification performance. Our evaluations show that learning schemes in which both classification databases are used outperform the schemes that utilize only one of them.


Subjects
Computational Biology/methods; Proteins/chemistry; Proteins/classification; Sequence Analysis, Protein/methods; Amino Acid Sequence; Artificial Intelligence; Databases, Protein
20.
Article in English | MEDLINE | ID: mdl-26356025

ABSTRACT

Automated protein function prediction is one of the grand challenges in computational biology. Multi-label learning is widely used to predict functions of proteins. Most multi-label learning methods make predictions for unlabeled proteins under the assumption that the labeled proteins are completely annotated, i.e., without any missing functions. However, in practice, we may have only a subset of the ground-truth functions for a protein, and whether the protein has other functions is unknown. To predict protein functions with incomplete annotations, we propose a Protein Function Prediction method with Weak-label Learning (ProWL) and its variant ProWL-IF. Both ProWL and ProWL-IF can replenish the missing functions of proteins. In addition, ProWL-IF makes use of the knowledge that a protein cannot have certain functions, which can further boost the performance of protein function prediction. Our experimental results on protein-protein interaction networks and gene expression benchmarks validate the effectiveness of both ProWL and ProWL-IF.


Subjects
Computational Biology/methods; Models, Statistical; Proteins/classification; Proteins/metabolism; Molecular Sequence Annotation/methods; Protein Interaction Maps/genetics; Proteins/genetics; Transcriptome/genetics