1.
PLoS One ; 19(1): e0296627, 2024.
Article in English | MEDLINE | ID: mdl-38241279

ABSTRACT

Machine learning has been shown to be effective at identifying distinctive genomic signatures among viral sequences. These signatures are defined as pervasive motifs in the viral genome that allow discrimination between species or variants. In the context of SARS-CoV-2, the identification of these signatures can assist in taxonomic and phylogenetic studies, improve the recognition and definition of emerging variants, and aid in the characterization of functional properties of polymorphic gene products. In this paper, we assess KEVOLVE, an approach based on a genetic algorithm with a machine-learning kernel, to identify multiple genomic signatures based on minimal sets of k-mers. In a comparative study, in which we analyzed a large SARS-CoV-2 genome dataset, KEVOLVE was more effective at identifying variant-discriminative signatures than several gold-standard statistical tools. Subsequently, these signatures were characterized using a new extension of KEVOLVE (KANALYZER) to highlight variations of the discriminative signatures among different classes of variants, their genomic location, and the mutations involved. The majority of identified signatures were associated with known mutations among the different variants, in terms of functional and pathological impact based on available literature. Here we showed that KEVOLVE is a robust machine learning approach to identify discriminative signatures among SARS-CoV-2 variants, which are frequently also biologically relevant, while bypassing multiple sequence alignments. The source code of the method and additional resources are available at: https://github.com/bioinfoUQAM/KEVOLVE.
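The idea of a k-mer signature that discriminates between variants can be illustrated with a toy sketch. This is not KEVOLVE's genetic algorithm: it only keeps k-mers that occur in every sequence of one class and in no sequence of any other class. The function names (`kmers`, `class_specific_kmers`) and the sequences are hypothetical, invented for illustration.

```python
from typing import Dict, List, Set


def kmers(seq: str, k: int) -> Set[str]:
    """Return the set of all length-k substrings of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def class_specific_kmers(groups: Dict[str, List[str]], k: int) -> Dict[str, Set[str]]:
    """For each class, keep the k-mers present in every sequence of that class
    and absent from every sequence of all other classes.
    Assumes each class holds at least one sequence."""
    # k-mers shared by all sequences of a class
    shared = {
        label: set.intersection(*(kmers(s, k) for s in seqs))
        for label, seqs in groups.items()
    }
    # k-mers seen in at least one sequence of a class
    seen = {
        label: set.union(*(kmers(s, k) for s in seqs))
        for label, seqs in groups.items()
    }
    result: Dict[str, Set[str]] = {}
    for label in groups:
        others: Set[str] = set()
        for other, ks in seen.items():
            if other != label:
                others |= ks
        result[label] = shared[label] - others
    return result


# Toy sequences, invented for illustration only.
toy = {
    "variant_A": ["ACGTACGTTA", "ACGTACGTTC"],
    "variant_B": ["ACGAACGTTA", "ACGAACGTTG"],
}
signatures = class_specific_kmers(toy, k=4)
```

On real genomes such strict signatures are rare, which is presumably why a search procedure over k-mer subsets is needed; no alignment is required at any point.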


Subjects
COVID-19 , SARS-CoV-2 , Humans , SARS-CoV-2/genetics , Phylogeny , COVID-19/diagnosis , COVID-19/genetics , Genomics , Machine Learning
2.
Bioinformatics ; 38(16): 3984-3991, 2022 08 10.
Article in English | MEDLINE | ID: mdl-35762945

ABSTRACT

MOTIVATION: Precise identification of Biosynthetic Gene Clusters (BGCs) is a challenging task. Performance of BGC discovery tools is limited by their capacity to accurately predict components belonging to candidate BGCs, often overestimating cluster boundaries. To support optimizing the composition and boundaries of candidate BGCs, we propose a reinforcement learning approach relying on protein domains and functional annotations from expert-curated BGCs. RESULTS: The proposed reinforcement learning method aims to improve candidate BGCs obtained with state-of-the-art tools. It was evaluated on candidate BGCs obtained for two fungal genomes, Aspergillus niger and Aspergillus nidulans. The results highlight an improvement in gene precision of more than 15% for TOUCAN, fungiSMASH and DeepBGC, and in cluster precision of more than 25% for fungiSMASH and DeepBGC, allowing these tools to obtain almost perfect precision in cluster prediction. This can pave the way for optimizing current prediction of candidate BGCs in fungi, while minimizing the curation effort required by domain experts. AVAILABILITY AND IMPLEMENTATION: https://github.com/bioinfoUQAM/RL-bgc-components. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.


Subjects
Fungi , Multigene Family , Fungi/genetics , Fungal Genome , Biosynthetic Pathways/genetics
4.
NAR Genom Bioinform ; 2(4): lqaa098, 2020 Dec.
Article in English | MEDLINE | ID: mdl-33575642

ABSTRACT

Fungal secondary metabolites (SMs) are an important source of numerous bioactive compounds widely applied in the pharmaceutical industry, as in the production of antibiotics and anticancer medications. The discovery of novel fungal SMs can potentially benefit human health. Identifying biosynthetic gene clusters (BGCs) involved in the biosynthesis of SMs can be a costly and complex task, especially due to the genomic diversity of fungal BGCs. Previous studies on fungal BGC discovery have limited scope and can restrict the discovery of new BGCs. In this work, we introduce TOUCAN, a supervised learning framework for fungal BGC discovery. Unlike previous methods, TOUCAN is capable of predicting BGCs on amino acid sequences, facilitating its use on newly sequenced and not yet curated data. It relies on three main pillars: rigorous selection of datasets by BGC experts; combination of functional, evolutionary and compositional features coupled with outperforming classifiers; and robust post-processing methods. TOUCAN's best-performing model yields a 0.982 F-measure on BGC regions in the Aspergillus niger genome. Overall results show that TOUCAN outperforms previous approaches. TOUCAN focuses on fungal BGCs but can be easily adapted to expand its scope to process other species or include new features.

6.
J Comput Biol ; 26(6): 519-535, 2019 06.
Article in English | MEDLINE | ID: mdl-31050550

ABSTRACT

The classification of pathogens in emerging and re-emerging viruses is of major interest in taxonomic studies, functional genomics, host-pathogen interplay, prevention, and disease treatments. It consists of assigning a given sequence to its related group of known sequences sharing similar characteristics and traits. The challenges to such classification could be associated with several virus properties including recombination, mutation rate, multiplicity of motifs, and diversity. In domains such as pathogen monitoring and surveillance, it is important to detect and quantify known and novel taxa without exploiting full and accurate alignments or virus family profiles. In this study, we propose an alignment-free method, CASTOR-KRFE, to detect discriminating subsequences within known pathogen sequences in order to accurately classify unknown pathogen sequences. This method includes three major steps: (1) vectorization of known viral genomic sequences based on k-mers to constitute the potential features, (2) efficient pattern extraction and evaluation maximizing classification performance, and (3) prediction of the minimal set of features fitting a given criterion (threshold of performance metric and maximum number of features). We assessed this method through a jackknife data partitioning on a dozen virus data sets, covering the seven major virus groups and including influenza virus, Ebola virus, human immunodeficiency virus 1, hepatitis C virus, hepatitis B virus, and human papillomavirus. CASTOR-KRFE provides a weighted average F-measure >0.96 over a wide range of viruses. Our method also shows better performance on complex virus data sets than multiple subsequences extractor for classification (MISSEL), a subsequence extraction method, and the Discriminative mode of the MEME pattern extraction tool.
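Step (1) above, k-mer vectorization, can be sketched as follows. This is a minimal illustration and not the CASTOR-KRFE implementation: the function name `kmer_vector`, the overlapping-window counting, and the fixed `ACGT` alphabet are assumptions. Steps (2) and (3) are not covered.

```python
from collections import Counter
from itertools import product
from typing import List


def kmer_vector(seq: str, k: int, alphabet: str = "ACGT") -> List[int]:
    """Count every possible k-mer over `alphabet` in `seq`, using overlapping windows.

    The vector has len(alphabet) ** k entries, ordered lexicographically
    with respect to the order of `alphabet`.
    """
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    vocabulary = ("".join(p) for p in product(alphabet, repeat=k))
    # Counter returns 0 for k-mers that never occur
    return [counts[kmer] for kmer in vocabulary]


# Toy sequence, invented for illustration only.
vec = kmer_vector("ACGTAC", k=2)
```

Each sequence becomes a fixed-length vector regardless of its length, which is what lets a standard classifier consume genomes without aligning them. Windows containing characters outside the alphabet (for example `N`) are simply ignored by this sketch.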


Subjects
Viral Genome/genetics , DNA Sequence Analysis/methods , Viruses/genetics , Algorithms , Genomics/methods , Humans , Sequence Alignment/methods
7.
Article in English | MEDLINE | ID: mdl-29994265

ABSTRACT

This paper introduces a method for automatic workflow extraction from texts using Process-Oriented Case-Based Reasoning (POCBR). While current workflow management systems mostly implement complicated graphical tasks based on advanced distributed solutions (e.g. cloud computing and grid computation), workflow knowledge acquisition from texts using case-based reasoning yields more expressive and semantic case representations. In this context, we propose an ontology-based workflow extraction framework to acquire processual knowledge from texts. Our methodology extends classic NLP techniques to extract and disambiguate tasks and relations in texts. Using a graph-based representation of workflows and a domain ontology, our extraction process uses a context-aware approach to recognize workflow components: data and control flows. We applied our framework in a technical domain in bioinformatics, namely phylogenetic analyses. An evaluation based on workflow semantic similarities on a gold standard shows that our approach provides promising results in the process extraction domain. Both data and implementation of our framework are available at: http://labo.bioinfo.uqam.ca/tgrowler.

8.
Infect Genet Evol ; 62: 141-150, 2018 08.
Article in English | MEDLINE | ID: mdl-29678797

ABSTRACT

Pregnancy is associated with modulations of maternal immunity that contribute to foeto-maternal tolerance. To understand whether and how these alterations impact antiviral immunity, a detailed cross-sectional analysis of selective pressures exerted on HIV-1 envelope amino-acid sequences was performed in a group of pregnant (n = 32) and non-pregnant (n = 44) HIV-infected women in the absence of antiretroviral therapy (ART). Independent of HIV-1 subtype, p-distance, dN and dS were all strongly correlated with one another but were not significantly different in pregnant as compared to non-pregnant patients. Differential levels of selective pressure applied on different Env subdomains displayed similar yet non-identical patterns between the two groups, with pressure being significantly lower in constant regions C1 and C2 than in V1, V2, V3 and C3. To draw a general picture of the selection applied on the envelope and compensate for inter-individual variations, we performed a binomial test on selection frequency data pooled from pregnant and non-pregnant women. This analysis uncovered 42 positions, present in both groups, exhibiting a statistically significant frequency of selection that invariably mapped to the surface of the Env protein, with the great majority located within epitopes recognized by Env-specific antibodies or sites associated with the development of cross-reactive neutralizing activity. The median frequency of occurrence of positive selection per site was significantly lower in pregnant versus non-pregnant women. Furthermore, examination of the distribution of positively selected sites using a hypergeometric test revealed that only 2 positions (D137 and S142) significantly differed between the 2 groups. Taken together, these results indicate that pregnancy is associated with subtle yet distinctive changes in selective pressures exerted on the HIV-1 Env protein that are compatible with transient modulations of maternal immunity.
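The hypergeometric test mentioned above can be sketched in a few lines. The abstract does not say how the counts were set up, so the mapping below is an assumption: among N patients in total, K show positive selection at a given site, and the question is how surprising it is to see at least k of them within one group of size n. The function name `hypergeom_upper_tail` is hypothetical.

```python
from math import comb


def hypergeom_upper_tail(k: int, N: int, K: int, n: int) -> float:
    """P(X >= k) when drawing n items without replacement from N items,
    K of which are 'successes' (here: patients with selection at the site)."""
    top = min(K, n)          # X cannot exceed either the successes or the draws
    total = comb(N, n)       # number of equally likely draws
    favourable = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, top + 1))
    return favourable / total


# Invented numbers, for illustration only: 76 patients, 10 with selection
# at a site, 32 patients in the group of interest, 8 of the 10 in that group.
p_value = hypergeom_upper_tail(k=8, N=76, K=10, n=32)
```

A small `p_value` would suggest that selection at the site is over-represented in the group; in practice one such test per site would also call for a multiple-testing correction, which this sketch omits.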


Subjects
HIV Infections/virology , HIV-1/genetics , Infectious Pregnancy Complications/virology , Human Immunodeficiency Virus env Gene Products/genetics , Molecular Evolution , Female , Humans , Molecular Models , Pregnancy , Protein Conformation , Genetic Selection
9.
Plant Physiol ; 176(3): 2376-2394, 2018 03.
Article in English | MEDLINE | ID: mdl-29259104

ABSTRACT

Cold acclimation and winter survival in cereal species are determined by complicated environmentally regulated gene expression. However, studies investigating these complex cold responses are mostly conducted in controlled environments that only consider the responses to single environmental variables. In this study, we have comprehensively profiled global transcriptional responses in crowns of field-grown spring and winter wheat (Triticum aestivum) genotypes and their near-isogenic lines with the VRN-A1 alleles swapped. This in-depth analysis revealed multiple signaling, interactive pathways that influence cold tolerance and phenological development to optimize plant growth and development in preparation for a wide range of over-winter stresses. Investigation of genetic differences at the VRN-A1 locus revealed that a vernalization requirement maintained a higher level of cold response pathways while VRN-A1 genetically promoted floral development. Our results also demonstrated the influence of genetic background on the expression of cold and flowering pathways. The link between delayed shoot apex development and the induction of cold tolerance was reflected by the gradual up-regulation of abscisic acid-dependent and C-REPEAT-BINDING FACTOR pathways. This was accompanied by the down-regulation of key genes involved in meristem development as the autumn progressed. The chromosome location of differentially expressed genes between the winter and spring wheat genetic backgrounds showed a striking pattern of biased gene expression on chromosomes 6A and 6D, indicating a transcriptional regulation at the genome level. This finding adds to the complexity of the genetic cascades and gene interactions that determine the evolutionary patterns of both phenological development and cold tolerance traits in wheat.


Subjects
Acclimatization/genetics , Plant Gene Expression Regulation , Triticum/physiology , Alleles , Cell Wall/genetics , Cell Wall/metabolism , Plant Chromosomes , Cluster Analysis , Cold-Shock Response/genetics , Flowers/genetics , Gene Expression Profiling , Genotype , Metabolic Networks and Pathways/genetics , Genetic Polymorphism , Saskatchewan , Triticum/genetics , Triticum/growth & development
11.
J Comput Biol ; 24(8): 799-808, 2017 Aug.
Article in English | MEDLINE | ID: mdl-28742392

ABSTRACT

Contemporary workflow management systems are driven by explicit process models specifying the interdependencies between tasks. Creating these models is a challenging and time-consuming task. Existing approaches to mining concrete workflows into models tackle design aspects related to the diverging abstraction levels of the tasks. Concrete workflow logs represent tasks and cases of concrete events (partially or totally ordered), grounding hidden multilevel (abstract) semantics and contexts. Relevant generalized events could be rediscovered within these processes. In this article, we propose an ontology-based workflow mining system to generate patterns from sequences of events that are themselves extracted from texts. Our system, T-GOWler (Generalized Ontology-based WorkfLow minER within Texts), is based on two ontology-based modules: a workflow extractor and a pattern miner. To this end, it uses two different ontologies: a domain one (to support workflow extraction from texts) and a processual one (to mine generalized patterns from extracted workflows).


Subjects
Algorithms , Computational Biology/methods , Data Mining/methods , Gene Ontology , Semantics , Humans , Phylogeny , Workflow
12.
BMC Bioinformatics ; 18(1): 208, 2017 Apr 11.
Article in English | MEDLINE | ID: mdl-28399797

ABSTRACT

BACKGROUND: Advances in cloning and sequencing technology are yielding a massive number of viral genomes. The classification and annotation of these genomes constitute important assets in the discovery of genomic variability, taxonomic characteristics and disease mechanisms. Existing classification methods are often designed for specific well-studied families of viruses. Thus, viral comparative genomic studies could benefit from more generic, fast and accurate tools for classifying and typing newly sequenced strains of diverse virus families. RESULTS: Here, we introduce a virus classification platform, CASTOR, based on machine learning methods. CASTOR is inspired by a well-known technique in molecular biology: restriction fragment length polymorphism (RFLP). It simulates, in silico, the restriction digestion of genomic material by different enzymes into fragments. It uses two metrics to construct feature vectors for machine learning algorithms in the classification step. We benchmark CASTOR for the classification of distinct datasets of human papillomaviruses (HPV), hepatitis B viruses (HBV) and human immunodeficiency viruses type 1 (HIV-1). Results reveal true positive rates of 99%, 99% and 98% for HPV Alpha species, HBV genotyping and HIV-1 M subtyping, respectively. Furthermore, CASTOR shows a competitive performance compared to well-known HIV-1 specific classifiers (REGA and COMET) on whole genomes and pol fragments. CONCLUSION: The performance of CASTOR, its genericity and robustness could enable novel and accurate large-scale virus studies. The CASTOR web platform provides open-access, collaborative and reproducible machine learning classifiers. CASTOR can be accessed at http://castor.bioinfo.uqam.ca.
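The in silico restriction digestion described above can be sketched as a search for recognition sites followed by cutting. This is an illustrative sketch, not CASTOR's code: the function name `digest` is hypothetical, the sequence is treated as linear and single-stranded, and the abstract does not state which two metrics CASTOR derives from the fragments, so only raw fragment lengths are returned.

```python
from typing import List


def digest(seq: str, site: str, cut_offset: int) -> List[int]:
    """Cut a linear sequence at every occurrence of `site` and return fragment lengths.

    `cut_offset` is the position of the cut inside the recognition site,
    e.g. 1 for EcoRI, which recognizes GAATTC and cuts after the G.
    """
    cuts = []
    start = seq.find(site)
    while start != -1:
        cuts.append(start + cut_offset)
        # advance by one so overlapping sites are also found
        start = seq.find(site, start + 1)
    bounds = [0] + cuts + [len(seq)]
    # drop zero-length fragments (a cut exactly at either end of the sequence)
    return [b - a for a, b in zip(bounds, bounds[1:]) if b > a]


# Toy sequence with two EcoRI sites, invented for illustration only.
fragments = digest("AAGAATTCTTTTGAATTCAA", site="GAATTC", cut_offset=1)
```

Repeating this for several enzymes gives one fragment-length profile per enzyme, from which numeric features for a classifier could then be derived.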


Subjects
Viral Genome , Genomics/methods , Machine Learning , Classification , Computer Simulation , HIV-1/genetics , Hepatitis B Virus/genetics , Humans , Papillomaviridae/genetics , DNA Sequence Analysis/methods , RNA Sequence Analysis/methods
13.
Nucleic Acids Res ; 45(2): 556-566, 2017 01 25.
Article in English | MEDLINE | ID: mdl-27899600

ABSTRACT

MicroRNAs (miRNAs) are short single-stranded RNA molecules derived from hairpin-forming precursors that play a crucial role as post-transcriptional regulators in eukaryotes and viruses. In recent years, many microRNA target genes (MTGs) have been identified experimentally. However, because of the high costs of experimental approaches, target gene databases remain incomplete. Although several target prediction programs have been developed in recent years to identify MTGs in silico, their specificity and sensitivity remain low. Here, we propose a new approach called MirAncesTar, which uses ancestral genome reconstruction to boost the accuracy of existing MTG prediction tools for human miRNAs. For each miRNA and each putative human target UTR, our algorithm makes use of existing prediction tools to identify putative target sites in the human UTR, as well as in its mammalian orthologs and inferred ancestral sequences. It then evaluates evidence in support of selective pressure to maintain target site counts (rather than sequences), accounting for the possibility of target site turnover. It finally integrates this measure with several simpler ones using a logistic regression predictor. MirAncesTar improves the accuracy of existing MTG predictors by 26% to 157%. Source code and prediction results for human miRNAs, as well as supporting evolutionary data, are available at http://cs.mcgill.ca/~blanchem/mirancestar.


Subjects
Computational Biology/methods , MicroRNAs/genetics , RNA Interference , Messenger RNA/genetics , Algorithms , Animals , Binding Sites , Computer Simulation , Humans , MicroRNAs/chemistry , Messenger RNA/chemistry
14.
BioData Min ; 9: 30, 2016.
Article in English | MEDLINE | ID: mdl-27688811

ABSTRACT

BACKGROUND: Studying the functions and structures of proteins is important for understanding the molecular mechanisms of life. The number of publicly available protein structures has become extremely large. Still, the classification of a protein structure remains a difficult, costly, and time-consuming task. The difficulties are often due to the essential role of spatial and topological structures in the classification of protein structures. RESULTS: We propose ProtNN, a novel classification approach for protein 3D-structures. Given an unannotated query protein structure and a set of annotated proteins, ProtNN assigns to the query protein the class with the highest number of votes across the k nearest neighbor reference proteins, where k is a user-defined parameter. The search for the nearest neighbor annotated structures is based on a protein-graph representation model and pairwise similarities between vector embeddings of the query and the reference protein structures in structural and topological spaces. CONCLUSIONS: We demonstrate through an extensive experimental evaluation that ProtNN is able to accurately classify several datasets in an extremely fast runtime compared to state-of-the-art approaches. We further show that ProtNN is able to scale up to a whole PDB dataset in a single-process mode with no parallelization, with a gain of thousands order of magnitude in runtime compared to state-of-the-art approaches.
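The voting rule described above (assign the class with the most votes among the k nearest annotated references) can be sketched as follows. This is not the ProtNN code: the function name `knn_vote` is hypothetical, Euclidean distance is an assumed stand-in for the pairwise similarity the abstract leaves unspecified, and the two-dimensional vectors stand in for the graph-derived embeddings.

```python
from collections import Counter
from math import dist
from typing import List, Sequence, Tuple


def knn_vote(query: Sequence[float],
             references: List[Tuple[Sequence[float], str]],
             k: int) -> str:
    """Return the majority class among the k reference vectors closest to `query`.

    Ties are resolved in favour of the tied class whose member is nearest.
    """
    # sorted() is stable, so equally distant references keep their input order
    nearest = sorted(references, key=lambda ref: dist(query, ref[0]))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Toy embeddings and class names, invented for illustration only.
refs = [
    ((0.0, 0.0), "alpha"),
    ((0.1, 0.2), "alpha"),
    ((1.0, 1.0), "beta"),
    ((0.9, 1.1), "beta"),
    ((0.2, 0.1), "alpha"),
]
predicted = knn_vote((1.0, 0.9), refs, k=3)
```

The exhaustive sort is linear-logarithmic in the number of references; scaling to a full structure database would normally call for an index or a vectorized distance computation.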

15.
BMC Bioinformatics ; 16: 68, 2015 Mar 03.
Article in English | MEDLINE | ID: mdl-25887434

ABSTRACT

BACKGROUND: Workflows, or computational pipelines, consisting of collections of multiple linked tasks are becoming more and more popular in many scientific fields, including computational biology. For example, simulation studies, which are now a must for statistical validation of new bioinformatics methods and software, are frequently carried out using the available workflow platforms. Workflows are typically organized to minimize the total execution time and to maximize the efficiency of the included operations. Clustering algorithms can be applied either for regrouping similar workflows for their simultaneous execution on a server, or for dispatching some lengthy workflows to different servers, or for classifying the available workflows with a view to performing a specific keyword search. RESULTS: In this study, we consider four different workflow encoding and clustering schemes which are representative of bioinformatics projects. Some of them allow for clustering workflows with similar topological features, while the others regroup workflows according to their specific attributes (e.g. associated keywords) or execution time. The four types of workflow encoding examined in this study were compared using the weighted versions of k-means and k-medoids partitioning algorithms. The Calinski-Harabasz, Silhouette and logSS clustering indices were considered. Hierarchical classification methods, including the UPGMA, Neighbor Joining, Fitch and Kitsch algorithms, were also applied to classify bioinformatics workflows. Moreover, a novel pairwise measure of clustering solution stability, which can be computed in situations when a series of independent program runs is carried out, was introduced. CONCLUSIONS: Our findings based on the analysis of 220 real-life bioinformatics workflows suggest that the weighted clustering models based on keyword information or task execution times provide the most appropriate clustering solutions. Using datasets generated by the Armadillo and Taverna scientific workflow management systems, we found that the weighted cosine distance in association with the k-medoids partitioning algorithm and the presence-absence workflow encoding provided the highest values of the Rand index among all compared clustering strategies. The introduced clustering stability indices, PS and PSG, can be effectively used to identify elements with a low clustering support.
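The combination singled out in the conclusions, presence-absence encoding with a weighted cosine distance, can be sketched as follows. The abstract does not give the formula, so the weighted cosine used here (weights applied inside both the dot product and the norms) is one common definition and an assumption; the function names, task names and weights are hypothetical. The k-medoids step itself is not shown.

```python
from math import sqrt
from typing import List, Sequence


def presence_absence(workflow_tasks: Sequence[str], vocabulary: Sequence[str]) -> List[int]:
    """Encode a workflow as a 0/1 vector: 1 if the task occurs in it, else 0."""
    present = set(workflow_tasks)
    return [1 if task in present else 0 for task in vocabulary]


def weighted_cosine_distance(x: Sequence[float], y: Sequence[float],
                             w: Sequence[float]) -> float:
    """1 - weighted cosine similarity, with non-negative feature weights `w`."""
    dot = sum(wi * xi * yi for wi, xi, yi in zip(w, x, y))
    norm_x = sqrt(sum(wi * xi * xi for wi, xi in zip(w, x)))
    norm_y = sqrt(sum(wi * yi * yi for wi, yi in zip(w, y)))
    if norm_x == 0 or norm_y == 0:
        return 1.0  # convention chosen here for empty workflows
    return 1.0 - dot / (norm_x * norm_y)


# Toy task vocabulary and workflows, invented for illustration only.
vocabulary = ["align", "trim", "tree", "bootstrap"]
wf_a = presence_absence(["align", "tree"], vocabulary)
wf_b = presence_absence(["align", "tree", "bootstrap"], vocabulary)
uniform = [1.0, 1.0, 1.0, 1.0]
d_ab = weighted_cosine_distance(wf_a, wf_b, uniform)
```

A matrix of such pairwise distances is all that a k-medoids algorithm needs, since medoids are actual workflows rather than averaged points.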


Subjects
Algorithms , Computational Biology/methods , Software , Workflow , Cluster Analysis , Datasets as Topic , Phylogeny
16.
BMC Genomics ; 16: 339, 2015 Apr 24.
Article in English | MEDLINE | ID: mdl-25903161

ABSTRACT

BACKGROUND: Wheat is a major staple crop with broad adaptability to a wide range of environmental conditions. This adaptability involves several stress and developmentally responsive genes, in which microRNAs (miRNAs) have emerged as important regulatory factors. However, the approaches currently used to identify miRNAs in this complex polyploid system focus on conserved and highly expressed miRNAs and regularly miss those that are lineage-specific, condition-specific, or recently evolved. In addition, many environmental and biological factors affecting miRNA expression have not yet been considered, so the repertoire of wheat miRNAs remains incomplete. RESULTS: We developed a conservation-independent technique based on an integrative approach that combines machine learning, bioinformatic tools, biological insights of known miRNA expression profiles and universal criteria of plant miRNAs to identify miRNAs with more confidence. The developed pipeline can potentially identify novel wheat miRNAs that share features common to several species or that are species specific or clade specific. It allowed the discovery of 199 miRNA candidates associated with different abiotic stresses and development stages. We also highlight from the raw data 267 miRNAs conserved with 43 miRBase families. The predicted miRNAs are highly associated with abiotic stress responses, tolerance and development. GO enrichment analysis showed that they may play biological and physiological roles associated with cold, salt and aluminum (Al) through auxin signaling pathways, regulation of gene expression, ubiquitination, transport, carbohydrates, gibberellins, lipid, glutathione and secondary metabolism, photosynthesis, as well as floral transition and flowering. CONCLUSION: This approach provides a broad repertoire of hexaploid wheat miRNAs associated with abiotic stress responses, tolerance and development. These valuable resources of expressed wheat miRNAs will help in elucidating the regulatory mechanisms involved in freezing and Al responses and tolerance mechanisms as well as for development and flowering. In the long term, it may help in breeding stress tolerant plants.


Subjects
Computational Biology/methods , MicroRNAs/analysis , Plant RNA/analysis , Triticum/growth & development , Triticum/genetics , Gene Expression Profiling/methods , Plant Gene Expression Regulation , Machine Learning , Polyploidy , Species Specificity , Physiological Stress
17.
Nucleic Acids Res ; 41(15): 7200-11, 2013 Aug.
Article in English | MEDLINE | ID: mdl-23748953

ABSTRACT

MicroRNAs (miRNAs) are short RNA species derived from hairpin-forming miRNA precursors (pre-miRNA) and acting as key posttranscriptional regulators. Most computational tools labeled as miRNA predictors are in fact pre-miRNA predictors and provide no information about the putative miRNA location within the pre-miRNA. Sequence and structural features that determine the location of the miRNA, and the extent to which these properties vary from species to species, are poorly understood. We have developed miRdup, a computational predictor for the identification of the most likely miRNA location within a given pre-miRNA or the validation of a candidate miRNA. MiRdup is based on a random forest classifier trained with experimentally validated miRNAs from miRBase, with features that characterize the miRNA-miRNA* duplex. Because we observed that miRNAs have sequence and structural properties that differ between species, mostly in terms of duplex stability, we trained various clade-specific miRdup models and obtained increased accuracy. MiRdup self-trains on the most recent version of miRBase and is easy to use. Combined with existing pre-miRNA predictors, it will be valuable for both de novo mapping of miRNAs and filtering of large sets of candidate miRNAs obtained from transcriptome sequencing projects. MiRdup is open source under the GPLv3 and available at http://www.cs.mcgill.ca/~blanchem/mirdup/.


Subjects
Computational Biology/methods , MicroRNAs/analysis , RNA Precursors/analysis , Plant RNA/analysis , Software , Animals , Internet , Inverted Repeat Sequences , MicroRNAs/genetics , Nucleic Acid Conformation , Plants/genetics , RNA Precursors/genetics , Plant RNA/genetics , Sensitivity and Specificity , RNA Sequence Analysis/methods
18.
PLoS One ; 7(1): e29903, 2012.
Article in English | MEDLINE | ID: mdl-22253821

ABSTRACT

In this paper we introduce Armadillo v1.1, a novel workflow platform dedicated to designing and conducting phylogenetic studies, including comprehensive simulations. A number of important phylogenetic and general bioinformatics tools have been included in the first software release. As Armadillo is an open-source project, it allows scientists to develop their own modules as well as to integrate existing computer applications. Using our workflow platform, different complex phylogenetic tasks can be modeled and presented in a single workflow without any prior knowledge of programming techniques. The first version of Armadillo was successfully used by professors of bioinformatics at Université du Québec à Montréal during graduate computational biology courses taught in 2010-11. The program and its source code are freely available at: .


Subjects
Computational Biology/methods , Computer Simulation , Phylogeny , Software , Workflow , Adiponectin/chemistry , Amino Acid Sequence , Molecular Sequence Data , Sequence Alignment , User-Computer Interface
19.
BMC Bioinformatics ; 12 Suppl 9: S9, 2011 Oct 05.
Article in English | MEDLINE | ID: mdl-22151279

ABSTRACT

BACKGROUND: The identification of functional regions contained in a given multiple sequence alignment constitutes one of the major challenges of comparative genomics. Several studies have focused on the identification of conserved regions and motifs. However, most existing methods ignore the relationship between the functional genomic regions and the external evidence associated with the considered group of species (e.g., carcinogenicity of Human Papilloma Virus). In the past, we have proposed a method that takes into account prior knowledge of external evidence (e.g., carcinogenicity or invasivity of the considered organisms) and identifies genomic regions related to a specific disease. RESULTS AND CONCLUSION: We present a new algorithm for detecting genomic regions that may be associated with a disease. Two new variability functions and a bipartition optimization procedure are described. We validate and weigh our results using the Adjusted Rand Index (ARI), and thus assess to what extent the selected regions are related to carcinogenicity, invasivity, or any other species classification, given as input. The predictive power of different hit region detection functions was assessed on synthetic and real data. Our simulation results suggest that there is no single function that provides the best results in all practical situations (e.g., monophyletic or polyphyletic evolution, and positive or negative selection), and that at least three different functions might be useful. The proposed hit region identification functions that do not benefit from prior knowledge (i.e., carcinogenicity or invasivity of the involved organisms) can provide results equivalent to those of the existing functions that take advantage of such prior knowledge. Using the new algorithm, we examined the Neisseria meningitidis FrpB gene product for invasivity and immunologic activity, and human papilloma virus (HPV) E6 oncoprotein for carcinogenicity, and confirmed some well-known molecular features, including surface exposed loops for N. meningitidis and PDZ domain for HPV.
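The Adjusted Rand Index used above for validation compares two partitions of the same items, for example a grouping of species induced by a candidate region and a known classification such as carcinogenic versus non-carcinogenic. The sketch below implements the standard pair-counting formula; the function name `adjusted_rand_index` and the toy labels are hypothetical, and how the paper derives a partition from a genomic region is not reproduced here.

```python
from collections import Counter
from math import comb
from typing import Hashable, Sequence


def adjusted_rand_index(labels_a: Sequence[Hashable],
                        labels_b: Sequence[Hashable]) -> float:
    """Adjusted Rand Index between two labelings of the same items.

    1.0 means identical partitions (up to label names); values near 0
    are what random labelings would give; negative values are possible.
    """
    if len(labels_a) != len(labels_b):
        raise ValueError("both labelings must cover the same items")
    n = len(labels_a)
    if n < 2:
        raise ValueError("at least two items are required")
    # pairs of items placed together in both partitions, in A only, in B only
    sum_cells = sum(comb(c, 2) for c in Counter(zip(labels_a, labels_b)).values())
    sum_a = sum(comb(c, 2) for c in Counter(labels_a).values())
    sum_b = sum(comb(c, 2) for c in Counter(labels_b).values())
    expected = sum_a * sum_b / comb(n, 2)
    max_index = (sum_a + sum_b) / 2
    if max_index == expected:
        return 1.0  # degenerate case: both partitions are trivial
    return (sum_cells - expected) / (max_index - expected)


# Toy labelings of four species, invented for illustration only.
region_groups = [0, 0, 1, 1]
known_classes = ["high", "high", "low", "low"]
ari = adjusted_rand_index(region_groups, known_classes)
```

Because the index is corrected for chance, a region whose grouping matches the external classification scores near 1, while an unrelated region scores near 0 regardless of how many groups it forms.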


Subjects
Algorithms , Bacterial Genome , Viral Genome , Genomics/methods , Bacterial Infections/microbiology , Bacterial Outer Membrane Proteins/genetics , Humans , Neisseria meningitidis/genetics , Papillomaviridae/genetics , Sequence Alignment , Virus Diseases/virology
20.
Bioinformatics ; 27(13): i266-74, 2011 Jul 01.
Article in English | MEDLINE | ID: mdl-21685080

ABSTRACT

MOTIVATION: The identification of non-coding functional regions of the human genome remains one of the main challenges of genomics. By observing how a given region evolved over time, one can detect signs of negative or positive selection hinting that the region may be functional. With the quickly increasing number of vertebrate genomes to compare with our own, this type of approach is set to become extremely powerful, provided the right analytical tools are available. RESULTS: A large number of approaches have been proposed to measure signs of past selective pressure, usually in the form of reduced mutation rate. Here, we propose a radically different approach to the detection of non-coding functional regions: instead of measuring past evolutionary rates, we build a machine learning classifier to predict current substitution rates in human based on the inferred evolutionary events that affected the region during vertebrate evolution. We show that different types of evolutionary events, occurring along different branches of the phylogenetic tree, bring very different amounts of information. We propose a number of simple machine learning classifiers and show that a Support-Vector Machine (SVM) predictor clearly outperforms existing tools at predicting human non-coding functional sites. Comparison to external evidence of selection and regulatory function confirms that these SVM predictions are more accurate than those of other approaches. AVAILABILITY: The predictor and predictions made are available at http://www.mcb.mcgill.ca/~blanchem/sadri. CONTACT: blanchem@mcb.mcgill.ca.


Subjects
Biological Evolution , Phylogeny , Animals , Artificial Intelligence , Genome , Human Genome , Humans , Open Reading Frames , Vertebrates/genetics