1.
Bioinformatics ; 38(16): 3984-3991, 2022 08 10.
Article in English | MEDLINE | ID: mdl-35762945

ABSTRACT

MOTIVATION: Precise identification of Biosynthetic Gene Clusters (BGCs) is a challenging task. Performance of BGC discovery tools is limited by their capacity to accurately predict components belonging to candidate BGCs, often overestimating cluster boundaries. To support optimizing the composition and boundaries of candidate BGCs, we propose a reinforcement learning approach relying on protein domains and functional annotations from expert-curated BGCs. RESULTS: The proposed reinforcement learning method aims to improve candidate BGCs obtained with state-of-the-art tools. It was evaluated on candidate BGCs obtained for two fungal genomes, Aspergillus niger and Aspergillus nidulans. The results show an improvement in gene precision of more than 15% for TOUCAN, fungiSMASH and DeepBGC, and in cluster precision of more than 25% for fungiSMASH and DeepBGC, allowing these tools to obtain almost perfect precision in cluster prediction. This can pave the way for optimizing current prediction of candidate BGCs in fungi, while minimizing the curation effort required by domain experts. AVAILABILITY AND IMPLEMENTATION: https://github.com/bioinfoUQAM/RL-bgc-components. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.


Subject(s)
Fungi, Multigene Family, Fungi/genetics, Fungal Genome, Biosynthetic Pathways/genetics
2.
Plant Physiol ; 176(3): 2376-2394, 2018 03.
Article in English | MEDLINE | ID: mdl-29259104

ABSTRACT

Cold acclimation and winter survival in cereal species is determined by complicated environmentally regulated gene expression. However, studies investigating these complex cold responses are mostly conducted in controlled environments that only consider the responses to single environmental variables. In this study, we have comprehensively profiled global transcriptional responses in crowns of field-grown spring and winter wheat (Triticum aestivum) genotypes and their near-isogenic lines with the VRN-A1 alleles swapped. This in-depth analysis revealed multiple signaling, interactive pathways that influence cold tolerance and phenological development to optimize plant growth and development in preparation for a wide range of over-winter stresses. Investigation of genetic differences at the VRN-A1 locus revealed that a vernalization requirement maintained a higher level of cold response pathways while VRN-A1 genetically promoted floral development. Our results also demonstrated the influence of genetic background on the expression of cold and flowering pathways. The link between delayed shoot apex development and the induction of cold tolerance was reflected by the gradual up-regulation of abscisic acid-dependent and C-REPEAT-BINDING FACTOR pathways. This was accompanied by the down-regulation of key genes involved in meristem development as the autumn progressed. The chromosome location of differentially expressed genes between the winter and spring wheat genetic backgrounds showed a striking pattern of biased gene expression on chromosomes 6A and 6D, indicating a transcriptional regulation at the genome level. This finding adds to the complexity of the genetic cascades and gene interactions that determine the evolutionary patterns of both phenological development and cold tolerance traits in wheat.


Subject(s)
Acclimatization/genetics, Plant Gene Expression Regulation, Triticum/physiology, Alleles, Cell Wall/genetics, Cell Wall/metabolism, Plant Chromosomes, Cluster Analysis, Cold-Shock Response/genetics, Flowers/genetics, Gene Expression Profiling, Genotype, Metabolic Networks and Pathways/genetics, Genetic Polymorphism, Saskatchewan, Triticum/genetics, Triticum/growth & development
3.
Nucleic Acids Res ; 45(2): 556-566, 2017 01 25.
Article in English | MEDLINE | ID: mdl-27899600

ABSTRACT

MicroRNAs (miRNAs) are short single-stranded RNA molecules derived from hairpin-forming precursors that play a crucial role as post-transcriptional regulators in eukaryotes and viruses. In the past years, many microRNA target genes (MTGs) have been identified experimentally. However, because of the high costs of experimental approaches, target gene databases remain incomplete. Although several target prediction programs have been developed in recent years to identify MTGs in silico, their specificity and sensitivity remain low. Here, we propose a new approach called MirAncesTar, which uses ancestral genome reconstruction to boost the accuracy of existing MTG prediction tools for human miRNAs. For each miRNA and each putative human target UTR, our algorithm makes use of existing prediction tools to identify putative target sites in the human UTR, as well as in its mammalian orthologs and inferred ancestral sequences. It then evaluates evidence in support of selective pressure to maintain target site counts (rather than sequences), accounting for the possibility of target site turnover. It finally integrates this measure with several simpler ones using a logistic regression predictor. MirAncesTar improves the accuracy of existing MTG predictors by 26% to 157%. Source code and prediction results for human miRNAs, as well as supporting evolutionary data are available at http://cs.mcgill.ca/∼blanchem/mirancestar.


Subject(s)
Computational Biology/methods, MicroRNAs/genetics, RNA Interference, Messenger RNA/genetics, Algorithms, Animals, Binding Sites, Computer Simulation, Humans, MicroRNAs/chemistry, Messenger RNA/chemistry
4.
BMC Bioinformatics ; 18(1): 208, 2017 Apr 11.
Article in English | MEDLINE | ID: mdl-28399797

ABSTRACT

BACKGROUND: Advances in cloning and sequencing technology are yielding a massive number of viral genomes. The classification and annotation of these genomes constitute important assets in the discovery of genomic variability, taxonomic characteristics and disease mechanisms. Existing classification methods are often designed for specific well-studied families of viruses. Thus, viral comparative genomic studies could benefit from more generic, fast and accurate tools for classifying and typing newly sequenced strains of diverse virus families. RESULTS: Here, we introduce a virus classification platform, CASTOR, based on machine learning methods. CASTOR is inspired by a well-known technique in molecular biology: restriction fragment length polymorphism (RFLP). It simulates, in silico, the restriction digestion of genomic material by different enzymes into fragments. It uses two metrics to construct feature vectors for machine learning algorithms in the classification step. We benchmark CASTOR for the classification of distinct datasets of human papillomaviruses (HPV), hepatitis B viruses (HBV) and human immunodeficiency viruses type 1 (HIV-1). Results reveal true positive rates of 99%, 99% and 98% for HPV Alpha species, HBV genotyping and HIV-1 M subtyping, respectively. Furthermore, CASTOR shows a competitive performance compared to well-known HIV-1 specific classifiers (REGA and COMET) on whole genomes and pol fragments. CONCLUSION: The performance, genericity and robustness of CASTOR could enable novel and accurate large-scale virus studies. The CASTOR web platform provides open-access, collaborative and reproducible machine learning classifiers. CASTOR can be accessed at http://castor.bioinfo.uqam.ca.
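
Illustrative sketch (not part of the abstract): the in silico restriction digestion described above could look roughly like the following Python. This is not the CASTOR implementation. The abstract does not name the two metrics it uses, so the fragment count and mean fragment length below are hypothetical stand-ins, and the function names and the two-enzyme panel are assumptions chosen only for illustration.

```python
# Recognition site and cut position inside the site (EcoRI: G^AATTC, BamHI: G^GATCC).
# The panel is an illustrative assumption, not the enzyme set used by CASTOR.
ENZYMES = {"EcoRI": ("GAATTC", 1), "BamHI": ("GGATCC", 1)}


def digest(sequence, site, cut_offset):
    """Return fragment lengths after cutting a linear sequence at every site occurrence."""
    cuts = []
    start = sequence.find(site)
    while start != -1:
        cuts.append(start + cut_offset)
        start = sequence.find(site, start + 1)
    bounds = [0] + cuts + [len(sequence)]
    return [b - a for a, b in zip(bounds, bounds[1:]) if b > a]


def fragment_features(sequence, enzymes=ENZYMES):
    """Build a feature vector: (fragment count, mean fragment length) per enzyme.

    These two metrics are hypothetical placeholders for the unnamed metrics
    mentioned in the abstract. Enzymes are visited in alphabetical order.
    """
    features = []
    for name in sorted(enzymes):
        site, offset = enzymes[name]
        lengths = digest(sequence, site, offset)
        if lengths:
            features += [len(lengths), sum(lengths) / len(lengths)]
        else:  # empty input sequence
            features += [0, 0.0]
    return features
```

Each genome would then contribute one such vector to a standard classifier; which classifier and which metrics CASTOR actually uses is not stated in the abstract.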


Subject(s)
Viral Genome, Genomics/methods, Machine Learning, Classification, Computer Simulation, HIV-1/genetics, Hepatitis B Virus/genetics, Humans, Papillomaviridae/genetics, DNA Sequence Analysis/methods, RNA Sequence Analysis/methods
5.
BMC Bioinformatics ; 16: 68, 2015 Mar 03.
Article in English | MEDLINE | ID: mdl-25887434

ABSTRACT

BACKGROUND: Workflows, or computational pipelines, consisting of collections of multiple linked tasks are becoming more and more popular in many scientific fields, including computational biology. For example, simulation studies, which are now a must for statistical validation of new bioinformatics methods and software, are frequently carried out using the available workflow platforms. Workflows are typically organized to minimize the total execution time and to maximize the efficiency of the included operations. Clustering algorithms can be applied either for regrouping similar workflows for their simultaneous execution on a server, or for dispatching some lengthy workflows to different servers, or for classifying the available workflows with a view to performing a specific keyword search. RESULTS: In this study, we consider four different workflow encoding and clustering schemes which are representative for bioinformatics projects. Some of them allow for clustering workflows with similar topological features, while the others regroup workflows according to their specific attributes (e.g. associated keywords) or execution time. The four types of workflow encoding examined in this study were compared using the weighted versions of k-means and k-medoids partitioning algorithms. The Calinski-Harabasz, Silhouette and logSS clustering indices were considered. Hierarchical classification methods, including the UPGMA, Neighbor Joining, Fitch and Kitsch algorithms, were also applied to classify bioinformatics workflows. Moreover, a novel pairwise measure of clustering solution stability, which can be computed in situations when a series of independent program runs is carried out, was introduced. CONCLUSIONS: Our findings based on the analysis of 220 real-life bioinformatics workflows suggest that the weighted clustering models based on keywords information or tasks execution times provide the most appropriate clustering solutions. 
Using datasets generated by the Armadillo and Taverna scientific workflow management system, we found that the weighted cosine distance in association with the k-medoids partitioning algorithm and the presence-absence workflow encoding provided the highest values of the Rand index among all compared clustering strategies. The introduced clustering stability indices, PS and PSG, can be effectively used to identify elements with a low clustering support.
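
Illustrative sketch (not part of the abstract): the combination of a presence-absence workflow encoding with a weighted cosine distance, mentioned above, might be expressed as follows in Python. The task names, the weights, and the exact form of the weighted cosine are assumptions; the paper's own definitions may differ.

```python
import math


def presence_absence(workflows):
    """Encode each workflow (a set of task names) as a 0/1 vector over all tasks seen."""
    vocabulary = sorted(set().union(*workflows))
    vectors = [[1 if task in wf else 0 for task in vocabulary] for wf in workflows]
    return vocabulary, vectors


def weighted_cosine_distance(x, y, weights):
    """1 - weighted cosine similarity; weights act per feature (assumed form)."""
    dot = sum(w * a * b for w, a, b in zip(weights, x, y))
    norm_x = math.sqrt(sum(w * a * a for w, a in zip(weights, x)))
    norm_y = math.sqrt(sum(w * b * b for w, b in zip(weights, y)))
    if norm_x == 0 or norm_y == 0:
        return 1.0  # treat an empty workflow as maximally distant
    return 1.0 - dot / (norm_x * norm_y)


# Hypothetical workflows described by the tasks they contain.
workflows = [{"align", "tree"}, {"align", "blast"}]
vocabulary, vectors = presence_absence(workflows)
```

A pairwise distance matrix built this way could be passed to any k-medoids implementation, since k-medoids only needs distances between items.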


Subject(s)
Algorithms, Computational Biology/methods, Software, Workflow, Cluster Analysis, Datasets as Topic, Phylogeny
6.
BMC Genomics ; 16: 339, 2015 Apr 24.
Article in English | MEDLINE | ID: mdl-25903161

ABSTRACT

BACKGROUND: Wheat is a major staple crop with broad adaptability to a wide range of environmental conditions. This adaptability involves several stress and developmentally responsive genes, in which microRNAs (miRNAs) have emerged as important regulatory factors. However, the currently used approaches to identify miRNAs in this polyploid complex system focus on conserved and highly expressed miRNAs and regularly miss those that are lineage-specific, condition-specific, or recently evolved. In addition, many environmental and biological factors affecting miRNA expression have not yet been considered, so the repertoire of wheat miRNAs remains incomplete. RESULTS: We developed a conservation-independent technique based on an integrative approach that combines machine learning, bioinformatic tools, biological insights of known miRNA expression profiles and universal criteria of plant miRNAs to identify miRNAs with more confidence. The developed pipeline can potentially identify novel wheat miRNAs that share features common to several species or that are species specific or clade specific. It allowed the discovery of 199 miRNA candidates associated with different abiotic stresses and development stages. We also highlight, from the raw data, 267 miRNAs conserved with 43 miRBase families. The predicted miRNAs are highly associated with abiotic stress responses, tolerance and development. GO enrichment analysis showed that they may play biological and physiological roles associated with cold, salt and aluminum (Al) through auxin signaling pathways, regulation of gene expression, ubiquitination, transport, carbohydrates, gibberellins, lipid, glutathione and secondary metabolism, photosynthesis, as well as floral transition and flowering. CONCLUSION: This approach provides a broad repertoire of hexaploid wheat miRNAs associated with abiotic stress responses, tolerance and development.
These valuable resources of expressed wheat miRNAs will help in elucidating the regulatory mechanisms involved in freezing and Al responses and tolerance mechanisms as well as for development and flowering. In the long term, it may help in breeding stress tolerant plants.


Subject(s)
Computational Biology/methods, MicroRNAs/analysis, Plant RNA/analysis, Triticum/growth & development, Triticum/genetics, Gene Expression Profiling/methods, Plant Gene Expression Regulation, Machine Learning, Polyploidy, Species Specificity, Physiological Stress
7.
Nucleic Acids Res ; 41(15): 7200-11, 2013 Aug.
Article in English | MEDLINE | ID: mdl-23748953

ABSTRACT

MicroRNAs (miRNAs) are short RNA species derived from hairpin-forming miRNA precursors (pre-miRNA) and acting as key posttranscriptional regulators. Most computational tools labeled as miRNA predictors are in fact pre-miRNA predictors and provide no information about the putative miRNA location within the pre-miRNA. Sequence and structural features that determine the location of the miRNA, and the extent to which these properties vary from species to species, are poorly understood. We have developed miRdup, a computational predictor for the identification of the most likely miRNA location within a given pre-miRNA or the validation of a candidate miRNA. MiRdup is based on a random forest classifier trained with experimentally validated miRNAs from miRbase, with features that characterize the miRNA-miRNA* duplex. Because we observed that miRNAs have sequence and structural properties that differ between species, mostly in terms of duplex stability, we trained various clade-specific miRdup models and obtained increased accuracy. MiRdup self-trains on the most recent version of miRbase and is easy to use. Combined with existing pre-miRNA predictors, it will be valuable for both de novo mapping of miRNAs and filtering of large sets of candidate miRNAs obtained from transcriptome sequencing projects. MiRdup is open source under the GPLv3 and available at http://www.cs.mcgill.ca/∼blanchem/mirdup/.


Subject(s)
Computational Biology/methods, MicroRNAs/analysis, RNA Precursors/analysis, Plant RNA/analysis, Software, Animals, Internet, Inverted Repeat Sequences, MicroRNAs/genetics, Nucleic Acid Conformation, Plants/genetics, RNA Precursors/genetics, Plant RNA/genetics, Sensitivity and Specificity, RNA Sequence Analysis/methods
8.
Genome Res ; 21(6): 850-62, 2011 Jun.
Article in English | MEDLINE | ID: mdl-21518738

ABSTRACT

Here we provide a detailed comparative analysis across the candidate X-Inactivation Center (XIC) region and the XIST locus in the genomes of six primates and three mammalian outgroup species. Since lemurs and other strepsirrhine primates represent the sister lineage to all other primates, this analysis focuses on lemurs to reconstruct the ancestral primate sequences and to gain insight into the evolution of this region and the genes within it. This comparative evolutionary genomics approach reveals significant expansion in genomic size across the XIC region in higher primates, with minimal size alterations across the XIST locus itself. Reconstructed primate ancestral XIC sequences show that the most dramatic changes during the past 80 million years occurred between the ancestral primate and the lineage leading to Old World monkeys. In contrast, the XIST locus compared between human and the primate ancestor does not indicate any dramatic changes to exons or XIST-specific repeats; rather, evolution of this locus reflects small incremental changes in overall sequence identity and short repeat insertions. While this comparative analysis reinforces that the region around XIST has been subject to significant genomic change, even among primates, our data suggest that evolution of the XIST sequences themselves represents only small lineage-specific changes across the past 80 million years.


Subject(s)
Molecular Evolution, X-Linked Genes/genetics, Lemur/genetics, Phylogeny, Untranslated RNA/genetics, Animals, Base Sequence, Bacterial Artificial Chromosomes, Computational Biology, Complementary DNA/genetics, Humans, Fluorescence In Situ Hybridization, Likelihood Functions, Genetic Models, Molecular Sequence Data, Polymerase Chain Reaction, Long Noncoding RNA, DNA Sequence Analysis, Species Specificity
9.
PLoS One ; 19(1): e0296627, 2024.
Article in English | MEDLINE | ID: mdl-38241279

ABSTRACT

Machine learning has been shown to be effective at identifying distinctive genomic signatures among viral sequences. These signatures are defined as pervasive motifs in the viral genome that allow discrimination between species or variants. In the context of SARS-CoV-2, the identification of these signatures can assist in taxonomic and phylogenetic studies, improve the recognition and definition of emerging variants, and aid in the characterization of functional properties of polymorphic gene products. In this paper, we assess KEVOLVE, an approach based on a genetic algorithm with a machine-learning kernel, to identify multiple genomic signatures based on minimal sets of k-mers. In a comparative study, in which we analyzed a large SARS-CoV-2 genome dataset, KEVOLVE was more effective at identifying variant-discriminative signatures than several gold-standard statistical tools. Subsequently, these signatures were characterized using a new extension of KEVOLVE (KANALYZER) to highlight variations of the discriminative signatures among different classes of variants, their genomic location, and the mutations involved. The majority of identified signatures were associated with known mutations among the different variants, in terms of functional and pathological impact based on available literature. Here we showed that KEVOLVE is a robust machine learning approach to identify discriminative signatures among SARS-CoV-2 variants, which are frequently also biologically relevant, while bypassing multiple sequence alignments. The source code of the method and additional resources are available at: https://github.com/bioinfoUQAM/KEVOLVE.


Subject(s)
COVID-19, SARS-CoV-2, Humans, SARS-CoV-2/genetics, Phylogeny, COVID-19/diagnosis, COVID-19/genetics, Genomics, Machine Learning
10.
Bioinformatics ; 27(13): i266-74, 2011 Jul 01.
Article in English | MEDLINE | ID: mdl-21685080

ABSTRACT

MOTIVATION: The identification of non-coding functional regions of the human genome remains one of the main challenges of genomics. By observing how a given region evolved over time, one can detect signs of negative or positive selection hinting that the region may be functional. With the quickly increasing number of vertebrate genomes to compare with our own, this type of approach is set to become extremely powerful, provided the right analytical tools are available. RESULTS: A large number of approaches have been proposed to measure signs of past selective pressure, usually in the form of reduced mutation rate. Here, we propose a radically different approach to the detection of non-coding functional regions: instead of measuring past evolutionary rates, we build a machine learning classifier to predict current substitution rates in human based on the inferred evolutionary events that affected the region during vertebrate evolution. We show that different types of evolutionary events, occurring along different branches of the phylogenetic tree, bring very different amounts of information. We propose a number of simple machine learning classifiers and show that a Support-Vector Machine (SVM) predictor clearly outperforms existing tools at predicting human non-coding functional sites. Comparison to external evidence of selection and regulatory function confirms that these SVM predictions are more accurate than those of other approaches. AVAILABILITY: The predictor and predictions made are available at http://www.mcb.mcgill.ca/~blanchem/sadri. CONTACT: blanchem@mcb.mcgill.ca.


Subject(s)
Biological Evolution, Phylogeny, Animals, Artificial Intelligence, Genome, Human Genome, Humans, Open Reading Frames, Vertebrates/genetics
11.
BMC Bioinformatics ; 12 Suppl 9: S9, 2011 Oct 05.
Article in English | MEDLINE | ID: mdl-22151279

ABSTRACT

BACKGROUND: The identification of functional regions contained in a given multiple sequence alignment constitutes one of the major challenges of comparative genomics. Several studies have focused on the identification of conserved regions and motifs. However, most existing methods ignore the relationship between the functional genomic regions and the external evidence associated with the considered group of species (e.g., carcinogenicity of Human Papilloma Virus). In the past, we have proposed a method that takes into account prior knowledge of external evidence (e.g., carcinogenicity or invasivity of the considered organisms) and identifies genomic regions related to a specific disease. RESULTS AND CONCLUSION: We present a new algorithm for detecting genomic regions that may be associated with a disease. Two new variability functions and a bipartition optimization procedure are described. We validate and weigh our results using the Adjusted Rand Index (ARI), and thus assess to what extent the selected regions are related to carcinogenicity, invasivity, or any other species classification, given as input. The predictive power of different hit region detection functions was assessed on synthetic and real data. Our simulation results suggest that there is no single function that provides the best results in all practical situations (e.g., monophyletic or polyphyletic evolution, and positive or negative selection), and that at least three different functions might be useful. The proposed hit region identification functions that do not benefit from prior knowledge (i.e., carcinogenicity or invasivity of the involved organisms) can provide results equivalent to those of the existing functions that take advantage of such prior knowledge.
Using the new algorithm, we examined the Neisseria meningitidis FrpB gene product for invasivity and immunologic activity, and human papilloma virus (HPV) E6 oncoprotein for carcinogenicity, and confirmed some well-known molecular features, including surface exposed loops for N. meningitidis and PDZ domain for HPV.
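
Illustrative sketch (not part of the abstract): the Adjusted Rand Index used above to compare a detected bipartition with a known species classification can be computed with the standard pair-counting formula. The Python below is a generic implementation of that formula, not the authors' code; the function name and the convention of returning 1.0 for degenerate inputs are assumptions.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_a, labels_b):
    """Pair-counting ARI between two labelings of the same items."""
    if len(labels_a) != len(labels_b):
        raise ValueError("both labelings must cover the same items")
    # Pairs of items grouped together in both labelings, in A only, and in B only.
    together_both = sum(comb(n, 2) for n in Counter(zip(labels_a, labels_b)).values())
    together_a = sum(comb(n, 2) for n in Counter(labels_a).values())
    together_b = sum(comb(n, 2) for n in Counter(labels_b).values())
    total_pairs = comb(len(labels_a), 2)
    if total_pairs == 0:
        return 1.0  # fewer than two items: treated as perfect agreement (assumed convention)
    expected = together_a * together_b / total_pairs
    maximum = (together_a + together_b) / 2
    if maximum == expected:
        return 1.0  # both labelings are trivial (assumed convention)
    return (together_both - expected) / (maximum - expected)
```

For example, a bipartition of species induced by a candidate region could be passed as `labels_a` and a carcinogenic/non-carcinogenic labeling as `labels_b`; values near 1 indicate close agreement and values near 0 indicate chance-level agreement.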


Subject(s)
Algorithms, Bacterial Genome, Viral Genome, Genomics/methods, Bacterial Infections/microbiology, Bacterial Outer Membrane Proteins/genetics, Humans, Neisseria meningitidis/genetics, Papillomaviridae/genetics, Sequence Alignment, Virus Diseases/virology
12.
Bioinformatics ; 26(1): 130-1, 2010 Jan 01.
Article in English | MEDLINE | ID: mdl-19850756

ABSTRACT

SUMMARY: The computational inference of ancestral genomes consists of five difficult steps: identifying syntenic regions, inferring ancestral arrangement of syntenic regions, aligning multiple sequences, reconstructing the insertion and deletion history and finally inferring substitutions. Each of these steps has received a lot of attention in the past years. However, there currently exists no framework that integrates all of the different steps in an easy workflow. Here, we introduce Ancestors 1.0, a web server allowing one to easily and quickly perform the last three steps of the ancestral genome reconstruction procedure. It implements several alignment algorithms, an indel maximum likelihood solver and a context-dependent maximum likelihood substitution inference algorithm. The results presented by the server include the posterior probabilities for the last two steps of the ancestral genome reconstruction and the expected error rate of each ancestral base prediction. AVAILABILITY: Ancestors 1.0 is available at http://ancestors.bioinfo.uqam.ca/ancestorWeb/.


Subject(s)
Chromosome Mapping/methods, Genome/genetics, Internet, Pedigree, Sequence Alignment/methods, DNA Sequence Analysis/methods, Animals, Base Sequence, Humans, Molecular Sequence Data
13.
NAR Genom Bioinform ; 2(4): lqaa098, 2020 Dec.
Article in English | MEDLINE | ID: mdl-33575642

ABSTRACT

Fungal secondary metabolites (SMs) are an important source of numerous bioactive compounds largely applied in the pharmaceutical industry, as in the production of antibiotics and anticancer medications. The discovery of novel fungal SMs can potentially benefit human health. Identifying biosynthetic gene clusters (BGCs) involved in the biosynthesis of SMs can be a costly and complex task, especially due to the genomic diversity of fungal BGCs. Previous studies on fungal BGC discovery present limited scope and can restrict the discovery of new BGCs. In this work, we introduce TOUCAN, a supervised learning framework for fungal BGC discovery. Unlike previous methods, TOUCAN is capable of predicting BGCs on amino acid sequences, facilitating its use on newly sequenced and not yet curated data. It relies on three main pillars: rigorous selection of datasets by BGC experts; combination of functional, evolutionary and compositional features coupled with outperforming classifiers; and robust post-processing methods. TOUCAN's best-performing model yields an F-measure of 0.982 on BGC regions in the Aspergillus niger genome. Overall results show that TOUCAN outperforms previous approaches. TOUCAN focuses on fungal BGCs but can be easily adapted to expand its scope to process other species or include new features.

14.
J Comput Biol ; 26(6): 519-535, 2019 06.
Article in English | MEDLINE | ID: mdl-31050550

ABSTRACT

The classification of pathogens in emerging and re-emerging viruses is of major interest in taxonomic studies, functional genomics, host-pathogen interplay, prevention, and disease treatments. It consists of assigning a given sequence to its related group of known sequences sharing similar characteristics and traits. The challenges to such classification could be associated with several virus properties including recombination, mutation rate, multiplicity of motifs, and diversity. In domains such as pathogen monitoring and surveillance, it is important to detect and quantify known and novel taxa without exploiting the full and accurate alignments or virus family profiles. In this study, we propose an alignment-free method, CASTOR-KRFE, to detect discriminating subsequences within known pathogen sequences to accurately classify unknown pathogen sequences. This method includes three major steps: (1) vectorization of known viral genomic sequences based on k-mers to constitute the potential features, (2) an efficient way of pattern extraction and evaluation maximizing classification performance, and (3) prediction of the minimal set of features fitting a given criterion (threshold of performance metric and maximum number of features). We assessed this method through a jackknife data partitioning on a dozen varied virus data sets, covering the seven major virus groups and including influenza virus, Ebola virus, human immunodeficiency virus 1, hepatitis C virus, hepatitis B virus, and human papillomavirus. CASTOR-KRFE provides a weighted average F-measure >0.96 over a wide range of viruses. Our method also shows better performance on complex virus data sets than multiple subsequences extractor for classification (MISSEL), a subsequence extraction method, and the Discriminative mode of MEME patterns extraction tool.
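
Illustrative sketch (not part of the abstract): step (1) above, the k-mer vectorization of sequences, can be approximated in a few lines of Python. This is a generic k-mer counting sketch, not the CASTOR-KRFE code; the function names are hypothetical, and steps (2) and (3) (feature evaluation and minimal-set selection) are deliberately left out because the abstract does not specify them.

```python
from collections import Counter


def kmer_counts(sequence, k):
    """Count every overlapping substring of length k in the sequence."""
    return Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))


def vectorize(sequences, k):
    """Return the sorted k-mer vocabulary and one count vector per sequence."""
    counts = [kmer_counts(seq, k) for seq in sequences]
    vocabulary = sorted(set().union(*counts))
    matrix = [[c[kmer] for kmer in vocabulary] for c in counts]
    return vocabulary, matrix
```

Each column of the resulting matrix is a candidate feature; a feature-selection procedure would then search for the smallest subset of columns that still meets a chosen performance threshold.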


Subject(s)
Viral Genome/genetics, DNA Sequence Analysis/methods, Viruses/genetics, Algorithms, Genomics/methods, Humans, Sequence Alignment/methods
15.
Methods Mol Biol ; 422: 171-84, 2008.
Article in English | MEDLINE | ID: mdl-18629667

ABSTRACT

This chapter introduces the problem of ancestral sequence reconstruction: given a set of extant orthologous DNA genomic sequences (or even whole-genomes), together with a phylogenetic tree relating these sequences, predict the DNA sequence of all ancestral species in the tree. Blanchette et al. (1) have shown that for certain sets of species (in particular, for eutherian mammals), very accurate reconstruction can be obtained. We explain the main steps involved in this process, including multiple sequence alignment, insertion and deletion inference, substitution inference, and gene arrangement inference. We also describe a simulation-based procedure to assess the accuracy of the reconstructed sequences. The whole reconstruction process is illustrated using a set of mammalian sequences from the CFTR region.


Subject(s)
Computational Biology/methods, Phylogeny, Animals, Base Sequence, Computer Simulation, Humans, Molecular Sequence Data
16.
Article in English | MEDLINE | ID: mdl-29994265

ABSTRACT

This paper introduces a method for automatic workflow extraction from texts using Process-Oriented Case-Based Reasoning (POCBR). While the current workflow management systems implement mostly different complicated graphical tasks based on advanced distributed solutions (e.g. cloud computing and grid computation), workflow knowledge acquisition from texts using case-based reasoning represents more expressive and semantic case representations. In this context, we propose an ontology-based workflow extraction framework to acquire processual knowledge from texts. Our methodology extends classic NLP techniques to extract and disambiguate tasks and relations in texts. Using a graph-based representation of workflows and a domain ontology, our extraction process uses a context-aware approach to recognize workflow components: data and control flows. We applied our framework in a technical domain in bioinformatics, namely phylogenetic analyses. An evaluation based on workflow semantic similarities on a gold standard indicates that our approach provides promising results in the process extraction domain. Both data and implementation of our framework are available at: http://labo.bioinfo.uqam.ca/tgrowler.

17.
Infect Genet Evol ; 62: 141-150, 2018 08.
Article in English | MEDLINE | ID: mdl-29678797

ABSTRACT

Pregnancy is associated with modulations of maternal immunity that contribute to foeto-maternal tolerance. To understand whether and how these alterations impact antiviral immunity, a detailed cross-sectional analysis of selective pressures exerted on HIV-1 envelope amino-acid sequences was performed in a group of pregnant (n = 32) and non-pregnant (n = 44) HIV-infected women in the absence of antiretroviral therapy (ART). Independent of HIV-1 subtype, p-distance, dN and dS were all strongly correlated with one another but were not significantly different in pregnant as compared to non-pregnant patients. Differential levels of selective pressure applied on different Env subdomains displayed similar yet non-identical patterns between the two groups, with pressure being significantly lower in constant regions C1 and C2 than in V1, V2, V3 and C3. To draw a general picture of the selection applied on the envelope and compensate for inter-individual variations, we performed a binomial test on selection frequency data pooled from pregnant and non-pregnant women. This analysis uncovered 42 positions, present in both groups, exhibiting a statistically significant frequency of selection that invariably mapped to the surface of the Env protein, with the great majority located within epitopes recognized by Env-specific antibodies or sites associated with the development of cross-reactive neutralizing activity. The median frequency of occurrence of positive selection per site was significantly lower in pregnant versus non-pregnant women. Furthermore, examination of the distribution of positively selected sites using a hypergeometric test revealed that only 2 positions (D137 and S142) significantly differed between the 2 groups. Taken together, these results indicate that pregnancy is associated with subtle yet distinctive changes in selective pressures exerted on the HIV-1 Env protein that are compatible with transient modulations of maternal immunity.


Subject(s)
HIV Infections/virology, HIV-1/genetics, Infectious Pregnancy Complications/virology, Human Immunodeficiency Virus env Gene Products/genetics, Molecular Evolution, Female, Humans, Molecular Models, Pregnancy, Protein Conformation, Genetic Selection
18.
J Comput Biol ; 14(4): 446-61, 2007 May.
Article in English | MEDLINE | ID: mdl-17572023

ABSTRACT

Given a multiple alignment of orthologous DNA sequences and a phylogenetic tree for these sequences, we investigate the problem of reconstructing the most likely scenario of insertions and deletions capable of explaining the gaps observed in the alignment. This problem, which we call the Indel Maximum Likelihood Problem (IMLP), is an important step toward the reconstruction of ancestral genomic sequences, and is important for studying evolutionary processes, genome function, adaptation and convergence. We solve the IMLP using a new type of tree hidden Markov model whose states correspond to single-base evolutionary scenarios and where transitions model dependencies between neighboring columns. The standard Viterbi and Forward-backward algorithms are optimized to produce the most likely ancestral reconstruction and to compute the level of confidence associated with specific regions of the reconstruction. A heuristic is presented to make the method practical for large data sets, while retaining an extremely high degree of accuracy. The methods are illustrated on a 1-Mb alignment of the CFTR regions from 12 mammals.


Subject(s)
Molecular Evolution, Genome, Mammals/genetics, Genetic Models, Algorithms, Animals, Cystic Fibrosis Transmembrane Conductance Regulator/genetics, Markov Chains
19.
J Comput Biol ; 24(8): 799-808, 2017 Aug.
Article in English | MEDLINE | ID: mdl-28742392

ABSTRACT

Contemporary workflow management systems are driven by explicit process models specifying the interdependencies between tasks. Creating these models is a challenging and time-consuming task. Existing approaches to mining concrete workflows into models tackle design aspects related to the diverging abstraction levels of the tasks. Concrete workflow logs represent tasks and cases of concrete events (partially or totally ordered) grounding hidden multilevel (abstract) semantics and contexts. Relevant generalized events could be rediscovered within these processes. In this article, we propose an ontology-based workflow mining system to generate patterns from sequences of events that are themselves extracted from texts. Our system T-GOWler (Generalized Ontology-based WorkfLow minER within Texts) is based on two ontology-based modules: a workflow extractor and a pattern miner. To this end, it uses two different ontologies: a domain one (to support workflow extraction from texts) and a processual one (to mine generalized patterns from extracted workflows).


Subject(s)
Algorithms, Computational Biology/methods, Data Mining/methods, Gene Ontology, Semantics, Humans, Phylogeny, Workflow