1.
Article in English | MEDLINE | ID: mdl-34941517

ABSTRACT

Protein-Protein Interactions (PPIs) are a crucial mechanism underpinning the function of the cell. So far, a wide range of machine-learning-based methods have been proposed for predicting these relationships. Their success is heavily dependent on the construction of the underlying feature vectors, with most using a set of physico-chemical properties derived from the sequence. Few work directly with the sequence itself. In this paper, we explore the utility of sequence embeddings for predicting protein-protein interactions. We construct a protein pair feature vector by concatenating the embeddings of the pair's constituent sequences. These feature vectors are then used as input to a binary classifier to make predictions. To learn sequence embeddings, we use two established Word2Vec-based methods, Seq2Vec and BioVec, and we also introduce a novel feature construction method called SuperVecNW. The embeddings generated through SuperVecNW capture some network information in addition to the contextual information present in the sequences. We test the efficacy of our proposed approach on human and yeast PPI datasets and on three well-known networks: CD9, the Ras-Raf-Mek-Erk-Elk-Srf pathway, and a Wnt-related network. We demonstrate that low-dimensional sequence embeddings provide better results than most alternative representations based on physico-chemical properties while offering a far simpler approach to feature vector construction.
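
The pair-feature construction described in this abstract (concatenating the embeddings of the two sequences) can be sketched as follows. This is a minimal illustration only: the protein names, the three-dimensional toy embeddings, and the helper `pair_feature_vector` are hypothetical and are not taken from the paper, which learns its embeddings with Seq2Vec, BioVec or SuperVecNW.

```python
from typing import Dict, List, Sequence


def pair_feature_vector(emb_a: Sequence[float], emb_b: Sequence[float]) -> List[float]:
    """Concatenate two per-protein embeddings into one pair feature vector."""
    return list(emb_a) + list(emb_b)


# Toy embeddings; names and values are made up for illustration.
toy_embeddings: Dict[str, List[float]] = {
    "protA": [0.1, -0.4, 0.7],
    "protB": [0.3, 0.2, -0.5],
}

features = pair_feature_vector(toy_embeddings["protA"], toy_embeddings["protB"])
print(len(features))  # the pair vector is twice the embedding dimension
```

Plain concatenation is order-dependent, so (A, B) and (B, A) yield different vectors; how a classifier should treat that is a design choice the abstract does not specify.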

2.
Front Genet ; 13: 643592, 2022.
Article in English | MEDLINE | ID: mdl-35295949

ABSTRACT

We present a novel approach to the Metagenomic Geolocation Challenge based on random projection of the sample reads from each location. This approach explores the direct use of k-mer composition to characterise samples so that we can avoid the computationally demanding step of aligning reads to available microbial reference sequences. Each variable-length read is converted into a fixed-length, k-mer-based read signature. Read signatures are then clustered into location signatures which provide a more compact characterisation of the reads at each location. Classification is then treated as a problem in ranked retrieval of locations, where signature similarity is used as a measure of similarity in microbial composition. We evaluate our approach using the CAMDA 2020 Challenge dataset and obtain promising results based on nearest neighbour classification. The main findings of this study are that k-mer representations carry sufficient information to reveal the origin of many of the CAMDA 2020 Challenge metagenomic samples, and that this reference-free approach can be achieved with much less computation than methods that need reads to be assigned to operational taxonomic units. These advantages become clear through comparison with previously published work on the CAMDA 2019 Challenge data.
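
To make the idea of a fixed-length, k-mer-based read signature concrete, here is a hedged sketch of one generic random-projection scheme: every k-mer is mapped to a hash-derived +1/-1 vector, the vectors are summed over the read, and only the sign of each position is kept. The function names, the SHA-256 hashing, `k = 5` and `width = 64` are assumptions made for illustration; the abstract does not state the authors' exact construction or parameters.

```python
import hashlib
from typing import List, Sequence


def kmer_vector(kmer: str, width: int) -> List[int]:
    """Deterministic pseudo-random +1/-1 vector for a k-mer, derived from a hash."""
    bits: List[int] = []
    counter = 0
    while len(bits) < width:
        digest = hashlib.sha256(f"{kmer}:{counter}".encode()).digest()
        for byte in digest:
            for shift in range(8):
                bits.append(1 if (byte >> shift) & 1 else -1)
        counter += 1
    return bits[:width]


def read_signature(read: str, k: int = 5, width: int = 64) -> List[int]:
    """Sum the k-mer vectors of a read and keep one bit (the sign) per position."""
    totals = [0] * width
    for i in range(len(read) - k + 1):
        for j, value in enumerate(kmer_vector(read[i:i + k], width)):
            totals[j] += value
    return [1 if total > 0 else 0 for total in totals]


def hamming_similarity(a: Sequence[int], b: Sequence[int]) -> float:
    """Fraction of positions at which two equal-length signatures agree."""
    return sum(x == y for x, y in zip(a, b)) / len(a)


short_read = read_signature("ACGTACGTAC")
long_read = read_signature("ACGTACGTACGGATTACAGGCTTAACG")
print(len(short_read), len(long_read))  # same length despite different read lengths
```

Signatures built this way can be compared position by position, which is what makes ranked retrieval by signature similarity cheap compared with read alignment.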

3.
IEEE Trans Vis Comput Graph ; 28(12): 4477-4489, 2022 12.
Article in English | MEDLINE | ID: mdl-34156943

ABSTRACT

Genomic research emerges from collaborative work within and across different scientific disciplines. A diverse range of visualisation techniques has been employed to aid this research, yet relatively little is known as to how these techniques facilitate collaboration. We conducted a case study of collaborative research within a biomedical institute to learn more about the role visualisation plays in genomic mapping. Interviews were conducted with molecular biologists (N = 5) and bioinformaticians (N = 6). We found that genomic research comprises a variety of distinct disciplines engaged in complex analytic tasks that each resist simplification, and their complexity influences how visualisations were used. Visualisation use was impacted by group-specific interactions and temporal work patterns. Visualisations were also crucial to the scientific workflow, used for both question formation and confirmation of hypotheses, and acted as an anchor for the communication of ideas and discussion. In the latter case, two approaches were taken: providing collaborators with either interactive or static imagery representing a viewpoint. The use of generic software for simplified visualisations, and quick production and curation was also noted. We discuss these findings with reference to group-specific interactions and present recommendations for improving collaborative practices through visual analytics.


Subject(s)
Computer Graphics , Software , Communication , Genomics , Chromosome Mapping
4.
PLoS One ; 15(3): e0216636, 2020.
Article in English | MEDLINE | ID: mdl-32168338

ABSTRACT

Similarity-based search of sequence collections is a core task in bioinformatics, one dominated for most of the genomic era by exact and heuristic alignment-based algorithms. However, even efficient heuristics such as BLAST may not scale to the data sets now emerging, motivating a range of alignment-free alternatives exploiting the underlying lexical structure of each sequence. In this paper, we introduce two supervised approaches, SuperVec and SuperVecX, to learn sequence embeddings. These methods extend earlier Representation Learning (RepL)-based methods to include class-related information for each sequence during training. Including class information ensures that related sequence fragments have proximal representations in the target space, better reflecting the structure of the domain. We show the quality of the embeddings learned through these methods on (i) sequence retrieval and (ii) classification tasks. We also propose a hierarchical tree-based approach specifically designed for the sequence retrieval problem. The resulting methods, which we term H-SuperVec or H-SuperVecX, according to their respective use of SuperVec or SuperVecX, learn embeddings across a range of feature spaces based on exclusive and exhaustive subsets of the class labels. Experiments show that the proposed methods perform better for retrieval and classification tasks than existing (unsupervised) RepL-based approaches. Further, the new methods are an order of magnitude faster than BLAST for the database retrieval task, supporting hybrid approaches that rapidly filter the collection so that only potentially relevant records remain. Such filtering of the original database allows slower but more accurate methods to be executed quickly over a far smaller dataset. Thus, we may achieve faster query processing and higher precision than before.


Subject(s)
Algorithms , Machine Learning , Sequence Homology , Area Under Curve , Factual Databases , Time Factors
5.
Brief Bioinform ; 20(2): 426-435, 2019 03 22.
Article in English | MEDLINE | ID: mdl-28673025

ABSTRACT

We are amidst an ongoing flood of sequence data arising from the application of high-throughput technologies, and a concomitant fundamental revision in our understanding of how genomes evolve individually and within the biosphere. Workflows for phylogenomic inference must accommodate data that are not only much larger than before, but often more error prone and perhaps misassembled, or not assembled in the first place. Moreover, genomes of microbes, viruses and plasmids evolve not only by tree-like descent with modification but also by incorporating stretches of exogenous DNA. Thus, next-generation phylogenomics must address computational scalability while rethinking the nature of orthogroups, the alignment of multiple sequences and the inference and comparison of trees. New phylogenomic workflows have begun to take shape based on so-called alignment-free (AF) approaches. Here, we review the conceptual foundations of AF phylogenetics for the hierarchical (vertical) and reticulate (lateral) components of genome evolution, focusing on methods based on k-mers. We reflect on what seems to be successful, and on where further development is needed.


Subject(s)
Molecular Evolution , Genome , Phylogeny , Algorithms , Animals , Humans , Microbiota/genetics , Genetic Models , Sequence Alignment , DNA Sequence Analysis , Viruses/genetics
6.
BMC Bioinformatics ; 19(Suppl 20): 509, 2018 Dec 21.
Article in English | MEDLINE | ID: mdl-30577803

ABSTRACT

BACKGROUND: Sequencing highly variable 16S regions is a common and often effective approach to the study of microbial communities, and next-generation sequencing (NGS) technologies provide abundant quantities of data for analysis. However, the speed of existing analysis pipelines may limit our ability to work with these quantities of data. Furthermore, the limited coverage of existing 16S databases may hamper our ability to characterise these communities, particularly in the context of complex or poorly studied environments. RESULTS: In this article we present the SigClust algorithm, a novel clustering method involving the transformation of sequence reads into binary signatures. When compared to other published methods, SigClust yields superior cluster coherence and separation of metagenomic read data, while operating within substantially reduced timeframes. We demonstrate its utility on published Illumina datasets and on a large collection of labelled wound reads sourced from patients in a wound clinic. The temporal analysis is based on tracking the dominant clusters of wound samples over time. The analysis can identify markers of both healing and non-healing wounds in response to treatment. Prominent clusters are found, corresponding to bacterial species known to be associated with unfavourable healing outcomes, including a number of strains of Staphylococcus aureus. CONCLUSIONS: SigClust identifies clusters rapidly and supports an improved understanding of the wound microbiome without reliance on a reference database. The results indicate a promising use for a SigClust-based pipeline in wound analysis and prediction, and a possible novel method for wound management and treatment.


Subject(s)
Data Analysis , Metagenomics/methods , Algorithms , Cluster Analysis , Humans , Microbiota/genetics
7.
PeerJ ; 6: e5515, 2018.
Article in English | MEDLINE | ID: mdl-30155371

ABSTRACT

BACKGROUND: It is possible to detect bacterial species in shotgun metagenome datasets through the presence of only a few sequence reads. However, false positive results can arise, as was the case in the initial findings of a recent New York City subway metagenome project. False positives are especially likely when two closely related species are present in the same sample. Bacillus anthracis, the etiologic agent of anthrax, is a high-consequence pathogen that shares >99% average nucleotide identity with Bacillus cereus group (BCerG) genomes. Our goal was to create an analysis tool that used k-mers to detect B. anthracis, incorporating information about the coverage of BCerG in the metagenome sample. METHODS: Using public complete genome sequence datasets, we identified a set of 31-mer signatures that differentiated B. anthracis from other members of the B. cereus group (BCerG), and another set which differentiated BCerG genomes (including B. anthracis) from other Bacillus strains. We also created a set of 31-mers for detecting the lethal factor gene, the key genetic diagnostic of the presence of anthrax-causing bacteria. We created synthetic sequence datasets based on existing genomes to test the accuracy of a k-mer-based detection model. RESULTS: We found 239,503 B. anthracis-specific 31-mers (the Ba31 set), 10,183 BCerG 31-mers (the BCerG31 set), and 2,617 lethal factor k-mers (the lef31 set). We showed that false positive B. anthracis k-mers, which arise from random sequencing errors, are observable at high genome coverages of B. cereus. We also showed that there is a "gray zone" below 0.184× coverage of the B. anthracis genome sequence, in which we cannot expect with high probability to identify lethal factor k-mers. We created a linear regression model to differentiate the presence of B. anthracis-like chromosomes from sequencing errors given the BCerG background coverage. We showed that while shotgun datasets from the New York City subway metagenome project had no matches to lef31 k-mers and hence were negative for B. anthracis, some samples showed evidence of strains very closely related to the pathogen. DISCUSSION: This work shows how extensive libraries of complete genomes can be used to create organism-specific signatures to help interpret metagenomes. We contrast "specialist" approaches to metagenome analysis such as this work to "generalist" software that seeks to classify all organisms present in the sample and note the more general utility of a k-mer filter approach when taxonomic boundaries lack clarity or high levels of precision are required.
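
The core idea of an organism-specific k-mer signature set can be illustrated with a small sketch: collect the k-mers of a target genome, subtract every k-mer that also occurs in background genomes, and count how many signature k-mers appear in reads. Everything here is a toy assumption (the short made-up sequences, `k = 4`, and the helper names); the study itself used 31-mers from complete genomes, and this sketch ignores reverse complements, sequencing errors and the coverage-based regression model the abstract describes.

```python
from typing import Iterable, Set


def kmers(seq: str, k: int) -> Set[str]:
    """Set of all distinct k-mers in a sequence (forward strand only)."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def specific_kmers(target: str, background: Iterable[str], k: int) -> Set[str]:
    """k-mers present in the target but absent from every background sequence."""
    signature = kmers(target, k)
    for genome in background:
        signature -= kmers(genome, k)
    return signature


def count_hits(reads: Iterable[str], signature: Set[str], k: int) -> int:
    """Total number of distinct signature k-mers found per read, summed over reads."""
    return sum(len(kmers(read, k) & signature) for read in reads)


# Made-up toy sequences that differ at a single position.
toy_target = "ACGTACGTTAGC"
toy_background = ["ACGTACGTAAGC"]

signature_set = specific_kmers(toy_target, toy_background, k=4)
hits = count_hits(["CGTTAG", "ACGTAC"], signature_set, k=4)
print(sorted(signature_set), hits)
```

With a single differing base, only the k-mers that span that base survive the subtraction, which is why closely related genomes yield comparatively small signature sets.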

8.
Sci Rep ; 4: 6504, 2014 Sep 30.
Article in English | MEDLINE | ID: mdl-25266120

ABSTRACT

Alignment-free methods, in which shared properties of sub-sequences (e.g. identity or match length) are extracted and used to compute a distance matrix, have recently been explored for phylogenetic inference. However, the scalability and robustness of these methods to key evolutionary processes remain to be investigated. Here, using simulated sequence sets of various sizes in both nucleotides and amino acids, we systematically assess the accuracy of phylogenetic inference using an alignment-free approach, based on D2 statistics, under different evolutionary scenarios. We find that compared to a multiple sequence alignment approach, D2 methods are more robust against among-site rate heterogeneity, compositional biases, genetic rearrangements and insertions/deletions, but are more sensitive to recent sequence divergence and sequence truncation. Across diverse empirical datasets, the alignment-free methods perform well for sequences sharing low divergence, at greater computation speed. Our findings provide strong evidence for the scalability and the potential use of alignment-free methods in large-scale phylogenomics.
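
For readers unfamiliar with D2 statistics, the basic form counts shared words: for a word length k, D2 is the sum over all k-mers w of (count of w in sequence X) × (count of w in sequence Y). The sketch below implements only this basic definition; the function names and toy sequences are illustrative, and the abstract does not say which D2 variant or normalisation the authors used, nor how scores were converted into the distance matrix.

```python
from collections import Counter


def kmer_counts(seq: str, k: int) -> Counter:
    """Count every overlapping k-mer in a sequence."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def d2(x: str, y: str, k: int) -> int:
    """Basic D2 statistic: inner product of the two k-mer count vectors."""
    counts_x, counts_y = kmer_counts(x, k), kmer_counts(y, k)
    # Counter returns 0 for k-mers that are missing from counts_y.
    return sum(count * counts_y[word] for word, count in counts_x.items())


print(d2("ACGTACGT", "ACGTTTTT", 3))
```

Because D2 is a similarity (larger means more shared words), it has to be transformed into a distance before a tree can be built from the resulting matrix.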


Subject(s)
Amino Acid Sequence/genetics , Base Sequence/genetics , Molecular Evolution , Phylogeny , Computational Biology , Sequence Alignment , DNA Sequence Analysis , Protein Sequence Analysis , Software
9.
Bioorg Med Chem Lett ; 18(9): 2878-82, 2008 May 01.
Article in English | MEDLINE | ID: mdl-18434151

ABSTRACT

The complex formed from crystallization of human farnesyl pyrophosphate synthase (hFPPS) from a solution of racemic [6,7-dihydro-5H-cyclopenta[c]pyridin-7-yl(hydroxy)methylene]bis(phosphonic acid) (NE-10501, 8), a chiral analog of the anti-osteoporotic drug risedronate, contained the R enantiomer in the enzyme active site. This enantiospecificity was assessed by computer modeling of inhibitor-active site interactions using Autodock 3, which was also evaluated for predictive ability in calculations of the known configurations of risedronate, zoledronate, and minodronate complexed in the active site of hFPPS. In comparison with these structures, the 8 complex exhibited certain differences, including the presence of only one Mg²⁺, which could contribute to its 100-fold higher IC₅₀. An improved synthesis of 8 is described, which decreases the number of steps from 12 to 8 and increases the overall yield by 17-fold.


Subject(s)
Bone Density Conservation Agents/pharmacology , Computer Simulation , Enzyme Inhibitors/pharmacology , Etidronic Acid/analogs & derivatives , Farnesyltransferase/antagonists & inhibitors , Organophosphonates/pharmacology , Pyridines/pharmacology , Algorithms , Binding Sites , Bone Density Conservation Agents/chemical synthesis , Carcinoma/drug therapy , Carcinoma/enzymology , X-Ray Crystallography , Diphosphonates/chemistry , Diphosphonates/pharmacology , Enzyme Inhibitors/chemical synthesis , Etidronic Acid/chemistry , Etidronic Acid/pharmacology , Humans , Imidazoles/chemistry , Imidazoles/pharmacology , Inhibitory Concentration 50 , Magnesium/chemistry , Magnesium/metabolism , Chemical Models , Organophosphonates/chemical synthesis , Pyridines/chemical synthesis , Risedronic Acid , Stereoisomerism , Structure-Activity Relationship , Zoledronic Acid
10.
Genome Inform ; 19: 178-89, 2007.
Article in English | MEDLINE | ID: mdl-18546515

ABSTRACT

In silico approaches to the identification of bacterial promoters are hampered by poor conservation of their characteristic binding sites. This suggests that the usual position weight matrix models of bacterial promoters are incomplete. A number of methods have been used to overcome this inadequacy, one of which is to incorporate structural properties of DNA. In this paper we describe an extension of the promoter description to include SIDD (stress-induced duplex destabilization), DNA curvature and stacking energy. Although we report the best result to date for a realistic promoter prediction task, surprisingly, DNA structural properties did not contribute significantly to this result. We also demonstrate for the first time that sigma-54 promoters have a stronger association with SIDD than do other promoter types.


Subject(s)
Computational Biology/methods , Bacterial Genome , Genetic Promoter Regions , Algorithms , DNA/chemistry , Bacterial DNA/genetics , Escherichia coli/genetics , Escherichia coli Proteins/genetics , Genetic Models , Nucleic Acid Conformation , RNA Polymerase Sigma 54/genetics , Reproducibility of Results , Software
11.
Int J Neural Syst ; 16(5): 363-70, 2006 Oct.
Article in English | MEDLINE | ID: mdl-17117497

ABSTRACT

Identifying promoters is the key to understanding gene expression in bacteria. Promoters lie in tightly constrained positions relative to the transcription start site (TSS). In this paper, we address the problem of predicting transcription start sites in Escherichia coli. Knowing the TSS position, one can then predict the promoter position to within a few base pairs, and vice versa. The accepted method for promoter prediction is to use a pair of position weight matrices (PWMs), which define conserved motifs at the sigma-factor binding site. However, this method is known to result in a large number of false positive predictions, thereby limiting its usefulness to the experimental biologist. We adopt an alternative approach based on the Support Vector Machine (SVM) using a modified mismatch spectrum kernel. Our modifications involve tagging the motifs with their location, and selectively pruning the feature set. We quantify the performance of several SVM models and a PWM model using a performance metric of area under the detection-error tradeoff (DET) curve. SVM models are shown to outperform the PWM on a biologically realistic TSS prediction task. We also describe a more broadly applicable peak scoring technique which reduces the number of false positive predictions, greatly enhancing the utility of our results.
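
As background to the PWM baseline mentioned in this abstract, the sketch below shows how a single position weight matrix is typically scanned along a sequence using log-odds scores. The three-column toy matrix, the uniform background of 0.25 and the function names are assumptions for illustration only; the abstract describes a pair of PWMs and does not give the matrices or the scoring details used in the paper.

```python
import math
from typing import Dict, List, Tuple

# Hypothetical 3-column PWM favouring the motif "TAT"; probabilities are made up.
toy_pwm: List[Dict[str, float]] = [
    {"A": 0.1, "C": 0.1, "G": 0.1, "T": 0.7},
    {"A": 0.7, "C": 0.1, "G": 0.1, "T": 0.1},
    {"A": 0.1, "C": 0.1, "G": 0.1, "T": 0.7},
]


def score_window(window: str, pwm: List[Dict[str, float]], background: float = 0.25) -> float:
    """Log-odds score of one window: sum of log2(p_i(base) / background)."""
    return sum(math.log2(pwm[i][base] / background) for i, base in enumerate(window))


def best_hit(seq: str, pwm: List[Dict[str, float]]) -> Tuple[float, int]:
    """Return (score, start position) of the highest-scoring window in seq."""
    width = len(pwm)
    scored = [(score_window(seq[i:i + width], pwm), i) for i in range(len(seq) - width + 1)]
    return max(scored)


best_score, best_position = best_hit("GGTATCC", toy_pwm)
print(best_position, round(best_score, 3))
```

Thresholding such scores across a whole genome is what tends to produce many false positives, since short degenerate motifs match frequently by chance.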


Subject(s)
Escherichia coli/genetics , Bacterial Gene Expression Regulation/genetics , Genetic Promoter Regions/genetics , Transcription Initiation Site/physiology , Genetic Transcription/genetics , Artificial Intelligence