1.
Nat Methods ; 20(1): 104-111, 2023 01.
Article in English | MEDLINE | ID: mdl-36522501

ABSTRACT

Protein sequence alignment is a key component of most bioinformatics pipelines to study the structures and functions of proteins. Aligning highly divergent sequences remains, however, a difficult task that current algorithms often fail to perform accurately, leaving many proteins or open reading frames poorly annotated. Here we leverage recent advances in deep learning for language modeling and differentiable programming to propose DEDAL (deep embedding and differentiable alignment), a flexible model to align protein sequences and detect homologs. DEDAL is a machine learning-based model that learns to align sequences by observing large datasets of raw protein sequences and of correct alignments. We show that, once trained, DEDAL improves alignment correctness by up to two- or threefold over existing methods on remote homologs and better discriminates remote homologs from evolutionarily unrelated sequences, paving the way to improvements on many downstream tasks relying on sequence alignment in structural and functional genomics.
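
For orientation, the kind of alignment score that classical methods compute can be sketched with the Smith-Waterman local alignment recurrence. This is a minimal sketch of the classical dynamic program with a linear gap penalty, not DEDAL's learned model; the function name and the scoring values (`match`, `mismatch`, `gap`) are illustrative assumptions.

```python
def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-2):
    """Return the best local alignment score between sequences a and b.

    Classical Smith-Waterman with a linear gap penalty. The scoring
    parameters are illustrative assumptions, not values from the abstract.
    """
    rows, cols = len(a) + 1, len(b) + 1
    # h[i][j] = best score of a local alignment ending at a[i-1], b[j-1]
    h = [[0] * cols for _ in range(rows)]
    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            h[i][j] = max(
                0,                      # start a new local alignment
                h[i - 1][j - 1] + sub,  # align a[i-1] with b[j-1]
                h[i - 1][j] + gap,      # gap in b
                h[i][j - 1] + gap,      # gap in a
            )
            best = max(best, h[i][j])
    return best
```

With fixed scores like these, the result depends entirely on hand-chosen parameters; the abstract describes learning to align from data instead.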


Subject(s)
Algorithms, Proteins, Amino Acid Sequence, Proteins/genetics, Proteins/chemistry, Sequence Alignment, Genomics
2.
Bioinformatics ; 35(15): 2680-2682, 2019 08 01.
Article in English | MEDLINE | ID: mdl-30541062

ABSTRACT

SUMMARY: Combinatorial association mapping aims to assess the statistical association of higher-order interactions of genetic markers with a phenotype of interest. This article presents combinatorial association mapping (CASMAP), a software package that leverages recent advances in significant pattern mining to overcome the statistical and computational challenges that have hindered combinatorial association mapping. CASMAP can be used to perform region-based association studies and to detect higher-order epistatic interactions of genetic variants. Most importantly, unlike other existing significant pattern mining-based tools, CASMAP allows for the correction of categorical covariates such as age or gender, making it suitable for genome-wide association studies. AVAILABILITY AND IMPLEMENTATION: The R and Python packages can be downloaded from our GitHub repository http://github.com/BorgwardtLab/CASMAP. The R package is also available on CRAN. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.


Subject(s)
Single Nucleotide Polymorphism, Genome-Wide Association Study, Phenotype, Software
3.
Bioinformatics ; 34(3): 530-532, 2018 02 01.
Article in English | MEDLINE | ID: mdl-29028902

ABSTRACT

Summary: Measuring the similarity of graphs is a fundamental step in the analysis of graph-structured data, which is omnipresent in computational biology. Graph kernels have been proposed as a powerful and efficient approach to this problem of graph comparison. Here we provide graphkernels, the first R and Python graph kernel libraries including baseline kernels such as label histogram based kernels, classic graph kernels such as random walk based kernels, and the state-of-the-art Weisfeiler-Lehman graph kernel. The core of all graph kernels is implemented in C++ for efficiency. Using the kernel matrices computed by the package, we can easily perform tasks such as classification, regression and clustering on graph-structured samples. Availability and implementation: The R and Python packages including source code are available at https://CRAN.R-project.org/package=graphkernels and https://pypi.python.org/pypi/graphkernels. Contact: mahito@nii.ac.jp or elisabetta.ghisu@bsse.ethz.ch. Supplementary information: Supplementary data are available online at Bioinformatics.
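
To make the simplest of the baseline kernels concrete, here is a minimal sketch of a vertex label histogram kernel in plain Python. It does not use the graphkernels package or reproduce its API; the function names and the representation of a graph as a bare list of vertex labels are assumptions made for illustration.

```python
from collections import Counter


def label_histogram_kernel(labels_g, labels_h):
    """Dot product of the vertex-label count vectors of two graphs.

    Each graph is represented only by the list of its vertex labels
    (an illustrative simplification; edges are ignored by this kernel).
    """
    counts_g, counts_h = Counter(labels_g), Counter(labels_h)
    shared = counts_g.keys() & counts_h.keys()
    return sum(counts_g[label] * counts_h[label] for label in shared)


def kernel_matrix(graphs):
    """Symmetric matrix of pairwise kernel values for a list of graphs."""
    return [[label_histogram_kernel(g, h) for h in graphs] for g in graphs]
```

A matrix built this way can be passed to any learner that accepts a precomputed kernel, which is how the abstract describes using the package's output for classification, regression and clustering.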


Subject(s)
Computational Biology/methods, Software
4.
Bioinformatics ; 33(12): 1820-1828, 2017 Jun 15.
Article in English | MEDLINE | ID: mdl-28200033

ABSTRACT

MOTIVATION: Genetic heterogeneity is the phenomenon that distinct genetic variants may give rise to the same phenotype. The recently introduced algorithm Fast Automatic Interval Search (FAIS) enables the genome-wide search of candidate regions for genetic heterogeneity in the form of any contiguous sequence of variants, and achieves high computational efficiency and statistical power. Although FAIS can test all possible genomic regions for association with a phenotype, a key limitation is its inability to correct for confounders such as gender or population structure, which may lead to numerous false-positive associations. RESULTS: We propose FastCMH, a method that overcomes this problem by properly accounting for categorical confounders, while still retaining statistical power and computational efficiency. Experiments comparing FastCMH with FAIS and multiple kinds of burden tests on simulated data, as well as on human and Arabidopsis samples, demonstrate that FastCMH can drastically reduce genomic inflation and discover associations that are missed by standard burden tests. AVAILABILITY AND IMPLEMENTATION: An R package fastcmh is available on CRAN and the source code can be found at: https://www.bsse.ethz.ch/mlcb/research/bioinformatics-and-computational-biology/fastcmh.html. CONTACT: felipe.llinares@bsse.ethz.ch. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
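
The abstract does not spell out the test statistic, but the name FastCMH suggests the Cochran-Mantel-Haenszel (CMH) test for stratified 2x2 tables. Assuming that reading, the following minimal sketch shows how stratifying by a categorical confounder can remove a spurious association. The function name and input layout are hypothetical, and this is the textbook statistic without continuity correction, not the fastcmh package's implementation.

```python
def cmh_statistic(tables):
    """Cochran-Mantel-Haenszel statistic for a list of 2x2 tables.

    `tables` holds one (a, b, c, d) tuple of counts per stratum, e.g. per
    level of a categorical confounder, laid out as
        [[a, b],
         [c, d]].
    Under the null hypothesis the statistic is approximately chi-square
    distributed with one degree of freedom. No continuity correction.
    """
    sum_a = sum_expected = sum_variance = 0.0
    for a, b, c, d in tables:
        n = a + b + c + d
        if n < 2:
            continue  # a stratum this small carries no information
        sum_a += a
        sum_expected += (a + b) * (a + c) / n
        sum_variance += (a + b) * (c + d) * (a + c) * (b + d) / (n * n * (n - 1))
    if sum_variance == 0:
        return 0.0
    return (sum_a - sum_expected) ** 2 / sum_variance


# Two strata with no association inside either stratum ...
stratified = [(8, 2, 4, 1), (1, 4, 2, 8)]
# ... but pooling them, which ignores the confounder, creates one.
pooled = [(9, 6, 6, 9)]
```

Here `cmh_statistic(stratified)` is zero while `cmh_statistic(pooled)` is positive, which is the kind of confounder-induced signal the abstract says an uncorrected test can report as a false positive.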


Subject(s)
Genetic Heterogeneity, Genomics/methods, Software, Algorithms, Arabidopsis/genetics, Female, Population Genetics/methods, Humans, Male
5.
Bioinformatics ; 31(12): i240-9, 2015 Jun 15.
Article in English | MEDLINE | ID: mdl-26072488

ABSTRACT

MOTIVATION: Genetic heterogeneity, the fact that several sequence variants give rise to the same phenotype, is a phenomenon that is of the utmost interest in the analysis of complex phenotypes. Current approaches for finding regions in the genome that exhibit genetic heterogeneity suffer from at least one of two shortcomings: (i) they require the definition of an exact interval in the genome that is to be tested for genetic heterogeneity, potentially missing intervals of high relevance, or (ii) they suffer from an enormous multiple hypothesis testing problem due to the large number of potential candidate intervals being tested, which results in either many false positives or a lack of power to detect true intervals. RESULTS: Here, we present an approach that overcomes both problems: it allows one to automatically find all contiguous sequences of single nucleotide polymorphisms in the genome that are jointly associated with the phenotype. It also solves both the inherent computational efficiency problem and the statistical problem of multiple hypothesis testing, which are both caused by the huge number of candidate intervals. We demonstrate on Arabidopsis thaliana genome-wide association study data that our approach can discover regions that exhibit genetic heterogeneity and would be missed by single-locus mapping. CONCLUSIONS: Our novel approach can contribute to the genome-wide discovery of intervals that are involved in the genetic heterogeneity underlying complex phenotypes. AVAILABILITY AND IMPLEMENTATION: The code can be obtained at: http://www.bsse.ethz.ch/mlcb/research/bioinformatics-and-computational-biology/sis.html.
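
The scale of the multiple testing problem mentioned above can be illustrated with a short sketch: a sequence of n variants contains n(n+1)/2 contiguous intervals. The function names are hypothetical, and the Bonferroni threshold is used only to show the scale; the abstract does not say which correction the authors apply.

```python
def count_intervals(n_snps):
    """Number of contiguous intervals in a sequence of n_snps variants."""
    return n_snps * (n_snps + 1) // 2


def enumerate_intervals(n_snps):
    """Yield every contiguous interval as an inclusive (start, end) pair."""
    for start in range(n_snps):
        for end in range(start, n_snps):
            yield (start, end)


def bonferroni_threshold(alpha, n_tests):
    """Per-test significance level under a plain Bonferroni correction."""
    return alpha / n_tests
```

For 200,000 variants, `count_intervals` gives about 2 x 10^10 candidate intervals, so a naive per-test threshold becomes extremely small and explicit enumeration becomes expensive, which matches the two problems the abstract names.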


Subject(s)
Genetic Heterogeneity, Genome-Wide Association Study/methods, Single Nucleotide Polymorphism, Algorithms, Arabidopsis/genetics, Phenotype
6.
Nat Biotechnol ; 41(2): 232-238, 2023 02.
Article in English | MEDLINE | ID: mdl-36050551

ABSTRACT

Circular consensus sequencing with Pacific Biosciences (PacBio) technology generates long (10-25 kilobases), accurate 'HiFi' reads by combining serial observations of a DNA molecule into a consensus sequence. The standard approach to consensus generation, pbccs, uses a hidden Markov model. We introduce DeepConsensus, which uses an alignment-based loss to train a gap-aware transformer-encoder for sequence correction. Compared to pbccs, DeepConsensus reduces read errors by 42%. This increases the yield of PacBio HiFi reads at Q20 by 9%, at Q30 by 27% and at Q40 by 90%. With two SMRT Cells of HG003, reads from DeepConsensus improve hifiasm assembly contiguity (NG50 4.9 megabases (Mb) to 17.2 Mb), increase gene completeness (94% to 97%), reduce the false gene duplication rate (1.1% to 0.5%), improve assembly base accuracy (Q43 to Q45) and reduce variant-calling errors by 24%. DeepConsensus models could be trained to the general problem of analyzing the alignment of other types of sequences, such as unique molecular identifiers or genome assemblies.
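
The Q20, Q30, Q40 and Q43 to Q45 figures are Phred-scaled quality values, where Q = -10 log10(p) and p is the error probability. The following minimal sketch converts between the two; the function names are illustrative.

```python
import math


def phred_to_error(q):
    """Error probability corresponding to a Phred quality value q."""
    return 10 ** (-q / 10)


def error_to_phred(p):
    """Phred quality value corresponding to an error probability p."""
    return -10 * math.log10(p)


def error_ratio(q_before, q_after):
    """Ratio of error probabilities after and before a quality change."""
    return phred_to_error(q_after) / phred_to_error(q_before)
```

On this scale Q20 is one error in 100 bases, Q30 one in 1,000 and Q40 one in 10,000. A change from Q43 to Q45 multiplies the error probability by 10^(-0.2), roughly 0.63.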


Subject(s)
High-Throughput Nucleotide Sequencing, DNA Sequence Analysis