Search | VHL Regional Portal

1.

Fast and Accurate Multiple Sequence Alignment with MSAProbs-MPI.

González-Domínguez, Jorge.

Methods Mol Biol ; 2231: 39-47, 2021.

Article in English | MEDLINE | ID: mdl-33289885

ABSTRACT

Multiple sequence alignment (MSA) is a central step in many bioinformatics and computational biology analyses. Although there exist many methods to perform MSA, most of them fail when dealing with large datasets due to their high computational cost. MSAProbs-MPI is a publicly available tool ( http://msaprobs.sourceforge.net ) that provides highly accurate results in relatively short runtime thanks to exploiting the hardware resources of multicore clusters. In this chapter, I explain the statistical and biological concepts employed in MSAProbs-MPI to complete the alignments, as well as the high-performance computing techniques used to accelerate it. Moreover, I provide some hints about the configuration parameters that should be used to guarantee high-performance executions.

Subject(s)

Computational Biology/methods , Sequence Alignment/methods , Software , Algorithms , Computational Biology/instrumentation , Computing Methodologies , Sequence Alignment/instrumentation

2.

Multiple Sequence Alignment Computation Using the T-Coffee Regressive Algorithm Implementation.

Garriga, Edgar; Di Tommaso, Paolo; Magis, Cedrik; Erb, Ionas; Mansouri, Leila; Baltzis, Athanasios; Floden, Evan; Notredame, Cedric.

Methods Mol Biol ; 2231: 89-97, 2021.

Article in English | MEDLINE | ID: mdl-33289888

ABSTRACT

Many fields of biology rely on the inference of accurate multiple sequence alignments (MSA) of biological sequences. Unfortunately, the problem of assembling an MSA is NP-complete, which limits computation to approximate solutions obtained with heuristics. The progressive algorithm is one of the most popular frameworks for the computation of MSAs. It involves pre-clustering the sequences and aligning them starting with the most similar ones. The scalability of this framework is limited, especially with respect to accuracy. We present here an alternative approach named the regressive algorithm. In this framework, sequences are first clustered and then aligned starting with the most distantly related ones. This approach has been shown to greatly improve accuracy during scale-up, especially on datasets featuring 10,000 sequences or more. Another benefit is the possibility to integrate third-party clustering methods and third-party MSA aligners. The regressive algorithm has been tested on up to 1.5 million sequences, and its implementation is available in the T-Coffee package.

Subject(s)

Computational Biology/methods , Sequence Alignment/methods , Software , Algorithms , Cluster Analysis , Computational Biology/instrumentation , Sequence Alignment/instrumentation

3.

Mustguseal and Sister Web-Methods: A Practical Guide to Bioinformatic Analysis of Protein Superfamilies.

Suplatov, Dmitry; Sharapova, Yana; Svedas, Vytas.

Methods Mol Biol ; 2231: 179-200, 2021.

Article in English | MEDLINE | ID: mdl-33289894

ABSTRACT

Bioinformatic analysis of functionally diverse superfamilies can help to study the structure-function relationship in proteins, but represents a methodological challenge. The Mustguseal web-server can build large structure-guided sequence alignments of thousands of homologs that cover all currently available sequence variants within a common structural fold. The input to the method is a PDB code of the query protein, which represents the protein superfamily of interest. The collection and subsequent alignment of protein sequences and structures is fully automated and driven by the particular choice of parameters. Four integrated sister web-methods (Zebra, pocketZebra, visualCMAT, and Yosshi) are available to further analyze the resulting superimposition and identify conserved, subfamily-specific, and co-evolving residues, as well as to classify and study disulfide bonds in protein superfamilies. The integration of these web-based bioinformatic tools provides an out-of-the-box easy-to-use solution, first of its kind, to study protein function and regulation and design improved enzyme variants for practical applications and selective ligands to modulate their functional properties. In this chapter, we provide a step-by-step protocol for a comprehensive bioinformatic analysis of a protein superfamily using a web-browser as the main tool and notes on selecting the appropriate values for the key algorithm parameters depending on your research objective. The web-servers are freely available to all users at https://biokinet.belozersky.msu.ru/m-platform with no login requirement.

Subject(s)

Computational Biology/methods , Proteins/chemistry , Sequence Alignment/methods , Software , Algorithms , Amino Acid Sequence , Computational Biology/instrumentation , Disulfides/chemistry , Internet , Ligands , Protein Structure, Tertiary , Sequence Alignment/instrumentation

4.

Accelerating Sequence Alignments Based on FM-Index Using the Intel KNL Processor.

Herruzo, Jose M; Gonzalez-Navarro, Sonia; Ibanez-Marin, Pablo; Vinals-Yufera, Victor; Alastruey-Benede, Jesus; Plata, Oscar.

IEEE/ACM Trans Comput Biol Bioinform ; 17(4): 1093-1104, 2020.

Article in English | MEDLINE | ID: mdl-30530369

ABSTRACT

FM-index is a compact data structure suitable for fast matches of short reads to large reference genomes. The matching algorithm using this index exhibits irregular memory access patterns that cause frequent cache misses, resulting in a memory bound problem. This paper analyzes different FM-index versions presented in the literature, focusing on those computing aspects related to the data access. As a result of the analysis, we propose a new organization of FM-index that minimizes the demand for memory bandwidth, allowing a great improvement of performance on processors with high-bandwidth memory, such as the second-generation Intel Xeon Phi (Knights Landing, or KNL), integrating ultra high-bandwidth stacked memory technology. As the roofline model shows, our implementation reaches 95 percent of the peak random access bandwidth limit when executed on the KNL and almost all of the available bandwidth when executed on other Intel Xeon architectures with conventional DDR memory. In addition, the obtained throughput in KNL is much higher than the results reported for GPUs in the literature.

Subject(s)

Genomics , Sequence Alignment , Algorithms , Computers , DNA/genetics , Genome, Human/genetics , Genomics/instrumentation , Genomics/methods , Humans , Sequence Alignment/instrumentation , Sequence Alignment/methods
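
As a rough illustration of the data structure discussed in entry 4, the sketch below builds a small FM-index and counts exact pattern matches by backward search. It is a textbook formulation, not the memory organization proposed by the authors. The names `build_fm_index` and `count_matches` are hypothetical, and the naive suffix sort and full occurrence table are assumptions that only suit toy inputs.

```python
def build_fm_index(text):
    """Build a toy FM-index (C table and occurrence table) for `text`."""
    text += "$"  # sentinel, assumed absent from the input and smaller than any symbol
    # Naive suffix array: fine for a demonstration, far too slow for genomes.
    suffix_array = sorted(range(len(text)), key=lambda i: text[i:])
    # Burrows-Wheeler transform: the character preceding each sorted suffix.
    bwt = "".join(text[i - 1] for i in suffix_array)

    alphabet = sorted(set(text))
    # C[c] = number of characters in the text strictly smaller than c.
    C, total = {}, 0
    for c in alphabet:
        C[c] = total
        total += text.count(c)

    # occ[c][i] = number of occurrences of c in bwt[:i].
    occ = {c: [0] * (len(bwt) + 1) for c in alphabet}
    for i, ch in enumerate(bwt):
        for c in alphabet:
            occ[c][i + 1] = occ[c][i] + (1 if ch == c else 0)
    return C, occ, len(bwt)


def count_matches(pattern, C, occ, n):
    """Count exact occurrences of `pattern` using backward search."""
    lo, hi = 0, n  # half-open interval of matching rows in the sorted suffixes
    for ch in reversed(pattern):
        if ch not in C:
            return 0
        # Each step reads the occurrence table at two unrelated positions.
        lo = C[ch] + occ[ch][lo]
        hi = C[ch] + occ[ch][hi]
        if lo >= hi:
            return 0
    return hi - lo
```

For example, after `C, occ, n = build_fm_index("GATTACAGATTACA")`, calling `count_matches("ATTA", C, occ, n)` counts the two occurrences of `ATTA`. The two table lookups per pattern character land at positions that are far apart in memory, which is the kind of irregular access pattern the abstract identifies as the source of cache misses.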

5.

BLASTP-ACC: Parallel Architecture and Hardware Accelerator Design for BLAST-Based Protein Sequence Alignment.

Li, Yu-Cheng; Lu, Yi-Chang.

IEEE Trans Biomed Circuits Syst ; 13(6): 1771-1782, 2019 12.

Article in English | MEDLINE | ID: mdl-31581096

ABSTRACT

In this study, we design a hardware accelerator for a widely used sequence alignment algorithm, the basic local alignment search tool for proteins (BLASTP). The architecture of the proposed accelerator consists of five stages: a new systolic-array-based one-hit finding stage, a novel RAM-REG-based two-hit finding stage, a refined ungapped extension stage, a faster gapped extension stage, and a highly efficient parallel sorter. The system is implemented on an Altera Stratix V FPGA with a processing speed of more than 500 giga cell updates per second (GCUPS). It can receive a query sequence, compare it with the sequences in the database, and generate a list sorted in descending order of the similarity scores between the query sequence and the subject sequences. Moreover, it is capable of processing both query and subject protein sequences comprising as many as 8192 amino acid residues in a single pass. Using data from the National Center for Biotechnology Information (NCBI) database, we show that a speed-up of more than 3X can be achieved with our hardware compared to the runtime required by BLASTP software on an 8-thread Intel Xeon CPU with 144 GB DRAM.

Subject(s)

Proteins/genetics , Sequence Alignment/instrumentation , Amino Acid Sequence , Databases, Factual , Equipment Design , Sequence Alignment/methods

6.

Freiburg RNA tools: a central online resource for RNA-focused research and teaching.

Raden, Martin; Ali, Syed M; Alkhnbashi, Omer S; Busch, Anke; Costa, Fabrizio; Davis, Jason A; Eggenhofer, Florian; Gelhausen, Rick; Georg, Jens; Heyne, Steffen; Hiller, Michael; Kundu, Kousik; Kleinkauf, Robert; Lott, Steffen C; Mohamed, Mostafa M; Mattheis, Alexander; Miladi, Milad; Richter, Andreas S; Will, Sebastian; Wolff, Joachim; Wright, Patrick R; Backofen, Rolf.

Nucleic Acids Res ; 46(W1): W25-W29, 2018 07 02.

Article in English | MEDLINE | ID: mdl-29788132

ABSTRACT

The Freiburg RNA tools webserver is a well established online resource for RNA-focused research. It provides a unified user interface and comprehensive result visualization for efficient command line tools. The webserver includes RNA-RNA interaction prediction (IntaRNA, CopraRNA, metaMIR), sRNA homology search (GLASSgo), sequence-structure alignments (LocARNA, MARNA, CARNA, ExpaRNA), CRISPR repeat classification (CRISPRmap), sequence design (antaRNA, INFO-RNA, SECISDesign), structure aberration evaluation of point mutations (RaSE), and RNA/protein-family models visualization (CMV), and other methods. Open education resources offer interactive visualizations of RNA structure and RNA-RNA interaction prediction as well as basic and advanced sequence alignment algorithms. The services are freely available at http://rna.informatik.uni-freiburg.de.

Subject(s)

Base Sequence/genetics , Internet , RNA/genetics , Software , Algorithms , Nucleic Acid Conformation , RNA/chemistry , Sequence Alignment/instrumentation , Sequence Analysis, RNA/instrumentation , Structure-Activity Relationship

7.

Rapid protein alignment in the cloud: HAMOND combines fast DIAMOND alignments with Hadoop parallelism.

Yu, Jia; Blom, Jochen; Sczyrba, Alexander; Goesmann, Alexander.

J Biotechnol ; 257: 58-60, 2017 Sep 10.

Article in English | MEDLINE | ID: mdl-28232083

ABSTRACT

The introduction of next generation sequencing has caused a steady increase in the amounts of data that have to be processed in modern life science. Sequence alignment plays a key role in the analysis of sequencing data e.g. within whole genome sequencing or metagenome projects. BLAST is a commonly used alignment tool that was the standard approach for more than two decades, but in the last years faster alternatives have been proposed including RapSearch, GHOSTX, and DIAMOND. Here we introduce HAMOND, an application that uses Apache Hadoop to parallelize DIAMOND computation in order to scale-out the calculation of alignments. HAMOND is fault tolerant and scalable by utilizing large cloud computing infrastructures like Amazon Web Services. HAMOND has been tested in comparative genomics analyses and showed promising results both in efficiency and accuracy.

Subject(s)

Cloud Computing , High-Throughput Nucleotide Sequencing/methods , Sequence Alignment/methods , Sequence Analysis, DNA/methods , Comparative Genomic Hybridization , Computational Biology , Genomics/methods , High-Throughput Nucleotide Sequencing/instrumentation , Internet , Metagenome , Molecular Sequence Data , Sequence Alignment/instrumentation , Sequence Analysis, DNA/instrumentation , Software , Whole Genome Sequencing

8.

TRIC: an automated alignment strategy for reproducible protein quantification in targeted proteomics.

Röst, Hannes L; Liu, Yansheng; D'Agostino, Giuseppe; Zanella, Matteo; Navarro, Pedro; Rosenberger, George; Collins, Ben C; Gillet, Ludovic; Testa, Giuseppe; Malmström, Lars; Aebersold, Ruedi.

Nat Methods ; 13(9): 777-83, 2016 09.

Article in English | MEDLINE | ID: mdl-27479329

ABSTRACT

Next-generation mass spectrometric (MS) techniques such as SWATH-MS have substantially increased the throughput and reproducibility of proteomic analysis, but ensuring consistent quantification of thousands of peptide analytes across multiple liquid chromatography-tandem MS (LC-MS/MS) runs remains a challenging and laborious manual process. To produce highly consistent and quantitatively accurate proteomics data matrices in an automated fashion, we developed TRIC (http://proteomics.ethz.ch/tric/), a software tool that utilizes fragment-ion data to perform cross-run alignment, consistent peak-picking and quantification for high-throughput targeted proteomics. TRIC reduced the identification error compared to a state-of-the-art SWATH-MS analysis without alignment by more than threefold at constant recall while correcting for highly nonlinear chromatographic effects. On a pulsed-SILAC experiment performed on human induced pluripotent stem cells, TRIC was able to automatically align and quantify thousands of light and heavy isotopic peak groups. Thus, TRIC fills a gap in the pipeline for automated analysis of massively parallel targeted proteomics data sets.

Subject(s)

Electronic Data Processing/methods , Peptides/analysis , Proteomics/methods , Sequence Alignment/methods , Sequence Analysis, Protein/methods , Software , Algorithms , Electronic Data Processing/instrumentation , Humans , Mass Spectrometry , Peptides/metabolism , Pluripotent Stem Cells/metabolism , Protein Precursors/analysis , Protein Precursors/metabolism , Proteolysis , Proteomics/instrumentation , Reproducibility of Results , Sequence Alignment/instrumentation , Sequence Analysis, Protein/instrumentation , Streptococcus pyogenes/metabolism

9.

B-MIC: An Ultrafast Three-Level Parallel Sequence Aligner Using MIC.

Cui, Yingbo; Liao, Xiangke; Zhu, Xiaoqian; Wang, Bingqiang; Peng, Shaoliang.

Interdiscip Sci ; 8(1): 28-34, 2016 Mar.

Article in English | MEDLINE | ID: mdl-26358141

ABSTRACT

Sequence alignment is the central process of sequence analysis, in which raw sequencing data are mapped to a reference genome. The large amount of data generated by NGS is far beyond the processing capabilities of existing alignment tools. Consequently, sequence alignment becomes the bottleneck of sequence analysis. Intensive computing power is required to address this challenge. Intel recently announced the MIC coprocessor, which can provide massive computing power. Tianhe-2, currently the world's fastest supercomputer, is equipped with three MIC coprocessors per compute node. A key feature of sequence alignment is that different reads are independent. Considering this property, we proposed a MIC-oriented three-level parallelization strategy to speed up BWA, a widely used sequence alignment tool, and developed our ultrafast parallel sequence aligner: B-MIC. B-MIC contains three levels of parallelization: firstly, parallelization of data IO and read alignment by a three-stage parallel pipeline; secondly, parallelization enabled by MIC coprocessor technology; thirdly, inter-node parallelization implemented by MPI. In this paper, we demonstrate that B-MIC outperforms BWA through a combination of those techniques, using an Inspur NF5280M server and the Tianhe-2 supercomputer. To the best of our knowledge, B-MIC is the first sequence alignment tool to run on Intel MIC, and it can achieve more than fivefold speedup over the original BWA while maintaining the alignment precision.

Subject(s)

Computers , Sequence Alignment/instrumentation , Sequence Analysis, DNA/instrumentation , Software , Algorithms

10.

NGS-QCbox and Raspberry for Parallel, Automated and Rapid Quality Control Analysis of Large-Scale Next Generation Sequencing (Illumina) Data.

Katta, Mohan A V S K; Khan, Aamir W; Doddamani, Dadakhalandar; Thudi, Mahendar; Varshney, Rajeev K.

PLoS One ; 10(10): e0139868, 2015.

Article in English | MEDLINE | ID: mdl-26460497

ABSTRACT

Rapid popularity and adaptation of next generation sequencing (NGS) approaches have generated huge volumes of data. High throughput platforms like Illumina HiSeq produce terabytes of raw data that requires quick processing. Quality control of the data is an important component prior to the downstream analyses. To address these issues, we have developed a quality control pipeline, NGS-QCbox, that scales up to process hundreds or thousands of samples. Raspberry is an in-house tool, developed in the C language utilizing HTSlib (v1.2.1) (http://htslib.org), for computing read/base level statistics. It can be used as a stand-alone application and can process both compressed and uncompressed FASTQ format files. NGS-QCbox integrates Raspberry with other open-source tools for alignment (Bowtie2), SNP calling (SAMtools) and other utilities (bedtools) towards analyzing raw NGS data at higher efficiency and in a high-throughput manner. The pipeline implements batch processing of jobs in parallel using Bpipe (https://github.com/ssadedin/bpipe) and, internally, a fine-grained task parallelization utilizing OpenMP. It reports read and base statistics along with genome coverage and variants in a user-friendly format. The pipeline presents a simple menu-driven interface and can be used in either quick or complete mode. In addition, the pipeline in quick mode is faster than other similar existing QC pipelines/tools. The NGS-QCbox pipeline, Raspberry tool and associated scripts are made available at https://github.com/CEG-ICRISAT/NGS-QCbox and https://github.com/CEG-ICRISAT/Raspberry for rapid quality control analysis of large-scale next generation sequencing (Illumina) data.

Subject(s)

High-Throughput Nucleotide Sequencing/instrumentation , Internet , Sequence Alignment/instrumentation , Sequence Alignment/methods , Software , High-Throughput Nucleotide Sequencing/methods

11.

Concurrent and Accurate Short Read Mapping on Multicore Processors.

Martínez, Héctor; Tárraga, Joaquín; Medina, Ignacio; Barrachina, Sergio; Castillo, Maribel; Dopazo, Joaquín; Quintana-Ortí, Enrique S.

IEEE/ACM Trans Comput Biol Bioinform ; 12(5): 995-1007, 2015.

Article in English | MEDLINE | ID: mdl-26451814

ABSTRACT

We introduce a parallel aligner with a work-flow organization for fast and accurate mapping of RNA sequences on servers equipped with multicore processors. Our software, HPG Aligner SA (an open-source application available at http://www.opencb.org), exploits a suffix array to rapidly map a large fraction of the RNA fragments (reads), and leverages the accuracy of the Smith-Waterman algorithm to deal with conflictive reads. The aligner is enhanced with a careful strategy to detect splice junctions based on an adaptive division of RNA reads into small segments (or seeds), which are then mapped onto a number of candidate alignment locations, providing crucial information for the successful alignment of the complete reads. The experimental results on a platform with Intel multicore technology report the parallel performance of HPG Aligner SA on RNA reads of 100-400 nucleotides, which surpasses state-of-the-art aligners such as TopHat 2+Bowtie 2, MapSplice, and STAR in execution time/sensitivity.

Subject(s)

Chromosome Mapping/instrumentation , High-Throughput Nucleotide Sequencing/instrumentation , RNA/genetics , Sequence Analysis, RNA/instrumentation , Signal Processing, Computer-Assisted/instrumentation , Software , Base Sequence , Chromosome Mapping/methods , Equipment Design , Equipment Failure Analysis , High-Throughput Nucleotide Sequencing/methods , Molecular Sequence Data , Reproducibility of Results , Sensitivity and Specificity , Sequence Alignment/instrumentation , Sequence Alignment/methods , Sequence Analysis, RNA/methods

12.

Efficient and Accurate OTU Clustering with GPU-Based Sequence Alignment and Dynamic Dendrogram Cutting.

Nguyen, Thuy-Diem; Schmidt, Bertil; Zheng, Zejun; Kwoh, Chee-Keong.

IEEE/ACM Trans Comput Biol Bioinform ; 12(5): 1060-73, 2015.

Article in English | MEDLINE | ID: mdl-26451819

ABSTRACT

De novo clustering is a popular technique to perform taxonomic profiling of a microbial community by grouping 16S rRNA amplicon reads into operational taxonomic units (OTUs). In this work, we introduce a new dendrogram-based OTU clustering pipeline called CRiSPy. The key idea used in CRiSPy to improve clustering accuracy is the application of an anomaly detection technique to obtain a dynamic distance cutoff instead of using the de facto value of 97 percent sequence similarity as in most existing OTU clustering pipelines. This technique works by detecting an abrupt change in the merging heights of a dendrogram. To produce the output dendrograms, CRiSPy employs the OTU hierarchical clustering approach that is computed on a genetic distance matrix derived from an all-against-all read comparison by pairwise sequence alignment. However, most existing dendrogram-based tools have difficulty processing datasets larger than 10,000 unique reads due to high computational complexity. We address this difficulty by developing two efficient algorithms for CRiSPy: a compute-efficient GPU-accelerated parallel algorithm for pairwise distance matrix computation and a memory-efficient hierarchical clustering algorithm. Our experiments on various datasets with distinct attributes show that CRiSPy is able to produce more accurate OTU groupings than most OTU clustering applications.

Subject(s)

Algorithms , Computer Graphics/instrumentation , High-Throughput Nucleotide Sequencing/instrumentation , RNA, Bacterial/genetics , RNA, Ribosomal, 16S/genetics , Sequence Alignment/instrumentation , Base Sequence , Equipment Design , Equipment Failure Analysis , High-Throughput Nucleotide Sequencing/methods , Molecular Sequence Data , Pattern Recognition, Automated/methods , Sequence Alignment/methods , Signal Processing, Computer-Assisted/instrumentation
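
The abstract of entry 12 describes the dynamic cutoff only as the detection of an abrupt change in the merging heights of a dendrogram. The sketch below uses the largest gap between consecutive sorted heights as a stand-in for that anomaly detection step. This choice is an assumption made for illustration and should not be read as CRiSPy's actual method; the function name `dynamic_cutoff` is hypothetical.

```python
def dynamic_cutoff(merge_heights):
    """Return a distance cutoff placed inside the largest jump in merge heights.

    Assumption: the 'abrupt change' is approximated by the biggest gap between
    consecutive heights; a real pipeline may use a different detector.
    """
    heights = sorted(merge_heights)
    if len(heights) < 2:
        raise ValueError("need at least two merge heights")
    gaps = [b - a for a, b in zip(heights, heights[1:])]
    # Index of the largest gap (the first one if several are tied).
    k = max(range(len(gaps)), key=gaps.__getitem__)
    # Cut halfway between the two heights that bracket the jump.
    return (heights[k] + heights[k + 1]) / 2
```

With merge heights such as `[0.01, 0.02, 0.03, 0.04, 0.20, 0.25]`, the cutoff falls between 0.04 and 0.20, so the four low merges are kept and the two high ones are cut. A fixed 97 percent similarity threshold would instead correspond to a constant distance of 0.03 regardless of the shape of the dendrogram.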

13.

Accelerating the Next Generation Long Read Mapping with the FPGA-Based System.

Chen, Peng; Wang, Chao; Li, Xi; Zhou, Xuehai.

IEEE/ACM Trans Comput Biol Bioinform ; 11(5): 840-52, 2014.

Article in English | MEDLINE | ID: mdl-26356857

ABSTRACT

Comparing newly determined sequences against the subject sequences stored in databases is a critical task in bioinformatics. Fortunately, a recent survey reports that state-of-the-art aligners are already fast enough to handle the very large volume of short sequence reads in reasonable time. However, aligning the long sequence reads (>400 bp) generated by next generation sequencing (NGS) technology is still quite inefficient with present aligners. Furthermore, the challenge grows more serious as both the lengths and the number of sequence reads keep increasing with improvements in sequencing technology. Thus, researchers urgently need to enhance the performance of long read alignment. In this paper, we propose a novel FPGA-based system to improve the efficiency of long read mapping. Compared to the state-of-the-art long read aligner BWA-SW, our accelerating platform achieves high performance with almost the same sensitivity. Experiments demonstrate that, for reads with lengths ranging from 512 up to 4,096 base pairs, the described system obtains a 10x-48x speedup for the bottleneck of the software. For the whole mapping procedure, the FPGA-based platform achieves a 1.8x-3.3x speedup versus the BWA-SW aligner, reducing the alignment cycles from weeks to days.

Subject(s)

Genomics/methods , High-Throughput Nucleotide Sequencing/methods , Sequence Alignment/methods , Sequence Analysis, DNA/methods , Algorithms , Equipment Design , Genomics/instrumentation , High-Throughput Nucleotide Sequencing/instrumentation , Humans , Sequence Alignment/instrumentation , Sequence Analysis, DNA/instrumentation

14.

Fragment merger: an online tool to merge overlapping long sequence fragments.

Bell, Trevor G; Kramvis, Anna.

Viruses ; 5(3): 824-33, 2013 Mar 12.

Article in English | MEDLINE | ID: mdl-23482300

ABSTRACT

While PCR amplicons extend to a few thousand bases, the length of sequences from direct Sanger sequencing is limited to 500-800 nucleotides. Therefore, several fragments may be required to cover an amplicon, a gene or an entire genome. These fragments are typically sequenced in an overlapping fashion and assembled by manually sliding and aligning the sequences visually. This is time-consuming, repetitive and error-prone, and further complicated by circular genomes. An online tool merging two to twelve long overlapping sequence fragments was developed. Either chromatograms or FASTA files are submitted to the tool, which trims poor quality ends of chromatograms according to user-specified parameters. Fragments are assembled into a single sequence by repeatedly calling the EMBOSS merger tool in a consecutive manner. Output includes the number of trimmed nucleotides, details of each merge, and an optional alignment to a reference sequence. The final merge sequence is displayed and can be downloaded in FASTA format. All output files can be downloaded as a ZIP archive. This tool allows for easy and automated assembly of overlapping sequences and is aimed at researchers without specialist computer skills. The tool is genome- and organism-agnostic and has been developed using hepatitis B virus sequence data.

Subject(s)

Hepatitis B virus/genetics , Online Systems/instrumentation , Sequence Alignment/instrumentation , Sequence Alignment/methods , Sequence Analysis, DNA/methods , Databases, Nucleic Acid , Hepatitis B virus/chemistry , Internet , Sequence Analysis, DNA/instrumentation , Software
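
The consecutive merging described in entry 14 can be sketched as follows. The tool itself delegates each merge to the EMBOSS merger program, which is not reproduced here. This simplified version joins fragments on an exact suffix-prefix overlap, a swapped-in technique that tolerates no sequencing errors. The names `merge_pair` and `merge_fragments` and the `min_overlap` parameter are hypothetical.

```python
def merge_pair(left, right, min_overlap=3):
    """Join two fragments on their longest exact suffix-prefix overlap."""
    if min_overlap < 1:
        raise ValueError("min_overlap must be at least 1")
    longest = min(len(left), len(right))
    # Try the longest candidate overlap first and shrink until one matches.
    for k in range(longest, min_overlap - 1, -1):
        if left[-k:] == right[:k]:
            return left + right[k:]
    raise ValueError("fragments do not overlap sufficiently")


def merge_fragments(fragments, min_overlap=3):
    """Merge ordered fragments one after another, as in a consecutive merge."""
    if not fragments:
        raise ValueError("no fragments given")
    merged = fragments[0]
    for fragment in fragments[1:]:
        merged = merge_pair(merged, fragment, min_overlap)
    return merged
```

Calling `merge_fragments(["ACGTACGG", "ACGGTTCA", "TTCAGG"])` folds the three fragments into one sequence, each step reusing the result of the previous merge. An alignment-based merge would additionally resolve mismatches inside the overlap, which this exact-match sketch cannot do.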

15.

HAL: a hierarchical format for storing and analyzing multiple genome alignments.

Hickey, Glenn; Paten, Benedict; Earl, Dent; Zerbino, Daniel; Haussler, David.

Bioinformatics ; 29(10): 1341-2, 2013 May 15.

Article in English | MEDLINE | ID: mdl-23505295

ABSTRACT

MOTIVATION: Large multiple genome alignments and inferred ancestral genomes are ideal resources for comparative studies of molecular evolution, and advances in sequencing and computing technology are making them increasingly obtainable. These structures can provide a rich understanding of the genetic relationships between all subsets of species they contain. Current formats for storing genomic alignments, such as XMFA and MAF, are all indexed or ordered using a single reference genome, however, which limits the information that can be queried with respect to other species and clades. This loss of information grows with the number of species under comparison, as well as their phylogenetic distance. RESULTS: We present HAL, a compressed, graph-based hierarchical alignment format for storing multiple genome alignments and ancestral reconstructions. HAL graphs are indexed on all genomes they contain. Furthermore, they are organized phylogenetically, which allows for modular and parallel access to arbitrary subclades without fragmentation because of rearrangements that have occurred in other lineages. HAL graphs can be created or read with a comprehensive C++ API. A set of tools is also provided to perform basic operations, such as importing and exporting data, identifying mutations and coordinate mapping (liftover). AVAILABILITY: All documentation and source code for the HAL API and tools are freely available at http://github.com/glennhickey/hal. CONTACT: hickey@soe.ucsc.edu or haussler@soe.ucsc.edu SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.

Subject(s)

Genome , Sequence Alignment/methods , Software , Animals , Base Sequence , Evolution, Molecular , Genomics/methods , Humans , Phylogeny , Programming Languages , Sequence Alignment/instrumentation

16.

QuShape: rapid, accurate, and best-practices quantification of nucleic acid probing information, resolved by capillary electrophoresis.

Karabiber, Fethullah; McGinnis, Jennifer L; Favorov, Oleg V; Weeks, Kevin M.

RNA ; 19(1): 63-73, 2013 Jan.

Article in English | MEDLINE | ID: mdl-23188808

ABSTRACT

Chemical probing of RNA and DNA structure is a widely used and highly informative approach for examining nucleic acid structure and for evaluating interactions with protein and small-molecule ligands. Use of capillary electrophoresis to analyze chemical probing experiments yields hundreds of nucleotides of information per experiment and can be performed on automated instruments. Extraction of the information from capillary electrophoresis electropherograms is a computationally intensive multistep analytical process, and no current software provides rapid, automated, and accurate data analysis. To overcome this bottleneck, we developed a platform-independent, user-friendly software package, QuShape, that yields quantitatively accurate nucleotide reactivity information with minimal user supervision. QuShape incorporates newly developed algorithms for signal decay correction, alignment of time-varying signals within and across capillaries and relative to the RNA nucleotide sequence, and signal scaling across channels or experiments. An analysis-by-reference option enables multiple, related experiments to be fully analyzed in minutes. We illustrate the usefulness and robustness of QuShape by analysis of RNA SHAPE (selective 2'-hydroxyl acylation analyzed by primer extension) experiments.

Subject(s)

Electrophoresis, Capillary/methods , Nucleic Acid Probes/analysis , Software , Algorithms , DNA, Bacterial/analysis , DNA, Viral/analysis , Electrophoresis, Capillary/instrumentation , Escherichia coli/genetics , Humans , RNA, Bacterial/analysis , RNA, Viral/analysis , Sequence Alignment/instrumentation , Sequence Alignment/methods

17.

PAirwise Sequence Comparison (PASC) and its application in the classification of filoviruses.

Bao, Yiming; Chetvernin, Vyacheslav; Tatusova, Tatiana.

Viruses ; 4(8): 1318-27, 2012 08.

Article in English | MEDLINE | ID: mdl-23012628

ABSTRACT

PAirwise Sequence Comparison (PASC) is a tool that uses genome sequence similarity to help with virus classification. The PASC tool at NCBI uses two methods: local alignment based on BLAST and global alignment based on the Needleman-Wunsch algorithm. It works for complete genomes of viruses of several families/groups, and for the family Filoviridae it currently includes 52 complete genomes available in GenBank. It has been shown that the BLAST-based alignment approach works better for filoviruses, and it is therefore recommended for establishing taxon demarcation criteria. When more genome sequences with high divergence become available, these demarcations will most likely become more precise. The tool can compare new genome sequences of filoviruses with the ones already in the database, and propose their taxonomic classification.

Subject(s)

Filoviridae/classification , Sequence Alignment/methods , Base Sequence , Databases, Nucleic Acid , Filoviridae/chemistry , Filoviridae/genetics , Molecular Sequence Data , Sequence Alignment/instrumentation , Sequence Analysis, DNA , Software
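
Entry 17 names the Needleman-Wunsch algorithm as the basis of PASC's global alignment method. The sketch below computes a global alignment score with that algorithm using a linear gap penalty. The scoring values (`match=1`, `mismatch=-1`, `gap=-1`) are assumptions chosen for illustration, not the parameters used by PASC, and the function name `needleman_wunsch_score` is hypothetical.

```python
def needleman_wunsch_score(a, b, match=1, mismatch=-1, gap=-1):
    """Return the optimal global alignment score of sequences a and b."""
    # Row 0: aligning the empty prefix of `a` against b[:j] costs j gaps.
    prev = [j * gap for j in range(len(b) + 1)]
    for i, ca in enumerate(a, start=1):
        # Column 0: aligning a[:i] against the empty prefix of `b`.
        curr = [i * gap]
        for j, cb in enumerate(b, start=1):
            diagonal = prev[j - 1] + (match if ca == cb else mismatch)
            up = prev[j] + gap        # gap inserted in b
            left = curr[j - 1] + gap  # gap inserted in a
            curr.append(max(diagonal, up, left))
        prev = curr
    return prev[-1]
```

Only two rows of the dynamic programming matrix are kept, which is enough for the score but not for recovering the alignment itself. Unlike a local method such as BLAST, this score is charged for every unaligned position at either end of the sequences.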

18.

SIMPLEX: cloud-enabled pipeline for the comprehensive analysis of exome sequencing data.

Fischer, Maria; Snajder, Rene; Pabinger, Stephan; Dander, Andreas; Schossig, Anna; Zschocke, Johannes; Trajanoski, Zlatko; Stocker, Gernot.

PLoS One ; 7(8): e41948, 2012.

Article in English | MEDLINE | ID: mdl-22870267

ABSTRACT

In recent studies, exome sequencing has proven to be a successful screening tool for the identification of candidate genes causing rare genetic diseases. Although underlying targeted sequencing methods are well established, necessary data handling and focused, structured analysis still remain demanding tasks. Here, we present a cloud-enabled autonomous analysis pipeline, which comprises the complete exome analysis workflow. The pipeline combines several in-house developed and published applications to perform the following steps: (a) initial quality control, (b) intelligent data filtering and pre-processing, (c) sequence alignment to a reference genome, (d) SNP and DIP detection, (e) functional annotation of variants using different approaches, and (f) detailed report generation during various stages of the workflow. The pipeline connects the selected analysis steps, exposes all available parameters for customized usage, performs required data handling, and distributes computationally expensive tasks either on a dedicated high-performance computing infrastructure or on the Amazon cloud environment (EC2). The presented application has already been used in several research projects including studies to elucidate the role of rare genetic diseases. The pipeline is continuously tested and is publicly available under the GPL as a VirtualBox or Cloud image at http://simplex.i-med.ac.at; additional supplementary data is provided at http://www.icbi.at/exome.

Subject(s)

Exome , Internet , Single Nucleotide Polymorphism , Sequence Alignment/methods , DNA Sequence Analysis/methods , Software , Sequence Alignment/instrumentation , DNA Sequence Analysis/instrumentation

19.

Accelerating pairwise statistical significance estimation for local alignment by harvesting GPU's power.

Zhang, Yuhong; Misra, Sanchit; Agrawal, Ankit; Patwary, Md Mostofa Ali; Liao, Wei-Keng; Qin, Zhiguang; Choudhary, Alok.

BMC Bioinformatics ; 13 Suppl 5: S3, 2012 Apr 12.

Article in English | MEDLINE | ID: mdl-22537007

ABSTRACT

BACKGROUND: Pairwise statistical significance has been recognized as able to accurately identify related sequences, which is a cornerstone procedure in numerous bioinformatics applications. However, it is both computationally and data intensive, which poses a major challenge in terms of performance and scalability. RESULTS: We present a GPU implementation to accelerate pairwise statistical significance estimation of local sequence alignment using standard substitution matrices. By carefully studying the algorithm's data access characteristics, we developed a tile-based scheme that produces contiguous data access in the GPU global memory and sustains a large number of threads to achieve high GPU occupancy. We further extend the parallelization technique to estimate pairwise statistical significance using position-specific substitution matrices, which have earlier demonstrated significantly better sequence comparison accuracy than standard substitution matrices. The implementation is also extended to take advantage of dual GPUs. We observe end-to-end speedups of nearly 250× with a single Tesla C2050 GPU (370× with dual Tesla C2050 GPUs) over the CPU implementation running on an Intel Core i7 920 processor. CONCLUSIONS: Harvesting the high performance of modern GPUs is a promising approach to accelerate pairwise statistical significance estimation for local sequence alignment.

Subject(s)

Computer Graphics/instrumentation , Sequence Alignment/methods , Protein Sequence Analysis/methods , Algorithms , Sequence Alignment/instrumentation , Protein Sequence Analysis/instrumentation , Software
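
The abstract above does not spell out the estimator, so the following CPU-only sketch is merely one way to see why significance estimation for a single sequence pair is expensive. It assumes a simple shuffle-and-realign scheme that reports an empirical p-value; the function names, the match/mismatch/gap values (used in place of a real substitution matrix), and the number of shuffles are all hypothetical, and nothing here reproduces the paper's GPU tiling.

```python
import random


def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-2):
    """Best local alignment score with a linear gap cost (illustrative scoring)."""
    prev = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # Local alignment: scores never drop below zero.
            curr[j] = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best


def empirical_p_value(a, b, n_shuffles=200, seed=0):
    """Fraction of shuffled versions of b that score at least as well as b itself.

    Every shuffle costs one full alignment, which is what makes the
    estimate computationally heavy for long sequences.
    """
    rng = random.Random(seed)
    observed = smith_waterman_score(a, b)
    letters = list(b)
    hits = 0
    for _ in range(n_shuffles):
        rng.shuffle(letters)
        if smith_waterman_score(a, "".join(letters)) >= observed:
            hits += 1
    # Add-one correction keeps the estimate strictly positive.
    return (hits + 1) / (n_shuffles + 1)
```

With `n_shuffles=200`, each pair costs 201 dynamic-programming passes of size `len(a) * len(b)`, and these passes are independent of one another, which is the kind of workload that parallel hardware can exploit.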

20.

Protein databases on the Internet.

Xu, Dong.

Curr Protoc Mol Biol ; Chapter 19: Unit 19.4., 2012 Jan.

Article in English | MEDLINE | ID: mdl-22237858

ABSTRACT

Protein databases have become a crucial part of modern biology. Huge amounts of data for protein structures, functions, and particularly sequences are being generated. Searching databases is often the first step in the study of a new protein. Comparison between proteins or between protein families provides information about the relationship between proteins within a genome or across different species, and hence offers much more information than can be obtained by studying only an isolated protein. In addition, secondary databases derived from experimental databases are also widely available. These databases reorganize and annotate the data or provide predictions. The use of multiple databases often helps researchers understand the structure and function of a protein. Although some protein databases are widely known, they are far from being fully utilized in the protein science community. This unit provides a starting point for readers to explore the potential of protein databases on the Internet.

Subject(s)

Computational Biology/methods , Protein Databases , Proteins/chemistry , Sequence Alignment/methods , Amino Acid Sequence , Animals , Computational Biology/instrumentation , Humans , Internet , Proteins/genetics , Sequence Alignment/instrumentation
