
1.

Fast and Accurate Multiple Sequence Alignment with MSAProbs-MPI.

González-Domínguez, Jorge.

Methods Mol Biol ; 2231: 39-47, 2021.

Article En | MEDLINE | ID: mdl-33289885

Multiple sequence alignment (MSA) is a central step in many bioinformatics and computational biology analyses. Although there exist many methods to perform MSA, most of them fail when dealing with large datasets due to their high computational cost. MSAProbs-MPI is a publicly available tool ( http://msaprobs.sourceforge.net ) that provides highly accurate results in relatively short runtime thanks to exploiting the hardware resources of multicore clusters. In this chapter, I explain the statistical and biological concepts employed in MSAProbs-MPI to complete the alignments, as well as the high-performance computing techniques used to accelerate it. Moreover, I provide some hints about the configuration parameters that should be used to guarantee high-performance executions.

Computational Biology/methods , Sequence Alignment/methods , Software , Algorithms , Computational Biology/instrumentation , Computing Methodologies , Sequence Alignment/instrumentation

2.

Multiple Sequence Alignment Computation Using the T-Coffee Regressive Algorithm Implementation.

Garriga, Edgar; Di Tommaso, Paolo; Magis, Cedrik; Erb, Ionas; Mansouri, Leila; Baltzis, Athanasios; Floden, Evan; Notredame, Cedric.

Methods Mol Biol ; 2231: 89-97, 2021.

Article En | MEDLINE | ID: mdl-33289888

Many fields of biology rely on the inference of accurate multiple sequence alignments (MSA) of biological sequences. Unfortunately, the problem of assembling an MSA is NP-complete, thus limiting computation to approximate solutions using heuristics. The progressive algorithm is one of the most popular frameworks for the computation of MSAs. It involves pre-clustering the sequences and aligning them starting with the most similar ones. The scalability of this framework is limited, especially with respect to accuracy. We present here an alternative approach named the regressive algorithm. In this framework, sequences are first clustered and then aligned starting with the most distantly related ones. This approach has been shown to greatly improve accuracy during scale-up, especially on datasets featuring 10,000 sequences or more. Another benefit is the possibility to integrate third-party clustering methods and third-party MSA aligners. The regressive algorithm has been tested on up to 1.5 million sequences; its implementation is available in the T-Coffee package.

Computational Biology/methods , Sequence Alignment/methods , Software , Algorithms , Cluster Analysis , Computational Biology/instrumentation , Sequence Alignment/instrumentation

3.

Mustguseal and Sister Web-Methods: A Practical Guide to Bioinformatic Analysis of Protein Superfamilies.

Suplatov, Dmitry; Sharapova, Yana; Svedas, Vytas.

Methods Mol Biol ; 2231: 179-200, 2021.

Article En | MEDLINE | ID: mdl-33289894

Bioinformatic analysis of functionally diverse superfamilies can help to study the structure-function relationship in proteins, but represents a methodological challenge. The Mustguseal web-server can build large structure-guided sequence alignments of thousands of homologs that cover all currently available sequence variants within a common structural fold. The input to the method is a PDB code of the query protein, which represents the protein superfamily of interest. The collection and subsequent alignment of protein sequences and structures is fully automated and driven by the particular choice of parameters. Four integrated sister web-methods (Zebra, pocketZebra, visualCMAT, and Yosshi) are available to further analyze the resulting superimposition and identify conserved, subfamily-specific, and co-evolving residues, as well as to classify and study disulfide bonds in protein superfamilies. The integration of these web-based bioinformatic tools provides an out-of-the-box easy-to-use solution, first of its kind, to study protein function and regulation and design improved enzyme variants for practical applications and selective ligands to modulate their functional properties. In this chapter, we provide a step-by-step protocol for a comprehensive bioinformatic analysis of a protein superfamily using a web-browser as the main tool and notes on selecting the appropriate values for the key algorithm parameters depending on your research objective. The web-servers are freely available to all users at https://biokinet.belozersky.msu.ru/m-platform with no login requirement.

Computational Biology/methods , Proteins/chemistry , Sequence Alignment/methods , Software , Algorithms , Amino Acid Sequence , Computational Biology/instrumentation , Disulfides/chemistry , Internet , Ligands , Protein Structure, Tertiary , Sequence Alignment/instrumentation

4.

Accelerating Sequence Alignments Based on FM-Index Using the Intel KNL Processor.

Herruzo, Jose M; Gonzalez-Navarro, Sonia; Ibanez-Marin, Pablo; Vinals-Yufera, Victor; Alastruey-Benede, Jesus; Plata, Oscar.

IEEE/ACM Trans Comput Biol Bioinform ; 17(4): 1093-1104, 2020.

Article En | MEDLINE | ID: mdl-30530369

FM-index is a compact data structure suitable for fast matches of short reads to large reference genomes. The matching algorithm using this index exhibits irregular memory access patterns that cause frequent cache misses, resulting in a memory bound problem. This paper analyzes different FM-index versions presented in the literature, focusing on those computing aspects related to the data access. As a result of the analysis, we propose a new organization of FM-index that minimizes the demand for memory bandwidth, allowing a great improvement of performance on processors with high-bandwidth memory, such as the second-generation Intel Xeon Phi (Knights Landing, or KNL), integrating ultra high-bandwidth stacked memory technology. As the roofline model shows, our implementation reaches 95 percent of the peak random access bandwidth limit when executed on the KNL and almost all of the available bandwidth when executed on other Intel Xeon architectures with conventional DDR memory. In addition, the obtained throughput in KNL is much higher than the results reported for GPUs in the literature.
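As a rough illustration of the matching step the abstract refers to, the sketch below builds a toy FM-index and counts pattern occurrences by backward search. It is a minimal textbook version and not the memory layout proposed in the paper: the function names `build_fm_index` and `count_occurrences` are hypothetical, the suffix array is built naively, and the occurrence table is stored uncompressed.

```python
from collections import Counter


def build_fm_index(text):
    """Build a toy FM-index (C table and full Occ table) for `text`.

    Assumes `text` does not contain the sentinel character '$'.
    """
    text = text + "$"  # sentinel, sorts before A/C/G/T and lowercase letters
    # Naive suffix array: fine for a demo, far too slow for a genome.
    sa = sorted(range(len(text)), key=lambda i: text[i:])
    # Burrows-Wheeler transform: the character preceding each sorted suffix.
    bwt = "".join(text[i - 1] for i in sa)
    # c_table[ch]: number of characters in the text strictly smaller than ch.
    counts = Counter(text)
    c_table, total = {}, 0
    for ch in sorted(counts):
        c_table[ch] = total
        total += counts[ch]
    # occ[ch][i]: number of occurrences of ch in bwt[:i].
    occ = {ch: [0] * (len(bwt) + 1) for ch in counts}
    for i, b in enumerate(bwt):
        for ch in counts:
            occ[ch][i + 1] = occ[ch][i] + (1 if ch == b else 0)
    return c_table, occ, len(bwt)


def count_occurrences(index, pattern):
    """Count occurrences of `pattern` by backward search."""
    c_table, occ, n = index
    lo, hi = 0, n  # current suffix-array interval [lo, hi)
    for ch in reversed(pattern):
        if ch not in c_table:
            return 0
        # Each step reads Occ at two unrelated positions; on a large index
        # these lookups are the irregular memory accesses noted above.
        lo = c_table[ch] + occ[ch][lo]
        hi = c_table[ch] + occ[ch][hi]
        if lo >= hi:
            return 0
    return hi - lo
```

For example, `count_occurrences(build_fm_index("banana"), "ana")` counts the two overlapping occurrences of `ana`. Real implementations sample or compress the Occ table, which is where layout decisions such as the one studied in the paper come in.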

Genomics , Sequence Alignment , Algorithms , Computers , DNA/genetics , Genome, Human/genetics , Genomics/instrumentation , Genomics/methods , Humans , Sequence Alignment/instrumentation , Sequence Alignment/methods

5.

BLASTP-ACC: Parallel Architecture and Hardware Accelerator Design for BLAST-Based Protein Sequence Alignment.

Li, Yu-Cheng; Lu, Yi-Chang.

IEEE Trans Biomed Circuits Syst ; 13(6): 1771-1782, 2019 12.

Article En | MEDLINE | ID: mdl-31581096

In this study, we design a hardware accelerator for a widely used sequence alignment algorithm, the basic local alignment search tool for proteins (BLASTP). The architecture of the proposed accelerator consists of five stages: a new systolic-array-based one-hit finding stage, a novel RAM-REG-based two-hit finding stage, a refined ungapped extension stage, a faster gapped extension stage, and a highly efficient parallel sorter. The system is implemented on an Altera Stratix V FPGA with a processing speed of more than 500 giga cell updates per second (GCUPS). It can receive a query sequence, compare it with the sequences in the database, and generate a list sorted in descending order of the similarity scores between the query sequence and the subject sequences. Moreover, it is capable of processing both query and subject protein sequences comprising as many as 8192 amino acid residues in a single pass. Using data from the National Center for Biotechnology Information (NCBI) database, we show that a speed-up of more than 3X can be achieved with our hardware compared to the runtime required by BLASTP software on an 8-thread Intel Xeon CPU with 144 GB DRAM.

Proteins/genetics , Sequence Alignment/instrumentation , Amino Acid Sequence , Databases, Factual , Equipment Design , Sequence Alignment/methods

6.

Freiburg RNA tools: a central online resource for RNA-focused research and teaching.

Raden, Martin; Ali, Syed M; Alkhnbashi, Omer S; Busch, Anke; Costa, Fabrizio; Davis, Jason A; Eggenhofer, Florian; Gelhausen, Rick; Georg, Jens; Heyne, Steffen; Hiller, Michael; Kundu, Kousik; Kleinkauf, Robert; Lott, Steffen C; Mohamed, Mostafa M; Mattheis, Alexander; Miladi, Milad; Richter, Andreas S; Will, Sebastian; Wolff, Joachim; Wright, Patrick R; Backofen, Rolf.

Nucleic Acids Res ; 46(W1): W25-W29, 2018 07 02.

Article En | MEDLINE | ID: mdl-29788132

The Freiburg RNA tools webserver is a well-established online resource for RNA-focused research. It provides a unified user interface and comprehensive result visualization for efficient command line tools. The webserver includes RNA-RNA interaction prediction (IntaRNA, CopraRNA, metaMIR), sRNA homology search (GLASSgo), sequence-structure alignments (LocARNA, MARNA, CARNA, ExpaRNA), CRISPR repeat classification (CRISPRmap), sequence design (antaRNA, INFO-RNA, SECISDesign), structure aberration evaluation of point mutations (RaSE), RNA/protein-family model visualization (CMV), and other methods. Open education resources offer interactive visualizations of RNA structure and RNA-RNA interaction prediction as well as basic and advanced sequence alignment algorithms. The services are freely available at http://rna.informatik.uni-freiburg.de.

Base Sequence/genetics , Internet , RNA/genetics , Software , Algorithms , Nucleic Acid Conformation , RNA/chemistry , Sequence Alignment/instrumentation , Sequence Analysis, RNA/instrumentation , Structure-Activity Relationship

7.

Rapid protein alignment in the cloud: HAMOND combines fast DIAMOND alignments with Hadoop parallelism.

Yu, Jia; Blom, Jochen; Sczyrba, Alexander; Goesmann, Alexander.

J Biotechnol ; 257: 58-60, 2017 Sep 10.

Article En | MEDLINE | ID: mdl-28232083

The introduction of next generation sequencing has caused a steady increase in the amounts of data that have to be processed in modern life science. Sequence alignment plays a key role in the analysis of sequencing data, e.g. within whole genome sequencing or metagenome projects. BLAST is a commonly used alignment tool that was the standard approach for more than two decades, but in recent years faster alternatives have been proposed, including RapSearch, GHOSTX, and DIAMOND. Here we introduce HAMOND, an application that uses Apache Hadoop to parallelize DIAMOND computation in order to scale out the calculation of alignments. HAMOND is fault tolerant and scalable by utilizing large cloud computing infrastructures like Amazon Web Services. HAMOND has been tested in comparative genomics analyses and showed promising results both in efficiency and accuracy.

Cloud Computing , High-Throughput Nucleotide Sequencing/methods , Sequence Alignment/methods , Sequence Analysis, DNA/methods , Comparative Genomic Hybridization , Computational Biology , Genomics/methods , High-Throughput Nucleotide Sequencing/instrumentation , Internet , Metagenome , Molecular Sequence Data , Sequence Alignment/instrumentation , Sequence Analysis, DNA/instrumentation , Software , Whole Genome Sequencing

8.

TRIC: an automated alignment strategy for reproducible protein quantification in targeted proteomics.

Röst, Hannes L; Liu, Yansheng; D'Agostino, Giuseppe; Zanella, Matteo; Navarro, Pedro; Rosenberger, George; Collins, Ben C; Gillet, Ludovic; Testa, Giuseppe; Malmström, Lars; Aebersold, Ruedi.

Nat Methods ; 13(9): 777-83, 2016 09.

Article En | MEDLINE | ID: mdl-27479329

Next-generation mass spectrometric (MS) techniques such as SWATH-MS have substantially increased the throughput and reproducibility of proteomic analysis, but ensuring consistent quantification of thousands of peptide analytes across multiple liquid chromatography-tandem MS (LC-MS/MS) runs remains a challenging and laborious manual process. To produce highly consistent and quantitatively accurate proteomics data matrices in an automated fashion, we developed TRIC (http://proteomics.ethz.ch/tric/), a software tool that utilizes fragment-ion data to perform cross-run alignment, consistent peak-picking and quantification for high-throughput targeted proteomics. TRIC reduced the identification error compared to a state-of-the-art SWATH-MS analysis without alignment by more than threefold at constant recall while correcting for highly nonlinear chromatographic effects. On a pulsed-SILAC experiment performed on human induced pluripotent stem cells, TRIC was able to automatically align and quantify thousands of light and heavy isotopic peak groups. Thus, TRIC fills a gap in the pipeline for automated analysis of massively parallel targeted proteomics data sets.

Electronic Data Processing/methods , Peptides/analysis , Proteomics/methods , Sequence Alignment/methods , Sequence Analysis, Protein/methods , Software , Algorithms , Electronic Data Processing/instrumentation , Humans , Mass Spectrometry , Peptides/metabolism , Pluripotent Stem Cells/metabolism , Protein Precursors/analysis , Protein Precursors/metabolism , Proteolysis , Proteomics/instrumentation , Reproducibility of Results , Sequence Alignment/instrumentation , Sequence Analysis, Protein/instrumentation , Streptococcus pyogenes/metabolism

9.

B-MIC: An Ultrafast Three-Level Parallel Sequence Aligner Using MIC.

Cui, Yingbo; Liao, Xiangke; Zhu, Xiaoqian; Wang, Bingqiang; Peng, Shaoliang.

Interdiscip Sci ; 8(1): 28-34, 2016 Mar.

Article En | MEDLINE | ID: mdl-26358141

Sequence alignment is the central process of sequence analysis, in which raw sequencing data are mapped to a reference genome. The large amount of data generated by NGS is far beyond the processing capabilities of existing alignment tools. Consequently, sequence alignment becomes the bottleneck of sequence analysis. Intensive computing power is required to address this challenge. Intel recently announced the MIC coprocessor, which can provide massive computing power. The Tianhe-2, now the world's fastest supercomputer, is equipped with three MIC coprocessors per compute node. A key feature of sequence alignment is that different reads are independent. Considering this property, we proposed a MIC-oriented three-level parallelization strategy to speed up BWA, a widely used sequence alignment tool, and developed our ultrafast parallel sequence aligner: B-MIC. B-MIC contains three levels of parallelization: firstly, parallelization of data IO and reads alignment by a three-stage parallel pipeline; secondly, parallelization enabled by MIC coprocessor technology; thirdly, inter-node parallelization implemented by MPI. In this paper, we demonstrate that B-MIC outperforms BWA by a combination of those techniques using an Inspur NF5280M server and the Tianhe-2 supercomputer. To the best of our knowledge, B-MIC is the first sequence alignment tool to run on Intel MIC, and it can achieve more than fivefold speedup over the original BWA while maintaining the alignment precision.

Computers , Sequence Alignment/instrumentation , Sequence Analysis, DNA/instrumentation , Software , Algorithms

10.

NGS-QCbox and Raspberry for Parallel, Automated and Rapid Quality Control Analysis of Large-Scale Next Generation Sequencing (Illumina) Data.

Katta, Mohan A V S K; Khan, Aamir W; Doddamani, Dadakhalandar; Thudi, Mahendar; Varshney, Rajeev K.

PLoS One ; 10(10): e0139868, 2015.

Article En | MEDLINE | ID: mdl-26460497

The rapid popularity and adoption of next generation sequencing (NGS) approaches have generated huge volumes of data. High throughput platforms like Illumina HiSeq produce terabytes of raw data that require quick processing. Quality control of the data is an important component prior to the downstream analyses. To address these issues, we have developed a quality control pipeline, NGS-QCbox, that scales up to process hundreds or thousands of samples. Raspberry is an in-house tool, developed in the C language utilizing HTSlib (v1.2.1) (http://htslib.org), for computing read/base level statistics. It can be used as a stand-alone application and can process both compressed and uncompressed FASTQ format files. NGS-QCbox integrates Raspberry with other open-source tools for alignment (Bowtie2), SNP calling (SAMtools) and other utilities (bedtools) towards analyzing raw NGS data at higher efficiency and in a high-throughput manner. The pipeline implements batch processing of jobs in parallel using Bpipe (https://github.com/ssadedin/bpipe) and, internally, a fine-grained task parallelization utilizing OpenMP. It reports read and base statistics along with genome coverage and variants in a user-friendly format. The pipeline presents a simple menu-driven interface and can be used in either quick or complete mode. In addition, the pipeline in quick mode is faster than other similar existing QC pipelines/tools. The NGS-QCbox pipeline, Raspberry tool and associated scripts are made available at the URLs https://github.com/CEG-ICRISAT/NGS-QCbox and https://github.com/CEG-ICRISAT/Raspberry for rapid quality control analysis of large-scale next generation sequencing (Illumina) data.

High-Throughput Nucleotide Sequencing/instrumentation , Internet , Sequence Alignment/instrumentation , Sequence Alignment/methods , Software , High-Throughput Nucleotide Sequencing/methods

11.

Concurrent and Accurate Short Read Mapping on Multicore Processors.

Martínez, Héctor; Tárraga, Joaquín; Medina, Ignacio; Barrachina, Sergio; Castillo, Maribel; Dopazo, Joaquín; Quintana-Ortí, Enrique S.

IEEE/ACM Trans Comput Biol Bioinform ; 12(5): 995-1007, 2015.

Article En | MEDLINE | ID: mdl-26451814

We introduce a parallel aligner with a work-flow organization for fast and accurate mapping of RNA sequences on servers equipped with multicore processors. Our software, HPG Aligner SA (an open-source application available at http://www.opencb.org), exploits a suffix array to rapidly map a large fraction of the RNA fragments (reads), and leverages the accuracy of the Smith-Waterman algorithm to deal with conflictive reads. The aligner is enhanced with a careful strategy to detect splice junctions based on an adaptive division of RNA reads into small segments (or seeds), which are then mapped onto a number of candidate alignment locations, providing crucial information for the successful alignment of the complete reads. The experimental results on a platform with Intel multicore technology report the parallel performance of HPG Aligner SA on RNA reads of 100-400 nucleotides, which excels in execution time/sensitivity over state-of-the-art aligners such as TopHat 2+Bowtie 2, MapSplice, and STAR.

Chromosome Mapping/instrumentation , High-Throughput Nucleotide Sequencing/instrumentation , RNA/genetics , Sequence Analysis, RNA/instrumentation , Signal Processing, Computer-Assisted/instrumentation , Software , Base Sequence , Chromosome Mapping/methods , Equipment Design , Equipment Failure Analysis , High-Throughput Nucleotide Sequencing/methods , Molecular Sequence Data , Reproducibility of Results , Sensitivity and Specificity , Sequence Alignment/instrumentation , Sequence Alignment/methods , Sequence Analysis, RNA/methods

12.

Efficient and Accurate OTU Clustering with GPU-Based Sequence Alignment and Dynamic Dendrogram Cutting.

Nguyen, Thuy-Diem; Schmidt, Bertil; Zheng, Zejun; Kwoh, Chee-Keong.

IEEE/ACM Trans Comput Biol Bioinform ; 12(5): 1060-73, 2015.

Article En | MEDLINE | ID: mdl-26451819

De novo clustering is a popular technique to perform taxonomic profiling of a microbial community by grouping 16S rRNA amplicon reads into operational taxonomic units (OTUs). In this work, we introduce a new dendrogram-based OTU clustering pipeline called CRiSPy. The key idea used in CRiSPy to improve clustering accuracy is the application of an anomaly detection technique to obtain a dynamic distance cutoff instead of using the de facto value of 97 percent sequence similarity as in most existing OTU clustering pipelines. This technique works by detecting an abrupt change in the merging heights of a dendrogram. To produce the output dendrograms, CRiSPy employs the OTU hierarchical clustering approach that is computed on a genetic distance matrix derived from an all-against-all read comparison by pairwise sequence alignment. However, most existing dendrogram-based tools have difficulty processing datasets larger than 10,000 unique reads due to high computational complexity. We address this difficulty by developing two efficient algorithms for CRiSPy: a compute-efficient GPU-accelerated parallel algorithm for pairwise distance matrix computation and a memory-efficient hierarchical clustering algorithm. Our experiments on various datasets with distinct attributes show that CRiSPy is able to produce more accurate OTU groupings than most OTU clustering applications.
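The idea of deriving a cutoff from an abrupt change in dendrogram merging heights can be illustrated with a deliberately simple stand-in. The sketch below is not CRiSPy's anomaly detection technique, whose details are not given in the abstract; it assumes the merge heights of a hierarchical clustering are already available and simply cuts at the largest jump between consecutive heights. The function name `dynamic_cutoff` and the example values are hypothetical.

```python
def dynamic_cutoff(merge_heights):
    """Return (cutoff, n_clusters) from the largest jump in merge heights.

    `merge_heights` holds one distance per merge of a full dendrogram,
    so a dendrogram over n leaves contributes n - 1 values.
    """
    heights = sorted(merge_heights)
    if len(heights) < 2:
        raise ValueError("need at least two merge heights")
    # Gap between each merge height and the next one up the tree.
    jumps = [heights[i + 1] - heights[i] for i in range(len(heights) - 1)]
    # Index of the largest gap (the first one wins on ties).
    k = max(range(len(jumps)), key=jumps.__getitem__)
    # Cut halfway across the gap: merges 0..k are kept, the rest are undone.
    cutoff = (heights[k] + heights[k + 1]) / 2
    n_leaves = len(heights) + 1
    n_clusters = n_leaves - (k + 1)
    return cutoff, n_clusters
```

With hypothetical heights `[0.01, 0.02, 0.03, 0.25, 0.30]`, the largest jump lies between 0.03 and 0.25, so the cut falls near 0.14 and leaves three clusters, rather than whatever a fixed 0.03 distance (97 percent similarity) threshold would give.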

Algorithms , Computer Graphics/instrumentation , High-Throughput Nucleotide Sequencing/instrumentation , RNA, Bacterial/genetics , RNA, Ribosomal, 16S/genetics , Sequence Alignment/instrumentation , Base Sequence , Equipment Design , Equipment Failure Analysis , High-Throughput Nucleotide Sequencing/methods , Molecular Sequence Data , Pattern Recognition, Automated/methods , Sequence Alignment/methods , Signal Processing, Computer-Assisted/instrumentation

13.

Accelerating the Next Generation Long Read Mapping with the FPGA-Based System.

Chen, Peng; Wang, Chao; Li, Xi; Zhou, Xuehai.

IEEE/ACM Trans Comput Biol Bioinform ; 11(5): 840-52, 2014.

Article En | MEDLINE | ID: mdl-26356857

Comparing newly determined sequences against the subject sequences stored in databases is a critical task in bioinformatics. Fortunately, a recent survey reports that state-of-the-art aligners are already fast enough to handle the very large volume of short sequence reads in reasonable time. However, aligning the long sequence reads (>400 bp) generated by next generation sequencing (NGS) technology is still quite inefficient with present aligners. Furthermore, the challenge becomes more and more serious as both the lengths and the amounts of the sequence reads keep increasing with the improvement of sequencing technology. Thus, it is extremely urgent for researchers to enhance the performance of long read alignment. In this paper, we propose a novel FPGA-based system to improve the efficiency of long read mapping. Compared to the state-of-the-art long read aligner BWA-SW, our accelerating platform could achieve high performance with almost the same sensitivity. Experiments demonstrate that, for reads with lengths ranging from 512 up to 4,096 base pairs, the described system obtains a 10x-48x speedup for the bottleneck of the software. As to the whole mapping procedure, the FPGA-based platform could achieve a 1.8x-3.3x speedup versus the BWA-SW aligner, reducing the alignment cycles from weeks to days.

Genomics/methods , High-Throughput Nucleotide Sequencing/methods , Sequence Alignment/methods , Sequence Analysis, DNA/methods , Algorithms , Equipment Design , Genomics/instrumentation , High-Throughput Nucleotide Sequencing/instrumentation , Humans , Sequence Alignment/instrumentation , Sequence Analysis, DNA/instrumentation

14.

HAL: a hierarchical format for storing and analyzing multiple genome alignments.

Hickey, Glenn; Paten, Benedict; Earl, Dent; Zerbino, Daniel; Haussler, David.

Bioinformatics ; 29(10): 1341-2, 2013 May 15.

Article En | MEDLINE | ID: mdl-23505295

MOTIVATION: Large multiple genome alignments and inferred ancestral genomes are ideal resources for comparative studies of molecular evolution, and advances in sequencing and computing technology are making them increasingly obtainable. These structures can provide a rich understanding of the genetic relationships between all subsets of species they contain. Current formats for storing genomic alignments, such as XMFA and MAF, are all indexed or ordered using a single reference genome, however, which limits the information that can be queried with respect to other species and clades. This loss of information grows with the number of species under comparison, as well as their phylogenetic distance. RESULTS: We present HAL, a compressed, graph-based hierarchical alignment format for storing multiple genome alignments and ancestral reconstructions. HAL graphs are indexed on all genomes they contain. Furthermore, they are organized phylogenetically, which allows for modular and parallel access to arbitrary subclades without fragmentation because of rearrangements that have occurred in other lineages. HAL graphs can be created or read with a comprehensive C++ API. A set of tools is also provided to perform basic operations, such as importing and exporting data, identifying mutations and coordinate mapping (liftover). AVAILABILITY: All documentation and source code for the HAL API and tools are freely available at http://github.com/glennhickey/hal. CONTACT: hickey@soe.ucsc.edu or haussler@soe.ucsc.edu SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.

Genome , Sequence Alignment/methods , Software , Animals , Base Sequence , Evolution, Molecular , Genomics/methods , Humans , Phylogeny , Programming Languages , Sequence Alignment/instrumentation

15.

Fragment merger: an online tool to merge overlapping long sequence fragments.

Bell, Trevor G; Kramvis, Anna.

Viruses ; 5(3): 824-33, 2013 Mar 12.

Article En | MEDLINE | ID: mdl-23482300

While PCR amplicons extend to a few thousand bases, the length of sequences from direct Sanger sequencing is limited to 500-800 nucleotides. Therefore, several fragments may be required to cover an amplicon, a gene or an entire genome. These fragments are typically sequenced in an overlapping fashion and assembled by manually sliding and aligning the sequences visually. This is time-consuming, repetitive and error-prone, and further complicated by circular genomes. An online tool merging two to twelve long overlapping sequence fragments was developed. Either chromatograms or FASTA files are submitted to the tool, which trims poor quality ends of chromatograms according to user-specified parameters. Fragments are assembled into a single sequence by repeatedly calling the EMBOSS merger tool in a consecutive manner. Output includes the number of trimmed nucleotides, details of each merge, and an optional alignment to a reference sequence. The final merge sequence is displayed and can be downloaded in FASTA format. All output files can be downloaded as a ZIP archive. This tool allows for easy and automated assembly of overlapping sequences and is aimed at researchers without specialist computer skills. The tool is genome- and organism-agnostic and has been developed using hepatitis B virus sequence data.
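The consecutive merging described above can be sketched as follows. This is a simplified illustration and not the tool's implementation: the real tool calls the EMBOSS merger program, which aligns the fragments, whereas this sketch assumes error-free reads and joins fragments on their longest exact suffix-prefix overlap. The function names and the `min_overlap` parameter are hypothetical.

```python
def longest_overlap(left, right, min_overlap=3):
    """Length of the longest suffix of `left` that equals a prefix of `right`."""
    for k in range(min(len(left), len(right)), min_overlap - 1, -1):
        if left[-k:] == right[:k]:
            return k
    return 0


def merge_fragments(fragments, min_overlap=3):
    """Merge ordered, overlapping fragments one after another."""
    if not fragments:
        raise ValueError("no fragments given")
    merged = fragments[0]
    for frag in fragments[1:]:
        k = longest_overlap(merged, frag, min_overlap)
        if k == 0:
            raise ValueError(f"no overlap of at least {min_overlap} bases")
        # Append only the part of the fragment that extends past the overlap.
        merged += frag[k:]
    return merged
```

For instance, `merge_fragments(["ACGTAC", "TACGGA", "GGATT"])` joins the first two fragments on `TAC` and the result with the third on `GGA`. Exact matching would fail on real chromatogram data with base-calling errors, which is why an alignment-based merger is used in practice.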

Hepatitis B virus/genetics , Online Systems/instrumentation , Sequence Alignment/instrumentation , Sequence Alignment/methods , Sequence Analysis, DNA/methods , Databases, Nucleic Acid , Hepatitis B virus/chemistry , Internet , Sequence Analysis, DNA/instrumentation , Software

16.

QuShape: rapid, accurate, and best-practices quantification of nucleic acid probing information, resolved by capillary electrophoresis.

Karabiber, Fethullah; McGinnis, Jennifer L; Favorov, Oleg V; Weeks, Kevin M.

RNA ; 19(1): 63-73, 2013 Jan.

Article En | MEDLINE | ID: mdl-23188808

Chemical probing of RNA and DNA structure is a widely used and highly informative approach for examining nucleic acid structure and for evaluating interactions with protein and small-molecule ligands. Use of capillary electrophoresis to analyze chemical probing experiments yields hundreds of nucleotides of information per experiment and can be performed on automated instruments. Extraction of the information from capillary electrophoresis electropherograms is a computationally intensive multistep analytical process, and no current software provides rapid, automated, and accurate data analysis. To overcome this bottleneck, we developed a platform-independent, user-friendly software package, QuShape, that yields quantitatively accurate nucleotide reactivity information with minimal user supervision. QuShape incorporates newly developed algorithms for signal decay correction, alignment of time-varying signals within and across capillaries and relative to the RNA nucleotide sequence, and signal scaling across channels or experiments. An analysis-by-reference option enables multiple, related experiments to be fully analyzed in minutes. We illustrate the usefulness and robustness of QuShape by analysis of RNA SHAPE (selective 2'-hydroxyl acylation analyzed by primer extension) experiments.

Electrophoresis, Capillary/methods , Nucleic Acid Probes/analysis , Software , Algorithms , DNA, Bacterial/analysis , DNA, Viral/analysis , Electrophoresis, Capillary/instrumentation , Escherichia coli/genetics , Humans , RNA, Bacterial/analysis , RNA, Viral/analysis , Sequence Alignment/instrumentation , Sequence Alignment/methods

17.

PAirwise Sequence Comparison (PASC) and its application in the classification of filoviruses.

Bao, Yiming; Chetvernin, Vyacheslav; Tatusova, Tatiana.

Viruses ; 4(8): 1318-27, 2012 08.

Article En | MEDLINE | ID: mdl-23012628

PAirwise Sequence Comparison (PASC) is a tool that uses genome sequence similarity to help with virus classification. The PASC tool at NCBI uses two methods: local alignment based on BLAST and global alignment based on the Needleman-Wunsch algorithm. It works for complete genomes of viruses of several families/groups, and for the family Filoviridae, it currently includes 52 complete genomes available in GenBank. It has been shown that the BLAST-based alignment approach works better for filoviruses, and it is therefore recommended for establishing taxon demarcation criteria. When more genome sequences with high divergence become available, these demarcations will most likely become more precise. The tool can compare new genome sequences of filoviruses with the ones already in the database, and propose their taxonomic classification.
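For readers unfamiliar with the global alignment method named above, here is a minimal sketch of the Needleman-Wunsch dynamic programme that returns only the optimal score. It is not PASC's code: the function name and the scoring values (match +1, mismatch -1, linear gap penalty -2) are assumptions chosen for illustration.

```python
def needleman_wunsch_score(a, b, match=1, mismatch=-1, gap=-2):
    """Optimal global alignment score of sequences `a` and `b`."""
    # Row 0: aligning the empty prefix of `a` against b[:j] costs j gaps.
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        # Column 0: aligning a[:i] against the empty prefix costs i gaps.
        curr = [i * gap]
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            up = prev[j] + gap        # a[i-1] aligned to a gap
            left = curr[j - 1] + gap  # b[j-1] aligned to a gap
            curr.append(max(diag, up, left))
        prev = curr
    return prev[-1]
```

Only two rows of the table are kept, so memory grows with the shorter dimension chosen for `b`, but recovering the alignment itself (and hence a percent identity) would require a traceback over the full table.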

Filoviridae/classification , Sequence Alignment/methods , Base Sequence , Databases, Nucleic Acid , Filoviridae/chemistry , Filoviridae/genetics , Molecular Sequence Data , Sequence Alignment/instrumentation , Sequence Analysis, DNA , Software

18.

SIMPLEX: cloud-enabled pipeline for the comprehensive analysis of exome sequencing data.

Fischer, Maria; Snajder, Rene; Pabinger, Stephan; Dander, Andreas; Schossig, Anna; Zschocke, Johannes; Trajanoski, Zlatko; Stocker, Gernot.

PLoS One ; 7(8): e41948, 2012.

Article En | MEDLINE | ID: mdl-22870267

In recent studies, exome sequencing has proven to be a successful screening tool for the identification of candidate genes causing rare genetic diseases. Although underlying targeted sequencing methods are well established, necessary data handling and focused, structured analysis still remain demanding tasks. Here, we present a cloud-enabled autonomous analysis pipeline, which comprises the complete exome analysis workflow. The pipeline combines several in-house developed and published applications to perform the following steps: (a) initial quality control, (b) intelligent data filtering and pre-processing, (c) sequence alignment to a reference genome, (d) SNP and DIP detection, (e) functional annotation of variants using different approaches, and (f) detailed report generation during various stages of the workflow. The pipeline connects the selected analysis steps, exposes all available parameters for customized usage, performs required data handling, and distributes computationally expensive tasks either on a dedicated high-performance computing infrastructure or on the Amazon cloud environment (EC2). The presented application has already been used in several research projects including studies to elucidate the role of rare genetic diseases. The pipeline is continuously tested and is publicly available under the GPL as a VirtualBox or Cloud image at http://simplex.i-med.ac.at; additional supplementary data is provided at http://www.icbi.at/exome.

Exome , Internet , Polymorphism, Single Nucleotide , Sequence Alignment/methods , Sequence Analysis, DNA/methods , Software , Sequence Alignment/instrumentation , Sequence Analysis, DNA/instrumentation

19.

Accelerating pairwise statistical significance estimation for local alignment by harvesting GPU's power.

Zhang, Yuhong; Misra, Sanchit; Agrawal, Ankit; Patwary, Md Mostofa Ali; Liao, Wei-Keng; Qin, Zhiguang; Choudhary, Alok.

BMC Bioinformatics ; 13 Suppl 5: S3, 2012 Apr 12.

Article En | MEDLINE | ID: mdl-22537007

BACKGROUND: Pairwise statistical significance has been recognized to be able to accurately identify related sequences, which is a very important cornerstone procedure in numerous bioinformatics applications. However, it is both computationally and data intensive, which poses a big challenge in terms of performance and scalability. RESULTS: We present a GPU implementation to accelerate pairwise statistical significance estimation of local sequence alignment using standard substitution matrices. By carefully studying the algorithm's data access characteristics, we developed a tile-based scheme that can produce a contiguous data access in the GPU global memory and sustain a large number of threads to achieve a high GPU occupancy. We further extend the parallelization technique to estimate pairwise statistical significance using position-specific substitution matrices, which has earlier demonstrated significantly better sequence comparison accuracy than using standard substitution matrices. The implementation is also extended to take advantage of dual GPUs. We observe end-to-end speedups of nearly 250× (370×) using a single Tesla C2050 GPU (dual Tesla C2050 GPUs) over the CPU implementation on an Intel Core i7 920 processor. CONCLUSIONS: Harvesting the high performance of modern GPUs is a promising approach to accelerate pairwise statistical significance estimation for local sequence alignment.

Computer Graphics/instrumentation , Sequence Alignment/methods , Sequence Analysis, Protein/methods , Algorithms , Sequence Alignment/instrumentation , Sequence Analysis, Protein/instrumentation , Software

20.

Protein databases on the Internet.

Xu, Dong.

Curr Protoc Mol Biol ; Chapter 19: Unit 19.4., 2012 Jan.

Article En | MEDLINE | ID: mdl-22237858

Protein databases have become a crucial part of modern biology. Huge amounts of data for protein structures, functions, and particularly sequences are being generated. Searching databases is often the first step in the study of a new protein. Comparison between proteins or between protein families provides information about the relationship between proteins within a genome or across different species, and hence offers much more information than can be obtained by studying only an isolated protein. In addition, secondary databases derived from experimental databases are also widely available. These databases reorganize and annotate the data or provide predictions. The use of multiple databases often helps researchers understand the structure and function of a protein. Although some protein databases are widely known, they are far from being fully utilized in the protein science community. This unit provides a starting point for readers to explore the potential of protein databases on the Internet.

Computational Biology/methods , Databases, Protein , Proteins/chemistry , Sequence Alignment/methods , Amino Acid Sequence , Animals , Computational Biology/instrumentation , Humans , Internet , Proteins/genetics , Sequence Alignment/instrumentation