Results 1 - 20 of 66
1.
Methods Mol Biol ; 2231: 39-47, 2021.
Article in English | MEDLINE | ID: mdl-33289885

ABSTRACT

Multiple sequence alignment (MSA) is a central step in many bioinformatics and computational biology analyses. Although many methods exist to perform MSA, most of them fail when dealing with large datasets due to their high computational cost. MSAProbs-MPI is a publicly available tool (http://msaprobs.sourceforge.net) that provides highly accurate results in a relatively short runtime by exploiting the hardware resources of multicore clusters. In this chapter, I explain the statistical and biological concepts employed in MSAProbs-MPI to complete the alignments, as well as the high-performance computing techniques used to accelerate it. Moreover, I provide some hints about the configuration parameters that should be used to guarantee high-performance executions.


Subjects
Computational Biology/methods , Sequence Alignment/methods , Software , Algorithms , Computational Biology/instrumentation , Computing Methodologies , Sequence Alignment/instrumentation
2.
Methods Mol Biol ; 2231: 89-97, 2021.
Article in English | MEDLINE | ID: mdl-33289888

ABSTRACT

Many fields of biology rely on the inference of accurate multiple sequence alignments (MSA) of biological sequences. Unfortunately, the problem of assembling an MSA is NP-complete, which limits computation to approximate solutions obtained with heuristics. The progressive algorithm is one of the most popular frameworks for the computation of MSAs. It involves pre-clustering the sequences and aligning them starting with the most similar ones. The scalability of this framework is limited, especially with respect to accuracy. We present here an alternative approach named the regressive algorithm. In this framework, sequences are first clustered and then aligned starting with the most distantly related ones. This approach has been shown to greatly improve accuracy during scale-up, especially on datasets featuring 10,000 sequences or more. Another benefit is the possibility to integrate third-party clustering methods and third-party MSA aligners. The regressive algorithm has been tested on up to 1.5 million sequences; its implementation is available in the T-Coffee package.


Subjects
Computational Biology/methods , Sequence Alignment/methods , Software , Algorithms , Cluster Analysis , Computational Biology/instrumentation , Sequence Alignment/instrumentation
3.
Methods Mol Biol ; 2231: 179-200, 2021.
Article in English | MEDLINE | ID: mdl-33289894

ABSTRACT

Bioinformatic analysis of functionally diverse superfamilies can help to study the structure-function relationship in proteins, but represents a methodological challenge. The Mustguseal web-server can build large structure-guided sequence alignments of thousands of homologs that cover all currently available sequence variants within a common structural fold. The input to the method is a PDB code of the query protein, which represents the protein superfamily of interest. The collection and subsequent alignment of protein sequences and structures is fully automated and driven by the particular choice of parameters. Four integrated sister web methods (Zebra, pocketZebra, visualCMAT, and Yosshi) are available to further analyze the resulting superimposition and identify conserved, subfamily-specific, and co-evolving residues, as well as to classify and study disulfide bonds in protein superfamilies. The integration of these web-based bioinformatic tools provides an out-of-the-box, easy-to-use solution, the first of its kind, to study protein function and regulation and to design improved enzyme variants for practical applications and selective ligands to modulate their functional properties. In this chapter, we provide a step-by-step protocol for a comprehensive bioinformatic analysis of a protein superfamily using a web browser as the main tool, along with notes on selecting appropriate values for the key algorithm parameters depending on your research objective. The web-servers are freely available to all users at https://biokinet.belozersky.msu.ru/m-platform with no login requirement.


Subjects
Computational Biology/methods , Proteins/chemistry , Sequence Alignment/methods , Software , Algorithms , Amino Acid Sequence , Computational Biology/instrumentation , Disulfides/chemistry , Internet , Ligands , Protein Structure, Tertiary , Sequence Alignment/instrumentation
4.
IEEE/ACM Trans Comput Biol Bioinform ; 17(4): 1093-1104, 2020.
Article in English | MEDLINE | ID: mdl-30530369

ABSTRACT

The FM-index is a compact data structure suitable for fast matching of short reads to large reference genomes. The matching algorithm using this index exhibits irregular memory access patterns that cause frequent cache misses, resulting in a memory-bound problem. This paper analyzes different FM-index versions presented in the literature, focusing on those computing aspects related to data access. As a result of the analysis, we propose a new organization of the FM-index that minimizes the demand for memory bandwidth, allowing a great improvement in performance on processors with high-bandwidth memory, such as the second-generation Intel Xeon Phi (Knights Landing, or KNL), which integrates ultra-high-bandwidth stacked memory technology. As the roofline model shows, our implementation reaches 95 percent of the peak random access bandwidth limit when executed on the KNL and almost all of the available bandwidth when executed on other Intel Xeon architectures with conventional DDR memory. In addition, the throughput obtained on the KNL is much higher than the results reported for GPUs in the literature.
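
To make the matching step described in this abstract concrete, here is a minimal Python sketch of the textbook FM-index backward search. It is not the memory layout proposed in the paper; the function names, the `$` sentinel, and the naive suffix-array construction are assumptions chosen for illustration. The two `occ` lookups per pattern character land at unrelated positions, which is the irregular access pattern the abstract refers to.

```python
from collections import Counter


def build_fm_index(text):
    """Build a toy FM-index; '$' is an assumed sentinel absent from text."""
    text += "$"
    # Naive suffix array: fine for a demo, far too slow for real genomes.
    sa = sorted(range(len(text)), key=lambda i: text[i:])
    # Burrows-Wheeler transform: the character preceding each sorted suffix.
    bwt = "".join(text[i - 1] for i in sa)
    # c_table[ch]: number of characters in text strictly smaller than ch.
    counts = Counter(text)
    c_table, total = {}, 0
    for ch in sorted(counts):
        c_table[ch] = total
        total += counts[ch]
    # occ[ch][i]: occurrences of ch in bwt[:i] (uncompressed, for clarity).
    occ = {ch: [0] * (len(bwt) + 1) for ch in counts}
    for i, ch in enumerate(bwt):
        for sym in occ:
            occ[sym][i + 1] = occ[sym][i] + (1 if sym == ch else 0)
    return sa, c_table, occ


def count_matches(pattern, c_table, occ, n):
    """Count exact occurrences of pattern by backward search over n BWT rows."""
    lo, hi = 0, n  # half-open range of suffix-array rows still matching
    for ch in reversed(pattern):
        if ch not in c_table:
            return 0
        # Each step needs two occ lookups at data-dependent positions.
        lo = c_table[ch] + occ[ch][lo]
        hi = c_table[ch] + occ[ch][hi]
        if lo >= hi:
            return 0
    return hi - lo


sa, c_table, occ = build_fm_index("GATTACAGATTACA")
print(count_matches("ATTA", c_table, occ, len(sa)))
```

Production indexes sample the `occ` table and pack it into cache-line-sized blocks instead of storing every prefix count as done here.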


Subjects
Genomics , Sequence Alignment , Algorithms , Computers , DNA/genetics , Genome, Human/genetics , Genomics/instrumentation , Genomics/methods , Humans , Sequence Alignment/instrumentation , Sequence Alignment/methods
5.
IEEE Trans Biomed Circuits Syst ; 13(6): 1771-1782, 2019 12.
Article in English | MEDLINE | ID: mdl-31581096

ABSTRACT

In this study, we design a hardware accelerator for a widely used sequence alignment algorithm, the basic local alignment search tool for proteins (BLASTP). The architecture of the proposed accelerator consists of five stages: a new systolic-array-based one-hit finding stage, a novel RAM-REG-based two-hit finding stage, a refined ungapped extension stage, a faster gapped extension stage, and a highly efficient parallel sorter. The system is implemented on an Altera Stratix V FPGA with a processing speed of more than 500 giga cell updates per second (GCUPS). It can receive a query sequence, compare it with the sequences in the database, and generate a list sorted in descending order of the similarity scores between the query sequence and the subject sequences. Moreover, it is capable of processing both query and subject protein sequences comprising as many as 8192 amino acid residues in a single pass. Using data from the National Center for Biotechnology Information (NCBI) database, we show that a speed-up of more than 3X can be achieved with our hardware compared to the runtime required by BLASTP software on an 8-thread Intel Xeon CPU with 144 GB DRAM.


Subjects
Proteins/genetics , Sequence Alignment/instrumentation , Amino Acid Sequence , Databases, Factual , Equipment Design , Sequence Alignment/methods
6.
Nucleic Acids Res ; 46(W1): W25-W29, 2018 07 02.
Article in English | MEDLINE | ID: mdl-29788132

ABSTRACT

The Freiburg RNA tools webserver is a well-established online resource for RNA-focused research. It provides a unified user interface and comprehensive result visualization for efficient command line tools. The webserver includes RNA-RNA interaction prediction (IntaRNA, CopraRNA, metaMIR), sRNA homology search (GLASSgo), sequence-structure alignments (LocARNA, MARNA, CARNA, ExpaRNA), CRISPR repeat classification (CRISPRmap), sequence design (antaRNA, INFO-RNA, SECISDesign), structure aberration evaluation of point mutations (RaSE), RNA/protein-family model visualization (CMV), and other methods. Open education resources offer interactive visualizations of RNA structure and RNA-RNA interaction prediction as well as basic and advanced sequence alignment algorithms. The services are freely available at http://rna.informatik.uni-freiburg.de.


Subjects
Base Sequence/genetics , Internet , RNA/genetics , Software , Algorithms , Nucleic Acid Conformation , RNA/chemistry , Sequence Alignment/instrumentation , Sequence Analysis, RNA/instrumentation , Structure-Activity Relationship
7.
J Biotechnol ; 257: 58-60, 2017 Sep 10.
Article in English | MEDLINE | ID: mdl-28232083

ABSTRACT

The introduction of next generation sequencing has caused a steady increase in the amounts of data that have to be processed in modern life science. Sequence alignment plays a key role in the analysis of sequencing data, e.g., within whole genome sequencing or metagenome projects. BLAST is a commonly used alignment tool that was the standard approach for more than two decades, but in recent years faster alternatives have been proposed, including RapSearch, GHOSTX, and DIAMOND. Here we introduce HAMOND, an application that uses Apache Hadoop to parallelize DIAMOND computation in order to scale out the calculation of alignments. HAMOND is fault tolerant and scalable, utilizing large cloud computing infrastructures like Amazon Web Services. HAMOND has been tested in comparative genomics analyses and showed promising results in both efficiency and accuracy.


Subjects
Cloud Computing , High-Throughput Nucleotide Sequencing/methods , Sequence Alignment/methods , Sequence Analysis, DNA/methods , Comparative Genomic Hybridization , Computational Biology , Genomics/methods , High-Throughput Nucleotide Sequencing/instrumentation , Internet , Metagenome , Molecular Sequence Data , Sequence Alignment/instrumentation , Sequence Analysis, DNA/instrumentation , Software , Whole Genome Sequencing
8.
Nat Methods ; 13(9): 777-83, 2016 09.
Article in English | MEDLINE | ID: mdl-27479329

ABSTRACT

Next-generation mass spectrometric (MS) techniques such as SWATH-MS have substantially increased the throughput and reproducibility of proteomic analysis, but ensuring consistent quantification of thousands of peptide analytes across multiple liquid chromatography-tandem MS (LC-MS/MS) runs remains a challenging and laborious manual process. To produce highly consistent and quantitatively accurate proteomics data matrices in an automated fashion, we developed TRIC (http://proteomics.ethz.ch/tric/), a software tool that utilizes fragment-ion data to perform cross-run alignment, consistent peak-picking and quantification for high-throughput targeted proteomics. TRIC reduced the identification error compared to a state-of-the-art SWATH-MS analysis without alignment by more than threefold at constant recall while correcting for highly nonlinear chromatographic effects. On a pulsed-SILAC experiment performed on human induced pluripotent stem cells, TRIC was able to automatically align and quantify thousands of light and heavy isotopic peak groups. Thus, TRIC fills a gap in the pipeline for automated analysis of massively parallel targeted proteomics data sets.


Subjects
Electronic Data Processing/methods , Peptides/analysis , Proteomics/methods , Sequence Alignment/methods , Sequence Analysis, Protein/methods , Software , Algorithms , Electronic Data Processing/instrumentation , Humans , Mass Spectrometry , Peptides/metabolism , Pluripotent Stem Cells/metabolism , Protein Precursors/analysis , Protein Precursors/metabolism , Proteolysis , Proteomics/instrumentation , Reproducibility of Results , Sequence Alignment/instrumentation , Sequence Analysis, Protein/instrumentation , Streptococcus pyogenes/metabolism
9.
Interdiscip Sci ; 8(1): 28-34, 2016 Mar.
Article in English | MEDLINE | ID: mdl-26358141

ABSTRACT

Sequence alignment, in which raw sequencing data are mapped to a reference genome, is the central process of sequence analysis. The large amount of data generated by NGS is far beyond the processing capabilities of existing alignment tools. Consequently, sequence alignment becomes the bottleneck of sequence analysis. Intensive computing power is required to address this challenge. Intel recently announced the MIC coprocessor, which can provide massive computing power. The Tianhe-2, currently the world's fastest supercomputer, is equipped with three MIC coprocessors per compute node. A key feature of sequence alignment is that different reads are independent. Considering this property, we proposed a MIC-oriented three-level parallelization strategy to speed up BWA, a widely used sequence alignment tool, and developed our ultrafast parallel sequence aligner: B-MIC. B-MIC contains three levels of parallelization: first, parallelization of data I/O and read alignment by a three-stage parallel pipeline; second, parallelization enabled by MIC coprocessor technology; third, inter-node parallelization implemented with MPI. In this paper, we demonstrate that B-MIC outperforms BWA through a combination of those techniques, using an Inspur NF5280M server and the Tianhe-2 supercomputer. To the best of our knowledge, B-MIC is the first sequence alignment tool to run on Intel MIC, and it can achieve more than fivefold speedup over the original BWA while maintaining the alignment precision.
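
The first of the three levels, overlapping I/O with alignment in a three-stage pipeline, can be sketched in Python with threads and bounded queues. This is only an illustration of the general pattern under stated assumptions: `run_pipeline`, the `_DONE` sentinel, and the `align` callback are hypothetical names, and nothing here reflects B-MIC's actual implementation.

```python
import queue
import threading

_DONE = object()  # sentinel marking the end of the stream (assumed convention)


def run_pipeline(reads, align, depth=4):
    """Run load -> align -> store as three concurrent stages."""
    loaded = queue.Queue(maxsize=depth)   # bounded, so the reader cannot run far ahead
    aligned = queue.Queue(maxsize=depth)
    results = []

    def reader():
        # Stage 1: input. A real tool would parse FASTQ records here.
        for read in reads:
            loaded.put(read)
        loaded.put(_DONE)

    def aligner():
        # Stage 2: compute. Reads are independent, so this stage could be replicated.
        while (read := loaded.get()) is not _DONE:
            aligned.put(align(read))
        aligned.put(_DONE)

    def writer():
        # Stage 3: output. A real tool would write SAM records here.
        while (hit := aligned.get()) is not _DONE:
            results.append(hit)

    threads = [threading.Thread(target=stage) for stage in (reader, aligner, writer)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return results


# Toy "aligner" that just reports the read length.
print(run_pipeline(["ACGT", "GG", "TTTAC"], len))
```

With a single aligner stage the output order matches the input order; replicating that stage would require tagging reads with an index to restore order.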


Subjects
Computers , Sequence Alignment/instrumentation , Sequence Analysis, DNA/instrumentation , Software , Algorithms
10.
PLoS One ; 10(10): e0139868, 2015.
Article in English | MEDLINE | ID: mdl-26460497

ABSTRACT

The rapid popularity and adoption of next generation sequencing (NGS) approaches have generated huge volumes of data. High-throughput platforms like Illumina HiSeq produce terabytes of raw data that require quick processing. Quality control of the data is an important step prior to downstream analyses. To address these issues, we have developed a quality control pipeline, NGS-QCbox, that scales up to process hundreds or thousands of samples. Raspberry is an in-house tool, developed in the C language using HTSlib (v1.2.1) (http://htslib.org), for computing read- and base-level statistics. It can be used as a stand-alone application and can process both compressed and uncompressed FASTQ files. NGS-QCbox integrates Raspberry with other open-source tools for alignment (Bowtie2), SNP calling (SAMtools), and other utilities (bedtools) to analyze raw NGS data more efficiently and in a high-throughput manner. The pipeline processes batches of jobs in parallel using Bpipe (https://github.com/ssadedin/bpipe) and, internally, applies fine-grained task parallelization using OpenMP. It reports read and base statistics along with genome coverage and variants in a user-friendly format. The pipeline presents a simple menu-driven interface and can be used in either quick or complete mode. In addition, the pipeline in quick mode is faster than other similar existing QC pipelines and tools. The NGS-QCbox pipeline, the Raspberry tool, and associated scripts are available at https://github.com/CEG-ICRISAT/NGS-QCbox and https://github.com/CEG-ICRISAT/Raspberry for rapid quality control analysis of large-scale next generation sequencing (Illumina) data.


Subjects
High-Throughput Nucleotide Sequencing/instrumentation , Internet , Sequence Alignment/instrumentation , Sequence Alignment/methods , Software , High-Throughput Nucleotide Sequencing/methods
11.
Article in English | MEDLINE | ID: mdl-26451814

ABSTRACT

We introduce a parallel aligner with a work-flow organization for fast and accurate mapping of RNA sequences on servers equipped with multicore processors. Our software, HPG Aligner SA (an open-source application available at http://www.opencb.org), exploits a suffix array to rapidly map a large fraction of the RNA fragments (reads), and leverages the accuracy of the Smith-Waterman algorithm to deal with conflictive reads. The aligner is enhanced with a careful strategy to detect splice junctions based on an adaptive division of RNA reads into small segments (or seeds), which are then mapped onto a number of candidate alignment locations, providing crucial information for the successful alignment of the complete reads. The experimental results on a platform with Intel multicore technology report the parallel performance of HPG Aligner SA on RNA reads of 100-400 nucleotides, where it surpasses state-of-the-art aligners such as TopHat 2+Bowtie 2, MapSplice, and STAR in execution time/sensitivity.


Subjects
Chromosome Mapping/instrumentation , High-Throughput Nucleotide Sequencing/instrumentation , RNA/genetics , Sequence Analysis, RNA/instrumentation , Signal Processing, Computer-Assisted/instrumentation , Software , Base Sequence , Chromosome Mapping/methods , Equipment Design , Equipment Failure Analysis , High-Throughput Nucleotide Sequencing/methods , Molecular Sequence Data , Reproducibility of Results , Sensitivity and Specificity , Sequence Alignment/instrumentation , Sequence Alignment/methods , Sequence Analysis, RNA/methods
12.
Article in English | MEDLINE | ID: mdl-26451819

ABSTRACT

De novo clustering is a popular technique to perform taxonomic profiling of a microbial community by grouping 16S rRNA amplicon reads into operational taxonomic units (OTUs). In this work, we introduce a new dendrogram-based OTU clustering pipeline called CRiSPy. The key idea used in CRiSPy to improve clustering accuracy is the application of an anomaly detection technique to obtain a dynamic distance cutoff instead of using the de facto value of 97 percent sequence similarity as in most existing OTU clustering pipelines. This technique works by detecting an abrupt change in the merging heights of a dendrogram. To produce the output dendrograms, CRiSPy employs the OTU hierarchical clustering approach that is computed on a genetic distance matrix derived from an all-against-all read comparison by pairwise sequence alignment. However, most existing dendrogram-based tools have difficulty processing datasets larger than 10,000 unique reads due to high computational complexity. We address this difficulty by developing two efficient algorithms for CRiSPy: a compute-efficient GPU-accelerated parallel algorithm for pairwise distance matrix computation and a memory-efficient hierarchical clustering algorithm. Our experiments on various datasets with distinct attributes show that CRiSPy is able to produce more accurate OTU groupings than most OTU clustering applications.
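
The idea of replacing a fixed cutoff with one derived from an abrupt change in merging heights can be illustrated with a very simple heuristic: place the cut in the middle of the largest gap between consecutive merge heights. This Python sketch is a stand-in chosen for illustration, not CRiSPy's actual anomaly detection technique; the function names and the sample heights are assumptions.

```python
def dynamic_cutoff(merge_heights):
    """Return a cut height placed in the largest gap between sorted merge heights."""
    heights = sorted(merge_heights)
    if len(heights) < 2:
        raise ValueError("need at least two merge heights")
    best_gap, cutoff = -1.0, None
    for lower, upper in zip(heights, heights[1:]):
        gap = upper - lower
        if gap > best_gap:
            # Cut halfway across the most abrupt jump seen so far.
            best_gap, cutoff = gap, (lower + upper) / 2
    return cutoff


def count_clusters(merge_heights, cutoff):
    """Clusters left after undoing every merge that happened above the cutoff."""
    return sum(1 for h in merge_heights if h > cutoff) + 1


# Hypothetical merge heights (genetic distances) from a small dendrogram.
heights = [0.02, 0.01, 0.20, 0.03, 0.22]
cut = dynamic_cutoff(heights)
print(cut, count_clusters(heights, cut))
```

A fixed 97 percent similarity threshold would correspond to always cutting at a distance of 0.03, regardless of where the heights actually jump.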


Subjects
Algorithms , Computer Graphics/instrumentation , High-Throughput Nucleotide Sequencing/instrumentation , RNA, Bacterial/genetics , RNA, Ribosomal, 16S/genetics , Sequence Alignment/instrumentation , Base Sequence , Equipment Design , Equipment Failure Analysis , High-Throughput Nucleotide Sequencing/methods , Molecular Sequence Data , Pattern Recognition, Automated/methods , Sequence Alignment/methods , Signal Processing, Computer-Assisted/instrumentation
13.
Article in English | MEDLINE | ID: mdl-26356857

ABSTRACT

Comparing newly determined sequences against the subject sequences stored in databases is a critical task in bioinformatics. Fortunately, a recent survey reports that state-of-the-art aligners are already fast enough to handle very large numbers of short sequence reads in reasonable time. However, present aligners remain quite inefficient at aligning the long sequence reads (>400 bp) generated by next generation sequencing (NGS) technology. Furthermore, the challenge grows more serious as both the lengths and the numbers of sequence reads keep increasing with improvements in sequencing technology. Thus, researchers urgently need to enhance the performance of long read alignment. In this paper, we propose a novel FPGA-based system to improve the efficiency of long read mapping. Compared to the state-of-the-art long read aligner BWA-SW, our accelerating platform achieves high performance with almost the same sensitivity. Experiments demonstrate that, for reads with lengths ranging from 512 to 4,096 base pairs, the described system obtains a 10x-48x speedup for the bottleneck of the software. For the whole mapping procedure, the FPGA-based platform achieves a 1.8x-3.3x speedup versus the BWA-SW aligner, reducing the alignment cycles from weeks to days.


Subjects
Genomics/methods , High-Throughput Nucleotide Sequencing/methods , Sequence Alignment/methods , Sequence Analysis, DNA/methods , Algorithms , Equipment Design , Genomics/instrumentation , High-Throughput Nucleotide Sequencing/instrumentation , Humans , Sequence Alignment/instrumentation , Sequence Analysis, DNA/instrumentation
14.
Viruses ; 5(3): 824-33, 2013 Mar 12.
Article in English | MEDLINE | ID: mdl-23482300

ABSTRACT

While PCR amplicons extend to a few thousand bases, the length of sequences from direct Sanger sequencing is limited to 500-800 nucleotides. Therefore, several fragments may be required to cover an amplicon, a gene or an entire genome. These fragments are typically sequenced in an overlapping fashion and assembled by manually sliding and aligning the sequences visually. This is time-consuming, repetitive and error-prone, and further complicated by circular genomes. An online tool merging two to twelve long overlapping sequence fragments was developed. Either chromatograms or FASTA files are submitted to the tool, which trims poor quality ends of chromatograms according to user-specified parameters. Fragments are assembled into a single sequence by repeatedly calling the EMBOSS merger tool in a consecutive manner. Output includes the number of trimmed nucleotides, details of each merge, and an optional alignment to a reference sequence. The final merge sequence is displayed and can be downloaded in FASTA format. All output files can be downloaded as a ZIP archive. This tool allows for easy and automated assembly of overlapping sequences and is aimed at researchers without specialist computer skills. The tool is genome- and organism-agnostic and has been developed using hepatitis B virus sequence data.
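
The consecutive merging of overlapping fragments can be sketched in Python as follows. The tool described above calls the EMBOSS merger program, which aligns the fragments; this sketch swaps in a much simpler exact suffix-prefix overlap so that it stays self-contained. The function names and the `min_overlap` parameter are assumptions for illustration only.

```python
def merge_pair(left, right, min_overlap=3):
    """Join two fragments on their longest exact suffix-prefix overlap."""
    for k in range(min(len(left), len(right)), min_overlap - 1, -1):
        if left[-k:] == right[:k]:
            # Keep the shared region once and append the rest of the right fragment.
            return left + right[k:]
    raise ValueError("no overlap of at least %d bases" % min_overlap)


def merge_all(fragments, min_overlap=3):
    """Merge fragments consecutively, in the order given, into one sequence."""
    merged = fragments[0]
    for fragment in fragments[1:]:
        merged = merge_pair(merged, fragment, min_overlap)
    return merged


print(merge_all(["ACGTAC", "TACGGA", "GGATT"]))
```

Real Sanger reads contain base-calling errors near their ends, which is why an alignment-based merge with quality trimming is used in practice rather than exact matching.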


Subjects
Hepatitis B virus/genetics , Online Systems/instrumentation , Sequence Alignment/instrumentation , Sequence Alignment/methods , Sequence Analysis, DNA/methods , Databases, Nucleic Acid , Hepatitis B virus/chemistry , Internet , Sequence Analysis, DNA/instrumentation , Software
15.
Bioinformatics ; 29(10): 1341-2, 2013 May 15.
Article in English | MEDLINE | ID: mdl-23505295

ABSTRACT

MOTIVATION: Large multiple genome alignments and inferred ancestral genomes are ideal resources for comparative studies of molecular evolution, and advances in sequencing and computing technology are making them increasingly obtainable. These structures can provide a rich understanding of the genetic relationships between all subsets of species they contain. Current formats for storing genomic alignments, such as XMFA and MAF, are all indexed or ordered using a single reference genome, however, which limits the information that can be queried with respect to other species and clades. This loss of information grows with the number of species under comparison, as well as their phylogenetic distance. RESULTS: We present HAL, a compressed, graph-based hierarchical alignment format for storing multiple genome alignments and ancestral reconstructions. HAL graphs are indexed on all genomes they contain. Furthermore, they are organized phylogenetically, which allows for modular and parallel access to arbitrary subclades without fragmentation because of rearrangements that have occurred in other lineages. HAL graphs can be created or read with a comprehensive C++ API. A set of tools is also provided to perform basic operations, such as importing and exporting data, identifying mutations and coordinate mapping (liftover). AVAILABILITY: All documentation and source code for the HAL API and tools are freely available at http://github.com/glennhickey/hal. CONTACT: hickey@soe.ucsc.edu or haussler@soe.ucsc.edu SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.


Subjects
Genome , Sequence Alignment/methods , Software , Animals , Base Sequence , Evolution, Molecular , Genomics/methods , Humans , Phylogeny , Programming Languages , Sequence Alignment/instrumentation
16.
RNA ; 19(1): 63-73, 2013 Jan.
Article in English | MEDLINE | ID: mdl-23188808

ABSTRACT

Chemical probing of RNA and DNA structure is a widely used and highly informative approach for examining nucleic acid structure and for evaluating interactions with protein and small-molecule ligands. Use of capillary electrophoresis to analyze chemical probing experiments yields hundreds of nucleotides of information per experiment and can be performed on automated instruments. Extraction of the information from capillary electrophoresis electropherograms is a computationally intensive multistep analytical process, and no current software provides rapid, automated, and accurate data analysis. To overcome this bottleneck, we developed a platform-independent, user-friendly software package, QuShape, that yields quantitatively accurate nucleotide reactivity information with minimal user supervision. QuShape incorporates newly developed algorithms for signal decay correction, alignment of time-varying signals within and across capillaries and relative to the RNA nucleotide sequence, and signal scaling across channels or experiments. An analysis-by-reference option enables multiple, related experiments to be fully analyzed in minutes. We illustrate the usefulness and robustness of QuShape by analysis of RNA SHAPE (selective 2'-hydroxyl acylation analyzed by primer extension) experiments.


Subjects
Electrophoresis, Capillary/methods , Nucleic Acid Probes/analysis , Software , Algorithms , DNA, Bacterial/analysis , DNA, Viral/analysis , Electrophoresis, Capillary/instrumentation , Escherichia coli/genetics , Humans , RNA, Bacterial/analysis , RNA, Viral/analysis , Sequence Alignment/instrumentation , Sequence Alignment/methods
17.
Viruses ; 4(8): 1318-27, 2012 08.
Article in English | MEDLINE | ID: mdl-23012628

ABSTRACT

PAirwise Sequence Comparison (PASC) is a tool that uses genome sequence similarity to help with virus classification. The PASC tool at NCBI uses two methods: local alignment based on BLAST and global alignment based on the Needleman-Wunsch algorithm. It works for complete genomes of viruses of several families/groups, and for the family Filoviridae, it currently includes 52 complete genomes available in GenBank. It has been shown that the BLAST-based alignment approach works better for filoviruses, and it is therefore recommended for establishing taxon demarcation criteria. When more genome sequences with high divergence become available, these demarcations will most likely become more precise. The tool can compare new genome sequences of filoviruses with the ones already in the database and propose their taxonomic classification.
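
For readers unfamiliar with the global alignment method named above, here is a minimal Python sketch of the Needleman-Wunsch dynamic program that returns only the optimal score. The scoring values (match +1, mismatch -1, linear gap -1) are assumptions chosen for illustration and are not PASC's parameters; the function name is likewise hypothetical.

```python
def needleman_wunsch_score(a, b, match=1, mismatch=-1, gap=-1):
    """Return the optimal global alignment score of a and b (linear gap cost)."""
    # prev[j] holds the best score for aligning the processed prefix of a with b[:j].
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap]  # aligning a[:i] against an empty prefix costs i gaps
        for j in range(1, len(b) + 1):
            diagonal = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            up = prev[j] + gap        # gap in b
            left = curr[j - 1] + gap  # gap in a
            curr.append(max(diagonal, up, left))
        prev = curr
    return prev[-1]


print(needleman_wunsch_score("ACGT", "AGT"))
```

Keeping only two rows gives the score in memory linear in the length of `b`; recovering the alignment itself, and from it a pairwise identity, would require the full matrix or a divide-and-conquer traceback.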


Subjects
Filoviridae/classification , Sequence Alignment/methods , Base Sequence , Databases, Nucleic Acid , Filoviridae/chemistry , Filoviridae/genetics , Molecular Sequence Data , Sequence Alignment/instrumentation , Sequence Analysis, DNA , Software
18.
PLoS One ; 7(8): e41948, 2012.
Article in English | MEDLINE | ID: mdl-22870267

ABSTRACT

In recent studies, exome sequencing has proven to be a successful screening tool for the identification of candidate genes causing rare genetic diseases. Although underlying targeted sequencing methods are well established, necessary data handling and focused, structured analysis still remain demanding tasks. Here, we present a cloud-enabled autonomous analysis pipeline, which comprises the complete exome analysis workflow. The pipeline combines several in-house developed and published applications to perform the following steps: (a) initial quality control, (b) intelligent data filtering and pre-processing, (c) sequence alignment to a reference genome, (d) SNP and DIP detection, (e) functional annotation of variants using different approaches, and (f) detailed report generation during various stages of the workflow. The pipeline connects the selected analysis steps, exposes all available parameters for customized usage, performs required data handling, and distributes computationally expensive tasks either on a dedicated high-performance computing infrastructure or on the Amazon cloud environment (EC2). The presented application has already been used in several research projects including studies to elucidate the role of rare genetic diseases. The pipeline is continuously tested and is publicly available under the GPL as a VirtualBox or Cloud image at http://simplex.i-med.ac.at; additional supplementary data is provided at http://www.icbi.at/exome.


Subjects
Exome , Internet , Polymorphism, Single Nucleotide , Sequence Alignment/methods , Sequence Analysis, DNA/methods , Software , Sequence Alignment/instrumentation , Sequence Analysis, DNA/instrumentation
19.
BMC Bioinformatics ; 13 Suppl 5: S3, 2012 Apr 12.
Article in English | MEDLINE | ID: mdl-22537007

ABSTRACT

BACKGROUND: Pairwise statistical significance has been recognized as able to accurately identify related sequences, which is a cornerstone procedure in numerous bioinformatics applications. However, it is both computationally and data intensive, which poses a big challenge in terms of performance and scalability. RESULTS: We present a GPU implementation to accelerate pairwise statistical significance estimation of local sequence alignment using standard substitution matrices. By carefully studying the algorithm's data access characteristics, we developed a tile-based scheme that produces contiguous data access in the GPU global memory and sustains a large number of threads to achieve high GPU occupancy. We further extend the parallelization technique to estimate pairwise statistical significance using position-specific substitution matrices, which has earlier demonstrated significantly better sequence comparison accuracy than using standard substitution matrices. The implementation is also extended to take advantage of dual GPUs. We observe end-to-end speedups of nearly 250× (370×) using a single Tesla C2050 GPU (dual Tesla C2050 GPUs) over the CPU implementation running on an Intel Core i7 920 processor. CONCLUSIONS: Harvesting the high performance of modern GPUs is a promising approach to accelerate pairwise statistical significance estimation for local sequence alignment.


Subjects
Computer Graphics/instrumentation , Sequence Alignment/methods , Sequence Analysis, Protein/methods , Algorithms , Sequence Alignment/instrumentation , Sequence Analysis, Protein/instrumentation , Software
20.
Curr Protoc Mol Biol ; Chapter 19: Unit 19.4., 2012 Jan.
Article in English | MEDLINE | ID: mdl-22237858

ABSTRACT

Protein databases have become a crucial part of modern biology. Huge amounts of data for protein structures, functions, and particularly sequences are being generated. Searching databases is often the first step in the study of a new protein. Comparison between proteins or between protein families provides information about the relationship between proteins within a genome or across different species, and hence offers much more information than can be obtained by studying only an isolated protein. In addition, secondary databases derived from experimental databases are also widely available. These databases reorganize and annotate the data or provide predictions. The use of multiple databases often helps researchers understand the structure and function of a protein. Although some protein databases are widely known, they are far from being fully utilized in the protein science community. This unit provides a starting point for readers to explore the potential of protein databases on the Internet.


Subjects
Computational Biology/methods , Databases, Protein , Proteins/chemistry , Sequence Alignment/methods , Amino Acid Sequence , Animals , Computational Biology/instrumentation , Humans , Internet , Proteins/genetics , Sequence Alignment/instrumentation