1.
Nat Biotechnol ; 40(5): 672-680, 2022 05.
Article in English | MEDLINE | ID: mdl-35132260

ABSTRACT

The repetitive nature and complexity of some medically relevant genes poses a challenge for their accurate analysis in a clinical setting. The Genome in a Bottle Consortium has provided variant benchmark sets, but these exclude nearly 400 medically relevant genes due to their repetitiveness or polymorphic complexity. Here, we characterize 273 of these 395 challenging autosomal genes using a haplotype-resolved whole-genome assembly. This curated benchmark reports over 17,000 single-nucleotide variations, 3,600 insertions and deletions and 200 structural variations each for human genome reference GRCh37 and GRCh38 across HG002. We show that false duplications in either GRCh37 or GRCh38 result in reference-specific, missed variants for short- and long-read technologies in medically relevant genes, including CBS, CRYAA and KCNE1. When masking these false duplications, variant recall can improve from 8% to 100%. Forming benchmarks from a haplotype-resolved whole-genome assembly may become a prototype for future benchmarks covering the whole genome.


Subjects
Human Genome, Human Genome/genetics, Haplotypes/genetics, Humans, DNA Sequence Analysis
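
The recall figures quoted in the abstract above (8% rising to 100% once false duplications are masked) follow from the usual definition of recall as true positives over all benchmark variants. A minimal sketch of that arithmetic follows; the variant counts are hypothetical and chosen only to reproduce the quoted percentages, not taken from the paper.

```python
def recall(true_positives: int, false_negatives: int) -> float:
    """Fraction of benchmark variants that a caller recovered."""
    total = true_positives + false_negatives
    if total == 0:
        raise ValueError("benchmark contains no variants")
    return true_positives / total


# Hypothetical counts for one gene with 25 benchmark variants.
before_masking = recall(true_positives=2, false_negatives=23)
after_masking = recall(true_positives=25, false_negatives=0)
print(f"{before_masking:.0%} -> {after_masking:.0%}")
```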
2.
Genome Biol ; 23(1): 12, 2022 01 07.
Article in English | MEDLINE | ID: mdl-34996510

ABSTRACT

BACKGROUND: Accurate detection of somatic mutations is challenging but critical in understanding cancer formation, progression, and treatment. We recently proposed NeuSomatic, the first deep convolutional neural network-based somatic mutation detection approach, and demonstrated performance advantages on in silico data. RESULTS: In this study, we use the first comprehensive and well-characterized somatic reference data sets from the SEQC2 consortium to investigate best practices for using a deep learning framework in cancer mutation detection. Using the high-confidence somatic mutations established for a cancer cell line by the consortium, we identify the best strategy for building robust models on multiple data sets derived from samples representing real scenarios; for example, a model trained on a combination of real and spike-in mutations had the highest average performance. CONCLUSIONS: The strategy identified in our study achieved high robustness across multiple sequencing technologies for fresh and FFPE DNA input, varying tumor/normal purities, and different coverages, with significant superiority over conventional detection approaches in general, as well as in challenging situations such as low coverage, low variant allele frequency, DNA damage, and difficult genomic regions.


Subjects
Deep Learning, Neoplasms, Genomics, Humans, Mutation, Neoplasms/genetics, Computer Neural Networks
3.
Genome Biol ; 23(1): 2, 2022 01 03.
Article in English | MEDLINE | ID: mdl-34980216

ABSTRACT

BACKGROUND: Reproducible detection of inherited variants with whole genome sequencing (WGS) is vital for the implementation of precision medicine and is a complicated process in which each step affects variant call quality. Systematically assessing reproducibility of inherited variants with WGS and impact of each step in the process is needed for understanding and improving quality of inherited variants from WGS. RESULTS: To dissect the impact of factors involved in detection of inherited variants with WGS, we sequence triplicates of eight DNA samples representing two populations on three short-read sequencing platforms using three library kits in six labs and call variants with 56 combinations of aligners and callers. We find that bioinformatics pipelines (callers and aligners) have a larger impact on variant reproducibility than WGS platform or library preparation. Single-nucleotide variants (SNVs), particularly outside difficult-to-map regions, are more reproducible than small insertions and deletions (indels), which are least reproducible when > 5 bp. Increasing sequencing coverage improves indel reproducibility but has limited impact on SNVs above 30×. CONCLUSIONS: Our findings highlight sources of variability in variant detection and the need for improvement of bioinformatics pipelines in the era of precision medicine with WGS.


Subjects
Human Genome, Single Nucleotide Polymorphism, High-Throughput Nucleotide Sequencing, Humans, INDEL Mutation, Reproducibility of Results, Whole Genome Sequencing
4.
Genome Biol ; 22(1): 347, 2021 12 20.
Article in English | MEDLINE | ID: mdl-34930391

ABSTRACT

BACKGROUND: Genomic structural variations (SV) are important determinants of genotypic and phenotypic changes in many organisms. However, the detection of SV from next-generation sequencing data remains challenging. RESULTS: In this study, DNA from a Chinese family quartet is sequenced at three different sequencing centers in triplicate. A total of 288 derivative data sets are generated utilizing different analysis pipelines and compared to identify sources of analytical variability. Mapping methods provide the major contribution to variability, followed by sequencing centers and replicates. Interestingly, SV supported by only one center or replicate often represent true positives with 47.02% and 45.44% overlapping the long-read SV call set, respectively. This is consistent with an overall higher false negative rate for SV calling in centers and replicates compared to mappers (15.72%). Finally, we observe that the SV calling variability also persists in a genotyping approach, indicating the impact of the underlying sequencing and preparation approaches. CONCLUSIONS: This study provides the first detailed insights into the sources of variability in SV identification from next-generation sequencing and highlights remaining challenges in SV calling for large cohorts. We further give recommendations on how to reduce SV calling variability and the choice of alignment methodology.


Subjects
Genomic Structural Variation, Genomics/methods, Germ Cells, High-Throughput Nucleotide Sequencing/methods, Base Sequence, Bias, Chromosome Mapping, DNA Sequence Analysis
5.
Nat Biotechnol ; 39(9): 1151-1160, 2021 09.
Article in English | MEDLINE | ID: mdl-34504347

ABSTRACT

The lack of samples for generating standardized DNA datasets for setting up a sequencing pipeline or benchmarking the performance of different algorithms limits the implementation and uptake of cancer genomics. Here, we describe reference call sets obtained from paired tumor-normal genomic DNA (gDNA) samples derived from a breast cancer cell line (which is highly heterogeneous, with an aneuploid genome, and enriched in somatic alterations) and a matched lymphoblastoid cell line. We partially validated both somatic mutations and germline variants in these call sets via whole-exome sequencing (WES) with different sequencing platforms and targeted sequencing with >2,000-fold coverage, spanning 82% of genomic regions with high confidence. Although the gDNA reference samples are not representative of primary cancer cells from a clinical sample, when setting up a sequencing pipeline, they not only minimize potential biases from technologies, assays and informatics but also provide a unique resource for benchmarking 'tumor-only' or 'matched tumor-normal' analyses.


Subjects
Benchmarking, Breast Neoplasms/genetics, DNA Mutational Analysis/standards, High-Throughput Nucleotide Sequencing/standards, Whole Genome Sequencing/standards, Tumor Cell Line, Datasets as Topic, Germ Cells, Humans, Mutation, Reference Standards, Reproducibility of Results
6.
Genome Biol ; 22(1): 111, 2021 04 16.
Article in English | MEDLINE | ID: mdl-33863366

ABSTRACT

BACKGROUND: Oncopanel genomic testing, which identifies important somatic variants, is increasingly common in medical practice and especially in clinical trials. Currently, there is a paucity of reliable genomic reference samples having a suitably large number of pre-identified variants for properly assessing oncopanel assay analytical quality and performance. The FDA-led Sequencing and Quality Control Phase 2 (SEQC2) consortium analyzes ten diverse cancer cell lines individually and their pool, termed Sample A, to develop a reference sample with suitably large numbers of coding positions with known (variant) positives and negatives for properly evaluating oncopanel analytical performance. RESULTS: In reference Sample A, we identify more than 40,000 variants down to 1% allele frequency, with more than 25,000 variants having less than 20% allele frequency and with 1653 variants in COSMIC-related genes. This is 5-100× more than existing commercially available samples. We also identify an unprecedented number of negative positions in coding regions, allowing statistical rigor in assessing limit-of-detection, sensitivity, and precision. Over 300 loci are randomly selected and independently verified via droplet digital PCR with 100% concordance. Agilent normal reference Sample B can be admixed with Sample A to create new samples with a similar number of known variants at much lower allele frequency than what exists in Sample A natively, including known variants having allele frequency of 0.02%, a range suitable for assessing liquid biopsy panels. CONCLUSION: These new reference samples and their admixtures provide superior capability for performing oncopanel quality control, analytical accuracy, and validation for small to large oncopanels and liquid biopsy assays.


Subjects
Alleles, Tumor Biomarkers, Gene Frequency, Genetic Testing/methods, Genetic Variation, Genomics/methods, Neoplasms/genetics, Tumor Cell Line, DNA Copy Number Variations, Genetic Heterogeneity, Genetic Testing/standards, Genomics/standards, Humans, Neoplasms/diagnosis, Workflow
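
The admixture idea in the abstract above (diluting Sample A with a normal sample to reach allele frequencies as low as 0.02%) can be illustrated with simple dilution arithmetic. This is only a rough sketch under stated assumptions: the variant is taken to be absent from the diluent, both samples are assumed to contribute DNA in proportion to their mixing fraction, and copy-number differences are ignored. The function name and the 2% mixing fraction are hypothetical and not taken from the paper.

```python
def admixed_vaf(vaf_in_a: float, fraction_a: float) -> float:
    """Expected variant allele frequency after mixing Sample A into a
    variant-free background.

    vaf_in_a   -- allele frequency of the variant in undiluted Sample A
    fraction_a -- fraction of the mixture's DNA that comes from Sample A
    """
    if not 0.0 <= vaf_in_a <= 1.0 or not 0.0 <= fraction_a <= 1.0:
        raise ValueError("inputs must lie in [0, 1]")
    return vaf_in_a * fraction_a


# A variant at 1% in Sample A, mixed so that Sample A is 2% of the DNA.
diluted = admixed_vaf(vaf_in_a=0.01, fraction_a=0.02)
print(f"{diluted:.4%}")
```

Under these assumptions a 1% variant diluted fifty-fold lands at 0.02%, the liquid-biopsy range mentioned in the abstract.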
7.
Nat Commun ; 10(1): 1041, 2019 03 04.
Article in English | MEDLINE | ID: mdl-30833567

ABSTRACT

Accurate detection of somatic mutations is still a challenge in cancer analysis. Here we present NeuSomatic, the first convolutional neural network approach for somatic mutation detection, which significantly outperforms previous methods on different sequencing platforms, sequencing strategies, and tumor purities. NeuSomatic summarizes sequence alignments into small matrices and incorporates more than a hundred features to capture mutation signals effectively. It can be used universally as a stand-alone somatic mutation detection method or with an ensemble of existing methods to achieve the highest accuracy.


Subjects
Computational Biology/methods, DNA Mutational Analysis/methods, Machine Learning, Mutation, Computer Neural Networks, Computational Biology/instrumentation, DNA Mutational Analysis/instrumentation, Genetic Databases, Diploidy, Exome, Neoplasm Genes, Humans, Neoplasms/genetics, Sequence Alignment, DNA Sequence Analysis/instrumentation, DNA Sequence Analysis/methods
8.
Nat Commun ; 8(1): 59, 2017 07 05.
Article in English | MEDLINE | ID: mdl-28680106

ABSTRACT

RNA-sequencing (RNA-seq) is an essential technique for transcriptome studies, and hundreds of analysis tools have been developed since its debut. Although recent efforts have attempted to assess the latest available tools, they have not evaluated the analysis workflows comprehensively to unleash the power within RNA-seq. Here we conduct an extensive study analysing a broad spectrum of RNA-seq workflows. Surpassing the expression analysis scope, our work also includes assessment of RNA variant-calling, RNA editing and RNA fusion detection techniques. Specifically, we examine both short- and long-read RNA-seq technologies, 39 analysis tools resulting in ~120 combinations, and ~490 analyses involving 15 samples with a variety of germline, cancer and stem cell data sets. We report the performance and propose a comprehensive RNA-seq analysis protocol, named RNACocktail, along with a computational pipeline achieving high accuracy. Validation on different samples reveals that our proposed protocol could help researchers extract more biologically relevant predictions by broad analysis of the transcriptome. RNA-seq is widely used for transcriptome analysis. Here, the authors analyse a wide spectrum of RNA-seq workflows and present a comprehensive analysis protocol named RNACocktail as well as a computational pipeline leveraging the widely used tools for accurate RNA-seq analysis.


Subjects
Embryonic Stem Cells, Transcriptome, Base Sequence, Cell Line, Humans
9.
Methods Mol Biol ; 1079: 203-10, 2014.
Article in English | MEDLINE | ID: mdl-24170404

ABSTRACT

PicXAA is a probabilistic nonprogressive alignment algorithm that finds protein (or DNA) multiple sequence alignments with maximum expected accuracy. PicXAA greedily builds up the alignment from sequence regions with high local similarity, thereby yielding an accurate global alignment that effectively captures the local similarities across sequences. PicXAA consistently yields accurate alignment results on a wide range of reference sets that have different characteristics, with especially remarkable improvements over other leading algorithms on sequence sets with high local similarities. In this chapter, we describe the overall alignment strategy used in PicXAA and discuss several important considerations for effective deployment of the algorithm.


Subjects
Algorithms, Computational Biology/methods, Sequence Alignment/methods, DNA/genetics, Probability, Proteins/chemistry, Quality Control
10.
BMC Genomics ; 14: 448, 2013 Jul 05.
Article in English | MEDLINE | ID: mdl-23829350

ABSTRACT

BACKGROUND: Rapid acquisition of accurate genotyping information is essential for all genetic marker-based studies. For species with relatively small genomes, complete genome resequencing is a feasible approach for genotyping; however, for species with large and highly repetitive genomes, the acquisition of whole genome sequences for the purpose of genotyping is still relatively inefficient and too expensive to be carried out on a high-throughput basis. Sorghum bicolor is a C4 grass with a sequenced genome size of ~730 Mb, of which ~80% is highly repetitive. We have developed a restriction enzyme targeted genome resequencing method for genetic analysis, termed Digital Genotyping (DG), to be applied to sorghum and other grass species with large repeat-rich genomes. RESULTS: DG templates are generated using one of three methylation sensitive restriction enzymes that recognize a nested set of 4, 6 or 8 bp GC-rich sequences, enabling varying depth of analysis and integration of results among assays. Variation in sequencing efficiency among DG markers was correlated with template GC-content and length. The expected DG allele sequence was obtained 97.3% of the time with a ratio of expected to alternative allele sequence acquisition of >20:1. A genetic map aligned to the sorghum genome sequence with an average resolution of 1.47 cM was constructed using 1,772 DG markers from 137 recombinant inbred lines. The DG map enhanced the detection of QTL for variation in plant height and precisely aligned QTL such as Dw3 to underlying genes/alleles. Higher-resolution NgoMIV-based DG haplotypes were used to trace the origin of DNA on SBI-06, spanning Ma1 and Dw2 from progenitors to BTx623 and IS3620C. DG marker analysis identified the correct location of two misassembled regions and located seven super contigs in the sorghum reference genome sequence. CONCLUSION: DG technology provides a cost-effective approach to rapidly generate accurate genotyping data in sorghum.
Currently, data derived from DG are used for many marker-based analyses, including marker-assisted breeding, pedigree and QTL analysis, genetic map construction, map-based gene cloning and association studies. DG in combination with whole genome resequencing is dramatically accelerating all aspects of genetic analysis of sorghum, an important genetic reference for C4 grass species.


Subjects
Plant Genome, Genotyping Techniques/methods, Sorghum/genetics, DNA Restriction Enzymes, Plant DNA/genetics, Genetic Markers, Genotype, Quantitative Trait Loci, DNA Sequence Analysis/methods
11.
PLoS One ; 8(7): e67995, 2013.
Article in English | MEDLINE | ID: mdl-23874484

ABSTRACT

In this paper we introduce an efficient algorithm for alignment of multiple large-scale biological networks. In this scheme, we first compute a probabilistic similarity measure between nodes that belong to different networks using a semi-Markov random walk model. The estimated probabilities are further enhanced by incorporating the local and the cross-species network similarity information through the use of two different types of probabilistic consistency transformations. The transformed alignment probabilities are used to predict the alignment of multiple networks based on a greedy approach. We demonstrate that the proposed algorithm, called SMETANA, outperforms many state-of-the-art network alignment techniques in terms of computational efficiency, alignment accuracy, and scalability. Our experiments show that SMETANA can easily align tens of genome-scale networks with thousands of nodes on a personal computer. The source code of SMETANA can be downloaded from http://www.ece.tamu.edu/~bjyoon/SMETANA/.


Subjects
Algorithms, Computational Biology/methods, Probability, Protein Interaction Maps, Conserved Sequence, Protein Databases, Amino Acid Sequence Homology, Time Factors
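
The final step described in the abstract above, predicting an alignment from node-pair probabilities "based on a greedy approach", can be illustrated with a deliberately simplified sketch. This is not the SMETANA implementation: it handles only two networks, enforces a one-to-one mapping, and takes the pairwise scores as given rather than deriving them from a semi-Markov random walk. The function name and the toy scores are hypothetical.

```python
def greedy_match(scores: dict[tuple[str, str], float]) -> list[tuple[str, str]]:
    """Greedily pick node pairs in descending score order, skipping any
    pair whose node on either side has already been matched."""
    matched_u: set[str] = set()
    matched_v: set[str] = set()
    alignment: list[tuple[str, str]] = []
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    for (u, v), _score in ranked:
        if u not in matched_u and v not in matched_v:
            alignment.append((u, v))
            matched_u.add(u)
            matched_v.add(v)
    return alignment


# Toy alignment probabilities between nodes of network A and network B.
scores = {
    ("a1", "b1"): 0.9,
    ("a1", "b2"): 0.6,
    ("a2", "b1"): 0.7,
    ("a2", "b2"): 0.2,
}
pairs = greedy_match(scores)
print(pairs)
```

A greedy pass like this is fast but not guaranteed to maximize the total score; that trade-off is one reason the quality of the underlying probabilities matters.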
12.
G3 (Bethesda) ; 3(5): 783-93, 2013 May 20.
Article in English | MEDLINE | ID: mdl-23704283

ABSTRACT

To facilitate the mapping of genes in sorghum [Sorghum bicolor (L.) Moench] underlying economically important traits, we analyzed the genetic structure and linkage disequilibrium in a sorghum mini core collection of 242 landraces with 13,390 single-nucleotide polymorphisms. The single-nucleotide polymorphisms were produced using a highly multiplexed genotyping-by-sequencing methodology. Genetic structure was established using principal component, Neighbor-Joining phylogenetic, and Bayesian cluster analyses. These analyses indicated that the mini-core collection was structured along both geographic origin and sorghum race classification. Examples of the former were accessions from Southern Africa, East Asia, and Yemen. Examples of the latter were caudatums with widespread geographical distribution, durras from India, and guineas from West Africa. Race bicolor, the most primitive and the least clearly defined sorghum race, clustered among other races and formed only one clear bicolor-centric cluster. Genome-wide linkage disequilibrium analyses showed linkage disequilibrium decayed, on average, within 10-30 kb, whereas the short arm of SBI-06 contained a linkage disequilibrium block of 20.33 Mb, confirming a previous report of low recombination on this chromosome arm. Four smaller but equally significant linkage disequilibrium blocks of 3.5-35.5 kb were detected on chromosomes 1, 2, 9, and 10. We examined the genes encoded within each block to provide a first look at candidates such as homologs of GS3 and FT that may indicate a selective sweep during sorghum domestication.


Subjects
Genetic Variation, Linkage Disequilibrium/genetics, Biological Models, Sorghum/genetics, Carbon/metabolism, Plant Chromosomes/genetics, Ecotype, Euchromatin/metabolism, Plant Genes/genetics, Genotyping Techniques, Heterochromatin/metabolism, Phylogeny, Population Dynamics, Principal Component Analysis
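
For readers unfamiliar with how linkage disequilibrium between two biallelic markers is quantified, the sketch below computes the squared correlation r², a commonly used measure. The abstract above does not state which statistic the authors used, so treating r² as the measure is an assumption, and the haplotype frequencies are toy values.

```python
def r_squared(p_ab: float, p_a: float, p_b: float) -> float:
    """Squared correlation between alleles A and B at two loci.

    p_ab -- frequency of haplotypes carrying both A and B
    p_a  -- frequency of allele A at the first locus
    p_b  -- frequency of allele B at the second locus
    """
    d = p_ab - p_a * p_b  # disequilibrium coefficient D
    denominator = p_a * (1.0 - p_a) * p_b * (1.0 - p_b)
    if denominator == 0.0:
        raise ValueError("both loci must be polymorphic")
    return d * d / denominator


# Alleles always inherited together versus inherited independently.
full_ld = r_squared(p_ab=0.5, p_a=0.5, p_b=0.5)
no_ld = r_squared(p_ab=0.25, p_a=0.5, p_b=0.5)
print(full_ld, no_ld)
```

Plotting such a statistic against the physical distance between marker pairs gives the kind of decay curve summarized in the abstract as "within 10-30 kb".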
13.
PLoS One ; 7(8): e41474, 2012.
Article in English | MEDLINE | ID: mdl-22912671

ABSTRACT

In this work, we introduce a novel network synthesis model that can generate families of evolutionarily related synthetic protein-protein interaction (PPI) networks. Given an ancestral network, the proposed model generates the network family according to a hypothetical phylogenetic tree, where the descendant networks are obtained through duplication and divergence of their ancestors, followed by network growth using network evolution models. We demonstrate that this network synthesis model can effectively create synthetic networks whose internal and cross-network properties closely resemble those of real PPI networks. The proposed model can serve as an effective framework for generating comprehensive benchmark datasets that can be used for reliable performance assessment of comparative network analysis algorithms. Using this model, we constructed a large-scale network alignment benchmark, called NAPAbench, and evaluated the performance of several representative network alignment algorithms. Our analysis clearly shows the relative performance of the leading network algorithms, with their respective advantages and disadvantages. The algorithm and source code of the network synthesis model and the network alignment benchmark NAPAbench are publicly available at http://www.ece.tamu.edu/bjyoon/NAPAbench/.


Subjects
Computational Biology/methods, Statistical Models, Protein Interaction Maps, Algorithms, Animals, Benchmarking, Molecular Evolution, Humans, Mice, Phylogeny, Amino Acid Sequence Homology
14.
Bioinformatics ; 28(16): 2129-36, 2012 Aug 15.
Article in English | MEDLINE | ID: mdl-22730436

ABSTRACT

MOTIVATION: Recent technological advances in measuring molecular interactions have resulted in an increasing number of large-scale biological networks. Translation of these enormous network data into meaningful biological insights requires efficient computational techniques that can unearth the biological information that is encoded in the networks. One such example is network querying, which aims to identify subnetwork regions in a large target network that are similar to a given query network. Network querying tools can be used to identify novel biological pathways that are homologous to known pathways, thereby enabling knowledge transfer across different organisms. RESULTS: In this article, we introduce an efficient algorithm for querying large-scale biological networks, called RESQUE. The proposed algorithm adopts a semi-Markov random walk (SMRW) model to probabilistically estimate the correspondence scores between nodes that belong to different networks. The target network is iteratively reduced based on the estimated correspondence scores, which are also iteratively re-estimated to improve accuracy until the best matching subnetwork emerges. We demonstrate that the proposed network querying scheme is computationally efficient, can handle any network query with an arbitrary topology and yields accurate querying results. AVAILABILITY: The source code of RESQUE is freely available at http://www.ece.tamu.edu/~bjyoon/RESQUE/


Subjects
Algorithms, Computational Biology/methods, Protein Interaction Mapping/methods, Software, Animals, Drosophila melanogaster, Gene Regulatory Networks, Humans, Markov Chains, Metabolic Networks and Pathways, Saccharomyces cerevisiae
15.
BMC Bioinformatics ; 12 Suppl 10: S6, 2011 Oct 18.
Article in English | MEDLINE | ID: mdl-22165903

ABSTRACT

BACKGROUND: Comparative network analysis aims to identify common subnetworks in biological networks. It can facilitate the prediction of conserved functional modules across different species and provide deep insights into their underlying regulatory mechanisms. Recently, it has been shown that hidden Markov models (HMMs) can provide a flexible and computationally efficient framework for modeling and comparing biological networks. RESULTS: In this work, we show that using global correspondence scores between molecules can improve the accuracy of the HMM-based network alignment results. The global correspondence scores are computed by performing a semi-Markov random walk on the networks to be compared. The resulting score naturally integrates the sequence similarity between molecules and the topological similarity between their molecular interactions, thereby providing a more effective measure for estimating the functional similarity between molecules. By incorporating the global correspondence scores, instead of relying on sequence similarity or functional annotation scores used by previous approaches, our HMM-based network alignment method can identify conserved subnetworks that are functionally more coherent. CONCLUSIONS: Performance analysis based on synthetic and microbial networks demonstrates that the proposed network alignment strategy significantly improves the robustness and specificity of the predicted alignment results, in terms of conserved functional similarity measured based on KEGG ortholog (KO) groups. These results clearly show that the HMM-based network alignment framework using global correspondence scores can effectively find conserved biological pathways and has the potential to be used for automatic functional annotation of biomolecules.


Subjects
Algorithms, Bacteria/metabolism, Markov Chains, Bacterial Proteins/metabolism, Biological Models, Protein Interaction Maps, Sequence Alignment
16.
Nucleic Acids Res ; 39(Web Server issue): W8-12, 2011 Jul.
Article in English | MEDLINE | ID: mdl-21515632

ABSTRACT

In this article, we introduce PicXAA-Web, a web-based platform for accurate probabilistic alignment of multiple biological sequences. The core of PicXAA-Web consists of PicXAA, a multiple protein/DNA sequence alignment algorithm, and PicXAA-R, an extension of PicXAA for structural alignment of RNA sequences. Both PicXAA and PicXAA-R are probabilistic non-progressive alignment algorithms that aim to find the optimal alignment of multiple biological sequences by maximizing the expected accuracy. PicXAA and PicXAA-R greedily build up the alignment from sequence regions with high local similarity, thereby yielding an accurate global alignment that effectively captures local similarities among sequences. PicXAA-Web integrates these two algorithms in a user-friendly web platform for accurate alignment and analysis of multiple protein, DNA and RNA sequences. PicXAA-Web can be freely accessed at http://gsp.tamu.edu/picxaa/.


Subjects
Sequence Alignment/methods, Software, Algorithms, Internet, Reproducibility of Results, DNA Sequence Analysis, Protein Sequence Analysis, RNA Sequence Analysis
17.
BMC Bioinformatics ; 12 Suppl 1: S38, 2011 Feb 15.
Article in English | MEDLINE | ID: mdl-21342569

ABSTRACT

BACKGROUND: Accurate and efficient structural alignment of non-coding RNAs (ncRNAs) has attracted increasing attention as recent studies have unveiled the significance of ncRNAs in living organisms. Because Sankoff-style structural alignment algorithms cannot efficiently handle multiple sequences, progressive schemes are mostly used to reduce the complexity. However, this approach tends to propagate early-stage errors throughout the entire process, thereby degrading the quality of the final alignment. For multiple protein sequence alignment, we have recently proposed PicXAA, which constructs an accurate alignment in a non-progressive fashion. RESULTS: Here, we propose PicXAA-R as an extension to PicXAA for greedy structural alignment of ncRNAs. PicXAA-R efficiently captures both folding information within each sequence and local similarities between sequences. It uses a set of probabilistic consistency transformations to improve the posterior base-pairing and base alignment probabilities using the information of all sequences in the alignment. Using a graph-based scheme, we greedily build up the structural alignment from sequence regions with high base-pairing and base alignment probabilities. CONCLUSIONS: Several experiments on datasets with different characteristics confirm that PicXAA-R is one of the fastest algorithms for structural alignment of multiple RNAs and that it consistently yields accurate alignment results, especially for datasets with locally similar sequences. PicXAA-R source code is freely available at: http://www.ece.tamu.edu/~bjyoon/picxaa/.


Subjects
Algorithms, Statistical Models, Untranslated RNA/chemistry, Sequence Alignment/methods, RNA Sequence Analysis/methods, Base Pairing, Computational Biology/methods
18.
Nucleic Acids Res ; 38(15): 4917-28, 2010 Aug.
Article in English | MEDLINE | ID: mdl-20413579

ABSTRACT

Accurate tools for multiple sequence alignment (MSA) are essential for comparative studies of the function and structure of biological sequences. However, it is very challenging to develop a computationally efficient algorithm that can consistently predict accurate alignments for various types of sequence sets. In this article, we introduce PicXAA (Probabilistic Maximum Accuracy Alignment), a probabilistic non-progressive alignment algorithm that aims to find protein alignments with maximum expected accuracy. PicXAA greedily builds up the multiple alignment from sequence regions with high local similarities, thereby yielding an accurate global alignment that effectively captures the local similarities among sequences. Evaluations on several widely used benchmark sets show that PicXAA consistently yields accurate alignment results on a wide range of reference sets, with especially remarkable improvements over other leading algorithms on sequence sets with local similarities. PicXAA source code is freely available at: http://www.ece.tamu.edu/~bjyoon/picxaa/.


Subjects
Algorithms, Sequence Alignment/methods, Protein Sequence Analysis, Computational Biology, Probability