ABSTRACT
BACKGROUND: Adding sequences into an existing (possibly user-provided) alignment has multiple applications, including updating a large alignment with new data, adding sequences into a constraint alignment constructed using biological knowledge, or computing alignments in the presence of sequence length heterogeneity. Although this is a natural problem, only a few tools have been developed to use this information with high fidelity. RESULTS: We present EMMA (Extending Multiple alignments using MAFFT--add) for the problem of adding a set of unaligned sequences into a multiple sequence alignment (i.e., a constraint alignment). EMMA builds on MAFFT--add, which is also designed to add sequences into a given constraint alignment. EMMA improves on MAFFT--add methods by using a divide-and-conquer framework to scale its most accurate version, MAFFT-linsi--add, to constraint alignments with many sequences. We show that EMMA has an accuracy advantage over other techniques for adding sequences into alignments under many realistic conditions and can scale to large datasets with high accuracy (hundreds of thousands of sequences). EMMA is available at https://github.com/c5shen/EMMA . CONCLUSIONS: EMMA is a new tool that provides high accuracy and scalability for adding sequences into an existing alignment.
ABSTRACT
Synchronization (insertions-deletions) errors are still a major challenge for reliable information retrieval in DNA storage. Unlike traditional error correction codes (ECC) that add redundancy in the stored information, multiple sequence alignment (MSA) solves this problem by searching the conserved subsequences. In this paper, we conduct a comprehensive simulation study on the error correction capability of a typical MSA algorithm, MAFFT. Our results reveal that its capability exhibits a phase transition when there are around 20% errors. Below this critical value, increasing sequencing depth can eventually allow it to approach complete recovery. Otherwise, its performance plateaus at some poor levels. Given a reasonable sequencing depth (≤ 70), MSA could achieve complete recovery in the low error regime, and effectively correct 90% of the errors in the medium error regime. In addition, MSA is robust to imperfect clustering. It could also be combined with other means such as ECC, repeated markers, or any other code constraints. Furthermore, by selecting an appropriate sequencing depth, this strategy could achieve an optimal trade-off between cost and reading speed. MSA could be a competitive alternative for future DNA storage.
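As an illustration of how an alignment of noisy reads can correct insertion-deletion errors, the sketch below takes reads that have already been aligned (for example by MAFFT) and derives a consensus by per-column majority vote. The function name `column_consensus` and the toy reads are assumptions made for this example; the abstract does not specify how the consensus was extracted, so this should not be read as the authors' procedure.

```python
from collections import Counter


def column_consensus(aligned_reads, gap="-"):
    """Majority-vote consensus over equal-length aligned reads.

    Columns whose most common symbol is the gap character are dropped,
    which removes insertions that appear in only a minority of reads.
    """
    if not aligned_reads:
        return ""
    width = len(aligned_reads[0])
    if any(len(read) != width for read in aligned_reads):
        raise ValueError("aligned reads must all have the same length")

    consensus = []
    for column in zip(*aligned_reads):
        symbol, _count = Counter(column).most_common(1)[0]
        if symbol != gap:
            consensus.append(symbol)
    return "".join(consensus)


# Hypothetical aligned reads: one insertion, one deletion, one substitution.
reads = [
    "ACGT-ACGT",
    "ACGTTACGT",  # extra T inserted in column 5
    "ACGT-ACCT",  # G -> C substitution
    "AC-T-ACGT",  # G deleted
]
print(column_consensus(reads))
```

With more reads per column (greater sequencing depth), each vote is decided by a larger sample, which is consistent with the depth dependence described above.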
Subjects
Algorithms, DNA, Sequence Alignment, DNA/genetics, Computer Simulation, DNA Sequence Analysis
ABSTRACT
Long DNA and RNA reads from nanopore and PacBio technologies have many applications, but the raw reads have a substantial error rate. More accurate sequences can be obtained by merging multiple reads from overlapping parts of the same sequence. lamassemble aligns up to ~1000 reads to each other, and makes a consensus sequence, which is often much more accurate than the raw reads. It is useful for studying a region of interest such as an expanded tandem repeat or other disease-causing mutation.
Subjects
Consensus Sequence, Genomics/methods, Sequence Alignment/methods, DNA Sequence Analysis/methods, Software, Animals, Genetic Techniques, High-Throughput Nucleotide Sequencing, Humans, Nanopores
ABSTRACT
This book chapter introduces the National Center for Biotechnology Information (NCBI) Genome Workbench, a desktop GUI software package for manipulating and visualizing complex molecular biology models provided in many data formats. Genome Workbench integrates graphical views and computational tools in a single package to facilitate discoveries. In this chapter we provide a step-by-step protocol for comparative analysis of sequences using NCBI BLAST and multiple sequence alignment algorithms, for building phylogenetic trees, and for using graphical views of sequences, alignments, and trees to validate the findings. The software package can be used to prepare high-quality whole genome submissions to NCBI. It is user-friendly and includes validation and editing tools to fix errors as part of preparing the submission.
Subjects
Computational Biology/methods, Nucleic Acid Databases/organization & administration, Genomics/methods, Sequence Alignment/methods, Software, Algorithms, Genome/genetics, Phylogeny, PubMed/organization & administration
ABSTRACT
Iron is an essential micronutrient for most living beings since it participates as a redox active cofactor in many biological processes including cellular respiration, lipid biosynthesis, DNA replication and repair, and ribosome biogenesis and recycling. However, when present in excess, iron can participate in Fenton reactions and generate reactive oxygen species that damage cells at the level of proteins, lipids and nucleic acids. Organisms have developed different molecular strategies to protect themselves against the harmful effects of high concentrations of iron. In the case of fungi and plants, detoxification mainly occurs by importing cytosolic iron into the vacuole through the Ccc1/VIT1 iron transporter. New sequenced genomes and bioinformatic tools are facilitating the functional characterization, evolution and ecological relevance of metabolic pathways and homeostatic networks across the Tree of Life. Sequence analysis shows that Ccc1/VIT1 homologs are widely distributed among organisms with the exception of animals. The recent elucidation of the crystal structure of a Ccc1/VIT1 plant ortholog has enabled the identification of both conserved and species-specific motifs required for its metal transport mechanism. Moreover, recent studies in the yeast Saccharomyces cerevisiae have also revealed that multiple transcription factors including Yap5 and Msn2/Msn4 contribute to the expression of CCC1 in high-iron conditions. Interestingly, Malaysian S. cerevisiae strains express a partially functional Ccc1 protein that renders them sensitive to iron. Different regulatory mechanisms have been described for non-Saccharomycetaceae Ccc1 homologs. The characterization of Ccc1/VIT1 proteins is of high interest in the development of biofortified crops and the protection against microbial-derived diseases.
ABSTRACT
Spiders are among the world's most species-rich animal lineages, and their visual systems are likewise highly diverse. These modular visual systems, composed of four pairs of image-forming "camera" eyes, have taken on a huge variety of forms, exhibiting variation in eye size, eye placement, image resolution, and field of view, as well as sensitivity to color, polarization, light levels, and motion cues. However, despite this conspicuous diversity, our understanding of the genetic underpinnings of these visual systems remains shallow. Here, we review the current literature, analyze publicly available transcriptomic data, and discuss hypotheses about the origins and development of spider eyes. Our efforts highlight that there are many new things to discover from spider eyes, and yet these opportunities are set against a backdrop of deep homology with other arthropod lineages. For example, many (but not all) of the genes that appear important for early eye development in spiders are familiar players known from the developmental networks of other model systems (e.g., Drosophila). Similarly, our analyses of opsins and related phototransduction genes suggest that spider photoreceptors employ many of the same genes and molecular mechanisms known from other arthropods, with a hypothesized ancestral spider set of four visual and four nonvisual opsins. This deep homology provides a number of useful footholds into new work on spider vision and the molecular basis of its extant variety. We therefore discuss what some of these first steps might be in the hopes of convincing others to join us in studying the vision of these fascinating creatures.
Subjects
Molecular Evolution, Spiders/genetics, Animals, Opsins/genetics, Invertebrate Photoreceptor Cells/physiology, Spiders/classification
ABSTRACT
BACKGROUND: Multiple sequence alignment (MSA) is one of the most important research topics in bioinformatics, and a number of MSA programs have emerged. The accuracy of MSA programs depends strongly on parameter settings, chiefly the gap open penalty (GOP), gap extension penalty (GEP), and substitution matrix (SM). This research seeks GOP, GEP, and SM values that perform better than the MAFFT default parameters. RESULTS: The MAFFT program is benchmarked on the BAliBASE 3.0 database, and optimal MAFFT parameters are obtained that perform better than the default parameters of both CLUSTALW and MAFFT. CONCLUSIONS: The optimal parameters can improve the results of multiple sequence alignment, and the approach is feasible and efficient.
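To make the role of the two gap parameters concrete, the following sketch scores the gaps in one aligned row under a common affine convention, where a gap run of length k costs GOP + (k - 1) × GEP. This convention, the function name `affine_gap_cost`, and the integer penalty values are assumptions chosen for illustration; aligners differ in their exact definitions, and MAFFT's internal scoring is not reproduced here.

```python
import re


def affine_gap_cost(aligned_row, gap_open, gap_extend, gap="-"):
    """Total affine gap cost of one aligned row.

    Each maximal run of gap characters of length k contributes
    gap_open + (k - 1) * gap_extend (an assumed convention).
    """
    runs = re.findall(re.escape(gap) + "+", aligned_row)
    return sum(gap_open + gap_extend * (len(run) - 1) for run in runs)


# Hypothetical row with one gap of length 3 and one of length 1.
row = "AC---GT-A"
print(affine_gap_cost(row, gap_open=10, gap_extend=1))
```

Under this convention a high GOP relative to GEP favors a few long gaps over many short ones, which is why the two values are usually tuned together with the substitution matrix.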
ABSTRACT
BACKGROUND: Progressive alignment is the standard approach used to align large numbers of sequences. As with all heuristics, this involves a tradeoff between alignment accuracy and computation time. RESULTS: We examine this tradeoff and find that, because of a loss of information in the early steps of the approach, the alignments generated by the most common multiple sequence alignment programs are inherently unstable, and simply reversing the order of the sequences in the input file will cause a different alignment to be generated. Although this effect is more obvious with larger numbers of sequences, it can also be seen with data sets in the order of one hundred sequences. We also outline the means to determine the number of sequences in a data set beyond which the probability of instability will become more pronounced. CONCLUSIONS: This has major ramifications for both the designers of large-scale multiple sequence alignment algorithms, and for the users of these alignments.
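One way to quantify the instability described here is to align the same sequences twice (once in the original input order, once reversed) and measure how many aligned residue pairs the two results share. The sketch below computes that overlap as a Jaccard ratio. The function names, the dictionary input format, and the toy alignments are assumptions for illustration, not the measure used in the study.

```python
from itertools import combinations


def aligned_pairs(alignment, gap="-"):
    """Set of residue pairs that share a column.

    `alignment` maps sequence name -> aligned row. Each residue is
    identified by (name, index within the ungapped sequence).
    """
    names = sorted(alignment)
    width = len(alignment[names[0]])
    if any(len(alignment[name]) != width for name in names):
        raise ValueError("all rows must have the same length")

    next_index = {name: 0 for name in names}
    pairs = set()
    for col in range(width):
        present = []
        for name in names:
            if alignment[name][col] != gap:
                present.append((name, next_index[name]))
                next_index[name] += 1
        pairs.update(combinations(present, 2))
    return pairs


def pair_agreement(aln_a, aln_b):
    """Shared aligned pairs divided by all aligned pairs (1.0 = identical)."""
    pairs_a, pairs_b = aligned_pairs(aln_a), aligned_pairs(aln_b)
    if not pairs_a and not pairs_b:
        return 1.0
    return len(pairs_a & pairs_b) / len(pairs_a | pairs_b)


# Hypothetical outputs for the same two sequences in two input orders.
forward = {"s1": "AC-GT", "s2": "ACAGT"}
reverse = {"s1": "ACG-T", "s2": "ACAGT"}
print(pair_agreement(forward, reverse))
```

A value below 1.0 means the two runs disagree on at least one residue pairing, even though both contain exactly the same sequences.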
ABSTRACT
Multiple sequence alignment plays a key role in the computational analysis of biological data. Different programs are developed to analyze the sequence similarity. This paper highlights the algorithmic techniques of the most popular multiple sequence alignment programs. These programs are then evaluated on the basis of execution time and scalability. The overall performance of these programs is assessed to highlight their strengths and weaknesses with reference to their algorithmic techniques. In terms of overall alignment quality, T-Coffee and Mafft attain the highest average scores, whereas K-align has the minimum computation time.
Subjects
Algorithms, Automated Pattern Recognition/methods, Sequence Alignment/methods, Sequence Analysis/methods, Software, Reproducibility of Results, Sensitivity and Specificity
ABSTRACT
The CIPRES Science Gateway is a community web application that provides public access to a set of parallel tree inference and multiple sequence alignment codes run on large computational resources. These resources are made available at no charge to users by the NSF Extreme Science and Engineering Discovery Environment (XSEDE) project. Here we describe the CIPRES RESTful application programmer interface (CRA), a web service that provides programmatic access to all resources and services currently offered by the CIPRES Science Gateway. Software developers can use the CRA to extend their web or desktop applications to include the ability to run MrBayes, BEAST, RAxML, MAFFT, and other computationally intensive algorithms on XSEDE. The CRA also makes it possible for individuals with modest scripting skills to access the same tools from the command line using curl, or through any scripting language. This report describes the CRA and its use in three web applications (Influenza Research Database - www.fludb.org, Virus Pathogen Resource - www.viprbrc.org, and MorphoBank - www.morphobank.org). The CRA is freely accessible to registered users at https://cipresrest.sdsc.edu/cipresrest/v1; supporting documentation and registration tools are available at https://www.phylo.org/restusers.
ABSTRACT
A phylogenetic hypothesis for the lepidopteran superfamily Noctuoidea was inferred based on the complete mitochondrial (mt) genomes of 12 species (six newly sequenced). The monophyly of each noctuoid family in the latest classification was well supported. Novel and robust relationships were recovered at the family level, in contrast to previous analyses using nuclear genes. Erebidae was recovered as sister to (Nolidae+(Euteliidae+Noctuidae)), while Notodontidae was sister to all these taxa (the putatively basalmost lineage Oenosandridae was not included). In order to improve phylogenetic resolution using mt genomes, various analytical approaches were tested: Bayesian inference (BI) vs. maximum likelihood (ML), excluding vs. including RNA genes (rRNA or tRNA), and Gblocks treatment. The evolutionary signal within mt genomes had low sensitivity to analytical changes. Inference methods had the most significant influence. Inclusion of tRNAs positively increased the congruence of topologies, while inclusion of rRNAs resulted in a range of phylogenetic relationships varying depending on other analytical factors. The two Gblocks parameter settings had opposite effects on nodal support between the two inference methods. The relaxed parameter (GBRA) resulted in higher support values in BI analyses, while the strict parameter (GBDH) resulted in higher support values in ML analyses.
Subjects
Insect Genome, Mitochondrial Genome, Moths/classification, Phylogeny, Animals, Bayes Theorem, Mitochondrial DNA/genetics, Gene Order, Lepidoptera/genetics, Likelihood Functions, Moths/genetics, Ribosomal RNA/genetics, Transfer RNA/genetics, DNA Sequence Analysis
ABSTRACT
Guide trees are used to decide the order of sequence alignment in the progressive multiple sequence alignment heuristic. These guide trees are often the limiting factor in making large alignments, and considerable effort has been expended over the years in making these quickly or accurately. In this article we show that, at least for protein families with large numbers of sequences that can be benchmarked with known structures, simple chained guide trees give the most accurate alignments. These also happen to be the fastest and simplest guide trees to construct, computationally. Such guide trees have a striking effect on the accuracy of alignments produced by some of the most widely used alignment packages. There is a marked increase in accuracy and a marked decrease in computational time, once the number of sequences goes much above a few hundred. This is true, even if the order of sequences in the guide tree is random.
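A chained guide tree is a fully imbalanced tree in which sequences are added to the growing alignment one at a time. The sketch below builds such a tree as a Newick string from a list of sequence names. The function name `chained_guide_tree` and the example names are assumptions for illustration; whether and how a given aligner accepts a user-supplied guide tree depends on that program.

```python
import random


def chained_guide_tree(names):
    """Return a chained (caterpillar) guide tree in Newick format.

    The first two names form the innermost pair; each later name is
    joined to everything added before it.
    """
    if not names:
        raise ValueError("at least one sequence name is required")
    tree = names[0]
    for name in names[1:]:
        tree = f"({tree},{name})"
    return tree + ";"


names = ["seqA", "seqB", "seqC", "seqD"]
print(chained_guide_tree(names))  # (((seqA,seqB),seqC),seqD);

# A random chain order, as mentioned in the abstract (seed fixed only
# so that the example is repeatable).
shuffled = names[:]
random.Random(0).shuffle(shuffled)
print(chained_guide_tree(shuffled))
```

Building the chain takes time linear in the number of sequences and needs no pairwise distance matrix, which matches the observation that these trees are the fastest and simplest to construct.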