1.
Brief Bioinform ; 22(5)2021 09 02.
Article in English | MEDLINE | ID: mdl-33621981

ABSTRACT

Contigs assembled from third-generation long reads are usually more complete than those assembled from second-generation short reads. However, current algorithms still have difficulty assembling long reads into the ideal complete and accurate genome, i.e. the theoretical best result [1]. With more and more fully sequenced genomes available, long-read contigs can still be improved with the similar genome-assisted reassembly method [2], which was initially proposed for short reads and makes use of a genome closely related (similar genome) to the genome being sequenced (target genome). The method aligns the contigs and reads to the similar genome, and then extends and refines the aligned contigs with the aligned reads. Here, we introduce AlignGraph2, a similar genome-assisted reassembly pipeline for PacBio long reads. The AlignGraph2 pipeline is a completely redesigned second version of the AlignGraph algorithm we proposed earlier. It accepts either error-prone or HiFi long reads as input and contains four novel algorithms: a similarity-aware alignment algorithm and an alignment filtration algorithm for aligning the long reads and preassembled contigs to the similar genome, and a reassembly algorithm and a weight-adjusted consensus algorithm for extending and refining the preassembled contigs. In our performance tests on both error-prone and HiFi long reads, AlignGraph2 can align 5.7-27.2% more long reads and 7.3-56.0% more bases than some current alignment algorithms, and is more efficient than or comparable to the others. For contigs assembled with various de novo algorithms and aligned to similar genomes (aligned contigs), AlignGraph2 can extend 8.7-94.7% of them (extendable contigs), and obtain contigs with a 7.0-249.6% larger N50 value and 5.2-87.7% fewer indels per 100 kbp (extended contigs). With genomes of decreasing similarity, AlignGraph2 also shows relatively stable performance.
The AlignGraph2 software can be downloaded for free from this site: https://github.com/huangs001/AlignGraph2.
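
Several abstracts in this listing report improvements in the N50 value of an assembly. As a reading aid, the following minimal sketch computes N50 under its usual definition: the length of the shortest contig in the smallest set of longest contigs that together cover at least half of the total assembled bases. The function name `n50` and the example lengths are illustrative assumptions and do not come from any of the listed tools.

```python
def n50(lengths):
    """Return the N50 of a collection of contig lengths.

    Contigs are visited from longest to shortest; the N50 is the length of
    the contig at which the running total first reaches half of all bases.
    """
    if not lengths:
        raise ValueError("need at least one contig length")
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length


# Hypothetical contig lengths, for illustration only (290 bases in total).
example_lengths = [80, 70, 50, 40, 30, 20]
print(n50(example_lengths))
```

A "larger N50" therefore means that half of the assembled bases sit in longer contigs; it says nothing by itself about whether those contigs are correct.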


Subjects
Algorithms, Arabidopsis/genetics, CpG Islands/genetics, Fungal Genome, Human Genome, Plant Genome, Saccharomyces cerevisiae/genetics, Sequence Alignment/methods, Software, Contig Mapping/methods, High-Throughput Nucleotide Sequencing/methods, Humans, Whole Genome Sequencing/methods
2.
Bioinformatics ; 35(20): 3953-3960, 2019 10 15.
Article in English | MEDLINE | ID: mdl-30895306

ABSTRACT

MOTIVATION: The third generation PacBio long reads have greatly facilitated sequencing projects with very large read lengths, but they contain about 15% sequencing errors and need error correction. For the projects with long reads only, it is challenging to make correction with fast speed, and also challenging to correct a sufficient amount of read bases, i.e. to achieve high-throughput self-correction. MECAT is currently among the fastest self-correction algorithms, but its throughput is relatively small (Xiao et al., 2017). RESULTS: Here, we introduce FLAS, a wrapper algorithm of MECAT, to achieve high-throughput long-read self-correction while keeping MECAT's fast speed. FLAS finds additional alignments from MECAT prealigned long reads to improve the correction throughput, and removes misalignments for accuracy. In addition, FLAS also uses the corrected long-read regions to correct the uncorrected ones to further improve the throughput. In our performance tests on Escherichia coli, Saccharomyces cerevisiae, Arabidopsis thaliana and human long reads, FLAS can achieve 22.0-50.6% larger throughput than MECAT. FLAS is 2-13× faster compared to the self-correction algorithms other than MECAT, and its throughput is also 9.8-281.8% larger. The FLAS corrected long reads can be assembled into contigs of 13.1-29.8% larger N50 sizes than MECAT. AVAILABILITY AND IMPLEMENTATION: The FLAS software can be downloaded for free from this site: https://github.com/baoe/flas. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.


Subjects
Algorithms, High-Throughput Nucleotide Sequencing, Humans, Saccharomyces cerevisiae, DNA Sequence Analysis, Software
3.
PLoS One ; 13(3): e0193239, 2018.
Article in English | MEDLINE | ID: mdl-29596449

ABSTRACT

The potato aphid, Macrosiphum euphorbiae, is an important agricultural pest that causes economic losses to potato and tomato production. To establish the transcriptome for this aphid, RNA-Seq libraries constructed from aphids maintained on tomato plants were sequenced on the Illumina platform, generating 52.6 million 75-105 bp paired-end reads. The reads were assembled using Velvet/Oases software with SEED preprocessing, resulting in 22,137 contigs with an N50 value of 2,003 bp. After removal of contigs of tomato host origin, 20,254 contigs were annotated using BLASTx searches against the non-redundant protein database from the National Center for Biotechnology Information (NCBI) as well as InterProScan. This identified matches for 74% of the potato aphid contigs. The highest ranking hits for over 12,700 contigs were against the related pea aphid, Acyrthosiphon pisum. Gene Ontology (GO) was used to classify the identified M. euphorbiae contigs into biological process, cellular component and molecular function. Among the contigs, sequences of microbial origin were identified. Sixty-five contigs originated from the aphid's obligate bacterial endosymbiont Buchnera aphidicola, and two contigs had amino acid similarities to viruses. The latter two were named Macrosiphum euphorbiae virus 2 (MeV-2) and Macrosiphum euphorbiae virus 3 (MeV-3). MeV-2 had the highest sequence identity to the Dysaphis plantaginea densovirus, and MeV-3 to the Hubei sobemo-like virus 49. Characterization of MeV-2 and MeV-3 indicated that both are transmitted vertically from adult aphids to nymphs. MeV-2 peptides were detected in the aphid saliva, and only MeV-2, not MeV-3, nucleic acids were detected inside tomato leaves exposed to virus-infected aphids. However, MeV-2 nucleic acids did not persist in tomato leaf tissues after the aphids were removed from the plants, indicating that MeV-2 is likely an aphid virus.


Subjects
Aphids/genetics, Aphids/virology, Gene Expression Profiling, Plant Viruses/genetics, Plant Viruses/isolation & purification, Sequence Analysis, Amino Acid Sequence, Animals, Gene Ontology, Molecular Sequence Annotation, Plant Viruses/physiology, Viral Proteins/chemistry, Viral Proteins/genetics
4.
Bioinformatics ; 34(1): 24-32, 2018 01 01.
Article in English | MEDLINE | ID: mdl-28961789

ABSTRACT

Motivation: Contigs assembled from second-generation sequencing short reads may contain misassemblies, which complicate downstream analysis or even lead to incorrect results. Fortunately, with more and more sequenced species available, it becomes possible to use the reference genome of a closely related species to detect misassemblies. In addition, long reads from third-generation sequencing technology are more and more widely used, and can also help detect misassemblies. Results: Here, we introduce ReMILO, a reference-assisted misassembly detection algorithm that uses both short reads and PacBio SMRT long reads. ReMILO aligns the initial short reads to both the contigs and the reference genome, and then constructs a novel data structure called the red-black multipositional de Bruijn graph to detect misassemblies. In addition, ReMILO aligns the contigs to the long reads and finds their differences from the long reads to detect more misassemblies. In our performance test on short read assemblies of human chromosome 14 data, ReMILO can detect 41.8-77.9% of extensive misassemblies and 33.6-54.5% of local misassemblies. On hybrid short and long read assemblies of S. pastorianus data, ReMILO can also detect 60.6-70.9% of extensive misassemblies and 28.6-54.0% of local misassemblies. Availability and implementation: The ReMILO software can be downloaded for free under Artistic License 2.0 from this site: https://github.com/songc001/remilo. Contact: baoe@bjtu.edu.cn. Supplementary information: Supplementary data are available at Bioinformatics online.


Subjects
High-Throughput Nucleotide Sequencing/methods, DNA Sequence Analysis/methods, Software, Algorithms, Human Chromosome Pair 14, Humans
5.
BMC Bioinformatics ; 18(1): 204, 2017 Apr 05.
Article in English | MEDLINE | ID: mdl-28381259

ABSTRACT

BACKGROUND: The third generation PacBio SMRT long reads can effectively address the read length issue of the second generation sequencing technology, but contain approximately 15% sequencing errors. Several error correction algorithms have been designed to efficiently reduce the error rate to 1%, but they discard large amounts of uncorrected bases and thus lead to low throughput. This loss of bases could limit the completeness of downstream assemblies and the accuracy of analysis. RESULTS: Here, we introduce HALC, a high throughput algorithm for long read error correction. HALC aligns the long reads to short read contigs from the same species with a relatively low identity requirement so that a long read region can be aligned to at least one contig region, including its true genome region's repeats in the contigs sufficiently similar to it (similar repeat based alignment approach). It then constructs a contig graph and, for each long read, references the other long reads' alignments to find the most accurate alignment and correct it with the aligned contig regions (long read support based validation approach). Even though some long read regions without the true genome regions in the contigs are corrected with their repeats, this approach makes it possible to further refine these long read regions with the initial insufficient short reads and correct the uncorrected regions in between. In our performance tests on E. coli, A. thaliana and Maylandia zebra data sets, HALC was able to obtain 6.7-41.1% higher throughput than the existing algorithms while maintaining comparable accuracy. The HALC corrected long reads can thus result in 11.4-60.7% longer assembled contigs than the existing algorithms. CONCLUSIONS: The HALC software can be downloaded for free from this site: https://github.com/lanl001/halc .


Subjects
Algorithms, Animals, Arabidopsis/genetics, Cichlids/genetics, Contig Mapping, Escherichia coli/genetics, Bacterial Genome, Plant Genome, High-Throughput Nucleotide Sequencing, Internet, DNA Sequence Analysis, User-Computer Interface
6.
Bioinformatics ; 30(12): i319-i328, 2014 Jun 15.
Article in English | MEDLINE | ID: mdl-24932000

ABSTRACT

MOTIVATION: De novo assemblies of genomes remain one of the most challenging applications in next-generation sequencing. Usually, their results are incomplete and fragmented into hundreds of contigs. Repeats in genomes and sequencing errors are the main reasons for these complications. With the rapidly growing number of sequenced genomes, it is now feasible to improve assemblies by guiding them with genomes from related species. RESULTS: Here we introduce AlignGraph, an algorithm for extending and joining de novo-assembled contigs or scaffolds guided by closely related reference genomes. It aligns paired-end (PE) reads and preassembled contigs or scaffolds to a close reference. From the obtained alignments, it builds a novel data structure, called the PE multipositional de Bruijn graph. The incorporated positional information from the alignments and PE reads allows us to extend the initial assemblies, while avoiding incorrect extensions and early terminations. In our performance tests, AlignGraph was able to substantially improve the contigs and scaffolds from several assemblers. For instance, 28.7-62.3% of the contigs of Arabidopsis thaliana and human could be extended, resulting in improvements of common assembly metrics, such as an increase of the N50 of the extendable contigs by 89.9-94.5% and 80.3-165.8%, respectively. In another test, AlignGraph was able to improve the assembly of a published genome (Arabidopsis strain Landsberg) by increasing the N50 of its extendable scaffolds by 86.6%. These results demonstrate AlignGraph's efficiency in improving genome assemblies by taking advantage of closely related references. AVAILABILITY AND IMPLEMENTATION: The AlignGraph software can be downloaded for free from this site: https://github.com/baoe/AlignGraph.
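
The abstract above describes building a de Bruijn graph that incorporates positional information from alignments. The sketch below illustrates only that general idea and is not the PE multipositional de Bruijn graph itself: each node is a (k-mer, reference position) pair, so a repeated k-mer aligned at two different positions stays as two separate nodes instead of collapsing into one. The function name, the gap-free alignment assumption and the toy reads are all hypothetical.

```python
from collections import defaultdict


def positional_de_bruijn(aligned_reads, k):
    """Build a graph whose nodes are (k-mer, reference position) pairs.

    aligned_reads: iterable of (sequence, start) tuples, where start is the
    0-based reference position of the read's first base. Alignments are
    assumed to be gap-free, which is a simplification for illustration.
    Returns a dict mapping each node to the set of its successor nodes.
    """
    edges = defaultdict(set)
    for seq, start in aligned_reads:
        # Consecutive k-mers in a read overlap by k-1 bases and are one
        # reference position apart.
        for i in range(len(seq) - k):
            src = (seq[i:i + k], start + i)
            dst = (seq[i + 1:i + 1 + k], start + i + 1)
            edges[src].add(dst)
    return edges


# Toy read containing the 2-mer "AC" twice (positions 0 and 3).
graph = positional_de_bruijn([("ACGAC", 0)], k=2)
for node in sorted(graph, key=lambda n: n[1]):
    print(node, "->", sorted(graph[node]))
```

In a plain de Bruijn graph the two "AC" occurrences would be the same node and create a cycle; keying nodes by position keeps them apart, which is the intuition behind using alignments to avoid incorrect extensions through repeats.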


Subjects
Algorithms, Genomics/methods, High-Throughput Nucleotide Sequencing/methods, Sequence Alignment/methods, DNA Sequence Analysis/methods, Arabidopsis/genetics, Genome, Genomics/standards, High-Throughput Nucleotide Sequencing/standards, Humans, Reference Standards, DNA Sequence Analysis/standards, Software
7.
Bioinformatics ; 29(10): 1250-9, 2013 May 15.
Article in English | MEDLINE | ID: mdl-23493323

ABSTRACT

MOTIVATION: De novo transcriptome assemblies of RNA-Seq data are important for genomics applications of unsequenced organisms. Owing to the complexity and often incomplete representation of transcripts in sequencing libraries, the assembly of high-quality transcriptomes can be challenging. However, with the rapidly growing number of sequenced genomes, it is now feasible to improve RNA-Seq assemblies by guiding them with genomic sequences. RESULTS: This study introduces BRANCH, an algorithm designed for improving de novo transcriptome assemblies by using genomic information that can be partial or complete genome sequences from the same or a related organism. Its input includes assembled RNA reads (transfrags), genomic sequences (e.g. contigs) and the RNA reads themselves. It uses a customized version of BLAT to align the transfrags and RNA reads to the genomic sequences. After identifying exons from the alignments, it defines a directed acyclic graph and maps the transfrags to paths on the graph. It then joins and extends the transfrags by applying an algorithm that solves a combinatorial optimization problem, called the Minimum weight Minimum Path Cover with given Paths. In performance tests on real data from Caenorhabditis elegans and Saccharomyces cerevisiae, assisted by genomic contigs from the same species, BRANCH improved the sensitivity and precision of transfrags generated by Velvet/Oases or Trinity by 5.1-56.7% and 0.3-10.5%, respectively. These improvements added 3.8-74.1% complete transcripts and 8.3-3.8% proteins to the initial assembly. Similar improvements were achieved when guiding the BRANCH processing of a transcriptome assembly from a more complex organism (mouse) with genomic sequences from a related species (rat). AVAILABILITY: The BRANCH software can be downloaded for free from this site: http://manuals.bioinformatics.ucr.edu/home/branch. 
CONTACT: thomas.girke@ucr.edu SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
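
The abstract above names a combinatorial problem, the Minimum weight Minimum Path Cover with given Paths, defined on a directed acyclic graph. The sketch below does not implement that weighted, path-constrained variant. It shows only the simpler textbook relative, the unweighted minimum vertex-disjoint path cover of a DAG, which equals the number of nodes minus the size of a maximum bipartite matching between edge tails and edge heads. Node numbering and the example graph are hypothetical.

```python
def min_path_cover(n, edges):
    """Minimum number of vertex-disjoint paths covering a DAG with n nodes.

    Uses the classic reduction: every matched edge u->v glues v onto the
    path ending in u, so each matched edge saves one path.
    """
    adj = [[] for _ in range(n)]
    for u, v in edges:
        adj[u].append(v)
    # match_right[v] == u means edge u->v is currently in the matching.
    match_right = [-1] * n

    def try_augment(u, seen):
        # Kuhn's augmenting-path search starting from left node u.
        for v in adj[u]:
            if v in seen:
                continue
            seen.add(v)
            if match_right[v] == -1 or try_augment(match_right[v], seen):
                match_right[v] = u
                return True
        return False

    matched = sum(try_augment(u, set()) for u in range(n))
    return n - matched


# Hypothetical exon graph: exon 0 can be followed by exon 1 or exon 2,
# and both lead to exon 3.
print(min_path_cover(4, [(0, 1), (0, 2), (1, 3), (2, 3)]))
```

Because the paths here may not share nodes, the diamond-shaped example needs two paths even though both isoforms pass through exons 0 and 3; formulations used for transcripts typically relax or extend this, which is one reason the problem in the abstract is more involved.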


Subjects
Algorithms, Molecular Sequence Annotation, RNA Sequence Analysis, Animals, Caenorhabditis elegans/genetics, Genomics, Mice, Rats, Saccharomyces cerevisiae/genetics, Software, Transcriptome
8.
Mol Plant Microbe Interact ; 26(1): 67-74, 2013 Jan.
Article in English | MEDLINE | ID: mdl-23194342

ABSTRACT

The interactions between aphids and their host plants seem to be analogous to those of plant-microbial pathogens. Unlike microbial pathogen effectors, little is known about aphid effectors and their ability to interfere with host immunity. To date, only three functional aphid effectors have been reported. To identify potato aphid (Macrosiphum euphorbiae) effectors, we developed a salivary gland transcriptome using Illumina technology. We generated 85 million Illumina reads from salivary glands and assembled them into 646 contigs. Ab initio sequence analysis predicted secretion signal peptides in 24% of these sequences, suggesting that they might be secreted into the plant during aphid feeding. Eight of these candidate effectors with secretion signal peptides were functionally characterized using Agrobacterium tumefaciens-mediated transient overexpression in Nicotiana benthamiana. Two candidate effectors, Me10 and Me23, increased aphid fecundity, suggesting their ability to suppress N. benthamiana defenses. Five of these candidate effectors, including Me10 and Me23, were also analyzed in tomato by delivering them through the Pseudomonas syringae type three secretion system. In tomato, only Me10 increased aphid fecundity. This work identified two additional aphid effectors with ability to manipulate the host for their advantage.


Subjects
Aphids/genetics, Gene Expression Regulation/genetics, Insect Proteins/metabolism, Plant Diseases/parasitology, Solanum tuberosum/parasitology, Transcriptome, Amino Acid Sequence, Animals, Aphids/physiology, Biological Assay, Conserved Sequence, Fertility, Gene Library, High-Throughput Nucleotide Sequencing, Host-Parasite Interactions, Insect Proteins/chemistry, Insect Proteins/genetics, Solanum lycopersicum/parasitology, Molecular Sequence Data, Nymph, Organ Specificity, Protein Sorting Signals, Pseudomonas syringae/genetics, Salivary Glands/metabolism, Sequence Alignment, Nicotiana/parasitology
9.
Bioinformatics ; 27(18): 2502-9, 2011 Sep 15.
Article in English | MEDLINE | ID: mdl-21810899

ABSTRACT

MOTIVATION: Similarity clustering of next-generation sequences (NGS) is an important computational problem to study the population sizes of DNA/RNA molecules and to reduce the redundancies in NGS data. Currently, most sequence clustering algorithms are limited by their speed and scalability, and thus cannot handle data with tens of millions of reads. RESULTS: Here, we introduce SEED, an efficient algorithm for clustering very large NGS sets. It joins sequences into clusters that can differ by up to three mismatches and three overhanging residues from their virtual center. It is based on a modified spaced seed method, called block spaced seeds. Its clustering component operates on the hash tables by first identifying virtual center sequences and then finding all their neighboring sequences that meet the similarity parameters. SEED can cluster 100 million short read sequences in <4 h with a linear time and memory performance. When using SEED as a preprocessing tool on genome/transcriptome assembly data, it was able to reduce the time and memory requirements of the Velvet/Oases assembler for the datasets used in this study by 60-85% and 21-41%, respectively. In addition, the assemblies contained longer contigs than non-preprocessed data as indicated by 12-27% larger N50 values. Compared with other clustering tools, SEED showed the best performance in generating clusters of NGS data similar to true cluster results with a 2- to 10-fold better time performance. While most of SEED's utilities fall into the preprocessing area of NGS data, our tests also demonstrate its efficiency as a stand-alone tool for discovering clusters of small RNA sequences in NGS data from unsequenced organisms. AVAILABILITY: The SEED software can be downloaded for free from this site: http://manuals.bioinformatics.ucr.edu/home/seed. CONTACT: thomas.girke@ucr.edu SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
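
The abstract above mentions hash tables built with block spaced seeds but does not spell out the construction. The sketch below is therefore an assumption-laden illustration of the general spaced-seed idea rather than SEED's method: each read is cut into blocks, and for every choice of blocks to ignore, reads are hashed on the remaining blocks. Two equal-length reads whose mismatches all fall inside the ignored blocks then land in the same bucket. The function name, the block count and the toy reads are hypothetical, and overhanging residues are not handled.

```python
from collections import defaultdict
from itertools import combinations


def candidate_pairs(reads, n_blocks=4, masked=1):
    """Return index pairs of reads that agree once `masked` blocks are ignored.

    All reads must have the same length. One hash table is built per
    combination of ignored blocks; reads sharing a key in any table are
    reported as a candidate pair.
    """
    if not reads:
        return set()
    length = len(reads[0])
    if any(len(read) != length for read in reads):
        raise ValueError("all reads must have the same length")
    # Block b covers read[bounds[b]:bounds[b + 1]].
    bounds = [length * i // n_blocks for i in range(n_blocks + 1)]
    pairs = set()
    for drop in combinations(range(n_blocks), masked):
        table = defaultdict(list)
        for idx, read in enumerate(reads):
            key = tuple(
                read[bounds[b]:bounds[b + 1]]
                for b in range(n_blocks)
                if b not in drop
            )
            table[key].append(idx)
        for bucket in table.values():
            pairs.update(combinations(bucket, 2))
    return pairs


# Reads 0 and 1 differ only in their last base; read 2 is unrelated.
toy_reads = ["ACGTACGT", "ACGTACGA", "TTTTTTTT"]
print(candidate_pairs(toy_reads))
```

Each table costs one pass over the reads, so the work grows linearly with the number of reads for a fixed number of block combinations; candidates found this way would still need an explicit mismatch count before being placed in a cluster.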


Subjects
Base Sequence, Sequence Analysis/methods, Algorithms, Cluster Analysis, Genome, MicroRNAs, RNA Sequence Analysis, Software, Transcriptome