1.
Brief Bioinform ; 23(5)2022 09 20.
Article in English | MEDLINE | ID: mdl-35849103

ABSTRACT

Accurate identification of genetic variants from family child-mother-father trio sequencing data is important in genomics. However, state-of-the-art approaches treat variant calling from trios as three independent tasks, which limits their calling accuracy for Nanopore long-read sequencing data. For better trio variant calling, we introduce Clair3-Trio, the first variant caller tailored for family trio data from Nanopore long-reads. Clair3-Trio employs a Trio-to-Trio deep neural network model, which allows it to input the trio sequencing information and output all of the trio's predicted variants within a single model to improve variant calling. We also present MCVLoss, a novel loss function tailor-made for variant calling in trios, leveraging the explicit encoding of the Mendelian inheritance. Clair3-Trio showed comprehensive improvement in experiments. It predicted far fewer Mendelian inheritance violation variations than current state-of-the-art methods. We also demonstrated that our Trio-to-Trio model is more accurate than competing architectures. Clair3-Trio is accessible as a free, open-source project at https://github.com/HKU-BAL/Clair3-Trio.
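The notion of a Mendelian inheritance violation mentioned in the abstract can be illustrated with a minimal Python sketch. This is not Clair3-Trio's model or MCVLoss; it is a plain consistency check on already-called diploid genotypes, assuming autosomal sites, no de novo mutations, and alleles encoded as integers (0 = reference, 1 = alternate). The function name and the toy calls are hypothetical.

```python
from itertools import product


def is_mendelian_consistent(child, mother, father):
    """Return True if the child's diploid genotype can be formed by taking
    one allele from the mother and one allele from the father."""
    child_alleles = sorted(child)
    return any(
        sorted((m, f)) == child_alleles for m, f in product(mother, father)
    )


# Toy trio calls (hypothetical): site -> (child, mother, father) genotypes.
trio_calls = {
    "site_A": ((0, 1), (0, 0), (0, 1)),  # child 0/1 is explained by the parents
    "site_B": ((1, 1), (0, 0), (0, 1)),  # mother 0/0 cannot transmit allele 1
}

violations = [
    site
    for site, (child, mother, father) in trio_calls.items()
    if not is_mendelian_consistent(child, mother, father)
]
print(violations)
```

Counting such violating sites across a call set is one simple way to compare trio callers, under the assumptions stated above.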


Subjects
Nanopores; Genomics/methods; High-Throughput Nucleotide Sequencing/methods; Humans; Neural Networks, Computer; Sequence Analysis, DNA; Software
2.
J Transl Med ; 22(1): 122, 2024 01 31.
Article in English | MEDLINE | ID: mdl-38297333

ABSTRACT

BACKGROUND: Emerging evidence suggests that Rho GTPases play a crucial role in tumorigenesis and metastasis, but their involvement in the tumor microenvironment (TME) and prognosis of hepatocellular carcinoma (HCC) is not well understood. METHODS: We aim to develop a tumor prognosis prediction system called the Rho GTPases-related gene score (RGPRG score) using Rho GTPase signaling genes and further bioinformatic analyses. RESULTS: Our work found that HCC patients with a high RGPRG score had significantly worse survival and increased immunosuppressive cell fractions compared to those with a low RGPRG score. Single-cell cohort analysis revealed an immune-active TME in patients with a low RGPRG score, with strengthened communication from T/NK cells to other cells through MIF signaling networks. Targeting these alterations in TME, the patients with high RGPRG score have worse immunotherapeutic outcomes and decreased survival time in the immunotherapy cohort. Moreover, the RGPRG score was found to be correlated with survival in 27 other cancers. In vitro experiments confirmed that knockdown of the key Rho GTPase-signaling biomarker SFN significantly inhibited HCC cell proliferation, invasion, and migration. CONCLUSIONS: This study provides new insight into the TME features and clinical use of Rho GTPase gene pattern at the bulk-seq and single-cell level, which may contribute to guiding personalized treatment and improving clinical outcome in HCC.


Subjects
Carcinoma, Hepatocellular; Liver Neoplasms; Humans; Carcinoma, Hepatocellular/genetics; Liver Neoplasms/genetics; Carcinogenesis; Cell Line; Immunosuppressive Agents; rho GTP-Binding Proteins; Tumor Microenvironment
3.
BMC Bioinformatics ; 24(1): 308, 2023 Aug 03.
Article in English | MEDLINE | ID: mdl-37537536

ABSTRACT

BACKGROUND: With the continuous advances in third-generation sequencing technology and the increasing affordability of next-generation sequencing technology, sequencing data from different sequencing technology platforms is becoming more common. While numerous benchmarking studies have been conducted to compare variant-calling performance across different platforms and approaches, little attention has been paid to the potential of leveraging the strengths of different platforms to optimize overall performance, especially integrating Oxford Nanopore and Illumina sequencing data. RESULTS: We investigated the impact of multi-platform data on the performance of variant calling through carefully designed experiments with a deep learning-based variant caller named Clair3-MP (Multi-Platform). Through our research, we not only demonstrated the capability of ONT-Illumina data for improved variant calling, but also identified the optimal scenarios for utilizing ONT-Illumina data. In addition, we revealed that the improvement in variant calling using ONT-Illumina data comes from an improvement in difficult genomic regions, such as the large low-complexity regions and segmental and collapse duplication regions. Moreover, Clair3-MP can incorporate reference genome stratification information to achieve a small but measurable improvement in variant calling. Clair3-MP is accessible as an open-source project at: https://github.com/HKU-BAL/Clair3-MP . CONCLUSIONS: These insights have important implications for researchers and practitioners alike, providing valuable guidance for improving the reliability and efficiency of genomic analysis in diverse applications.


Subjects
Genome; Genomics; Reproducibility of Results; High-Throughput Nucleotide Sequencing
4.
Clin Chem ; 69(10): 1174-1185, 2023 10 03.
Article in English | MEDLINE | ID: mdl-37537871

ABSTRACT

BACKGROUND: HIV infections often develop drug resistance mutations (DRMs), which can increase the risk of virological failure. However, it has been difficult to determine if minor mutations occur in the same genome or in different virions using Sanger sequencing and short-read sequencing methods. Oxford Nanopore Technologies (ONT) sequencing may improve antiretroviral resistance profiling by allowing for long-read clustering. METHODS: A new ONT sequencing-based method for profiling DRMs in HIV quasispecies was developed and validated. The method used hierarchical clustering of long amplicons that cover regions associated with different types of antiretroviral drugs. A gradient series of an HIV plasmid and 2 plasma samples was prepared to validate the clustering performance. The ONT results were compared to those obtained with Sanger sequencing and Illumina sequencing in 77 HIV-positive plasma samples to evaluate the diagnostic performance. RESULTS: In the validation study, the abundance of detected quasispecies was concordant with the predicted result, with an R² > 0.99. During the diagnostic evaluation, 59/77 samples were successfully sequenced for DRMs. Among 18 failed samples, 17 were below the limit of detection of 303.9 copies/µL. Based on the receiver operating characteristic analysis, the ONT workflow achieved an F1 score of 0.96 with a cutoff of 0.4 variant allele frequency. Four cases were found to have quasispecies with DRMs, in which 2 harbored quasispecies with more than one class of DRMs. Treatment modifications were recommended for these cases. CONCLUSIONS: Long-read sequencing coupled with hierarchical clustering could differentiate the quasispecies resistance profiles in HIV-infected samples, providing a clearer picture for medical care.
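How an F1 score depends on a variant allele frequency (VAF) cutoff can be sketched in a few lines of Python. The data below are invented for illustration and are not from the study. The sketch assumes a variant is called when its VAF is at or above the cutoff, and that a reference truth label is available for each candidate. The function name is hypothetical.

```python
def f1_at_cutoff(vafs, truth, cutoff):
    """F1 score of calling a variant whenever its VAF is >= cutoff."""
    calls = [vaf >= cutoff for vaf in vafs]
    tp = sum(1 for c, t in zip(calls, truth) if c and t)
    fp = sum(1 for c, t in zip(calls, truth) if c and not t)
    fn = sum(1 for c, t in zip(calls, truth) if not c and t)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Invented toy data: observed VAFs and whether each variant is truly present.
vafs = [0.05, 0.30, 0.45, 0.80, 0.95, 0.42]
truth = [False, False, True, True, True, False]

for cutoff in (0.2, 0.4, 0.6):
    print(cutoff, round(f1_at_cutoff(vafs, truth, cutoff), 3))
```

Sweeping the cutoff in this way and picking the value with the best score mirrors, in a simplified form, how a threshold can be chosen from a receiver operating characteristic style analysis.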


Subjects
HIV Infections; HIV-1; Humans; HIV Infections/drug therapy; Quasispecies/genetics; HIV-1/genetics; Anti-Retroviral Agents/pharmacology; Anti-Retroviral Agents/therapeutic use; Mutation; High-Throughput Nucleotide Sequencing/methods; Cluster Analysis
5.
BMC Bioinformatics ; 23(1): 465, 2022 Nov 07.
Article in English | MEDLINE | ID: mdl-36344913

ABSTRACT

BACKGROUND: Whole genome sequencing using the long-read Oxford Nanopore Technologies (ONT) MinION sequencer provides a cost-effective option for structural variant (SV) detection in clinical applications. Despite the advantage of using long reads, however, accurate SV calling and phasing are still challenging. RESULTS: We introduce Duet, an SV detection tool optimized for SV calling and phasing using ONT data. The tool uses novel features integrated from both SV signatures and single-nucleotide polymorphism signatures, which can accurately distinguish SV haplotype from a false signal. Duet was benchmarked against state-of-the-art tools on multiple ONT sequencing datasets of sequencing coverage ranging from 8× to 40×. At low sequencing coverage of 8×, Duet performs better than all other tools in SV calling, SV genotyping and SV phasing. When the sequencing coverage is higher (20× to 40×), the F1-score for SV phasing is further improved in comparison to the performance of other tools, while its performance of SV genotyping and SV calling remains higher than other tools. CONCLUSION: Duet can perform accurate SV calling, SV genotyping and SV phasing using low-coverage ONT data, making it very useful for low-coverage genomes. It has great performance when scaled to high-coverage genomes, which is adaptable to various clinical applications. Duet is open source and is available at https://github.com/yekaizhou/duet .


Subjects
Nanopore Sequencing; Polymorphism, Single Nucleotide; Sequence Analysis, DNA; High-Throughput Nucleotide Sequencing; Whole Genome Sequencing
6.
Clin Infect Dis ; 73(11): e4154-e4165, 2021 12 06.
Article in English | MEDLINE | ID: mdl-33388749

ABSTRACT

BACKGROUND: Children and older adults with coronavirus disease 2019 (COVID-19) display a distinct spectrum of disease severity, yet the risk factors are not well understood. We sought to examine the expression pattern of angiotensin-converting enzyme 2 (ACE2), the cell-entry receptor for severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2), and the role of lung progenitor cells in children and older patients. METHODS: We retrospectively analyzed clinical features in a cohort of 299 patients with COVID-19. The expression and distribution of ACE2 and lung progenitor cells were systematically examined using a combination of public single-cell RNA-seq data sets, lung biopsies, and ex vivo infection of lung tissues with SARS-CoV-2 pseudovirus in children and older adults. We also followed up patients who had recovered from COVID-19. RESULTS: Compared with children, older patients (>50 years) were more likely to develop serious pneumonia with reduced lymphocytes and an aberrant inflammatory response (P = .001). The expression levels of ACE2 and lung progenitor cell markers were generally decreased in older patients. Notably, ACE2-positive cells were mainly distributed in the alveolar region, including SFTPC-positive cells, but rarely in airway regions in the older adults (P < .01). The follow-up of discharged patients revealed a prolonged recovery from pneumonia in older patients (P < .025). CONCLUSIONS: Compared to children, ACE2-positive cells are generally decreased in older adults and are found mainly in the lower pulmonary tract. The lung progenitor cells are also decreased. These risk factors may impact disease severity and recovery from pneumonia caused by SARS-CoV-2 infection in older patients.


Subjects
Angiotensin-Converting Enzyme 2/genetics; COVID-19; Stem Cells; Aged; Child; Humans; Lung/cytology; Middle Aged; RNA-Seq; Retrospective Studies; Severity of Illness Index
7.
Environ Microbiol ; 23(5): 2339-2363, 2021 05.
Article in English | MEDLINE | ID: mdl-33769683

ABSTRACT

The global propagation of SARS-CoV-2 and the detection of a large number of variants, some of which have replaced the original clade to become dominant, underscore the fact that the virus is actively exploring its evolutionary space. The longer high levels of viral multiplication occur (permitted by high levels of transmission), the more the virus can adapt to the human host and find ways to succeed. The third wave of the COVID-19 pandemic is starting in different parts of the world, emphasizing that transmission containment measures that are being imposed are not adequate. Part of the consideration in determining containment measures is the rationale that vaccination will soon stop transmission and allow a return to normality. However, vaccines themselves represent a selection pressure for evolution of vaccine-resistant variants, so the coupling of a policy of permitting high levels of transmission/virus multiplication during vaccine roll-out with the expectation that vaccines will deal with the pandemic is unrealistic. In the absence of effective antivirals, it is not improbable that SARS-CoV-2 infection prophylaxis will involve an annual vaccination campaign against 'dominant' viral variants, similar to influenza prophylaxis. Living with COVID-19 will be an issue of SARS-CoV-2 variants and evolution. It is therefore crucial to understand how SARS-CoV-2 evolves and what constrains its evolution, in order to anticipate the variants that will emerge. Thus far, the focus has been on the receptor-binding spike protein, but the virus is complex, encoding 26 proteins which interact with a large number of host factors, so the possibilities for evolution are manifold and not predictable a priori. However, if we are to mount the best defence against COVID-19, we must mount it against the variants, and to do this, we must have knowledge about the evolutionary possibilities of the virus.
In addition to the generic cellular interactions of the virus, there are extensive polymorphisms in humans (e.g. Lewis, HLA, etc.), some distributed within most or all populations, some restricted to specific ethnic populations and these variations pose additional opportunities for/constraints on viral evolution. We now have the wherewithal - viral genome sequencing, protein structure determination/modelling, protein interaction analysis - to functionally characterize viral variants, but access to comprehensive genome data is extremely uneven. Yet, to develop an understanding of the impacts of such evolution on transmission and disease, we must link it to transmission (viral epidemiology) and disease data (patient clinical data), and the population granularities of these. In this editorial, we explore key facets of viral biology and the influence of relevant aspects of human polymorphisms, human behaviour, geography and climate and, based on this, derive a series of recommendations to monitor viral evolution and predict the types of variants that are likely to arise.


Subjects
Biological Evolution; COVID-19/prevention & control; COVID-19/virology; SARS-CoV-2/genetics; COVID-19/epidemiology; COVID-19/genetics; Disease Transmission, Infectious/prevention & control; Genetic Variation; Host-Pathogen Interactions; Humans; SARS-CoV-2/physiology; Virus Replication
8.
BMC Genomics ; 21(Suppl 6): 500, 2020 Dec 21.
Article in English | MEDLINE | ID: mdl-33349238

ABSTRACT

BACKGROUND: Next-generation sequencing (NGS) enables unbiased detection of pathogens by mapping the sequencing reads of a patient sample to the known reference sequences of bacteria and viruses. However, for a new pathogen without a reference sequence of a close relative, or with a high load of mutations compared to its predecessors, read mapping fails due to a low similarity between the pathogen and reference sequence, which in turn leads to insensitive and inaccurate pathogen detection outcomes. RESULTS: We developed MegaPath, which runs fast and provides high sensitivity in detecting new pathogens. In MegaPath, we have implemented and tested a combination of polishing techniques to remove non-informative human reads and spurious alignments. MegaPath applies a global optimization to the read alignments and reassigns the reads incorrectly aligned to multiple species to a unique species. The reassignment not only significantly increased the number of reads aligned to distant pathogens, but also significantly reduced incorrect alignments. MegaPath implements an enhanced maximum-exact-match prefix seeding strategy and a SIMD-accelerated Smith-Waterman algorithm to run fast. CONCLUSIONS: In our benchmarks, MegaPath demonstrated superior sensitivity by detecting eight times more reads from a low-similarity pathogen than other tools. Meanwhile, MegaPath ran much faster than the other state-of-the-art alignment-based pathogen detection tools (and was comparable in speed with the less sensitive profile-based pathogen detection tools). The running time of MegaPath is about 20 min on a typical 1 Gb dataset.
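For readers unfamiliar with the Smith-Waterman algorithm named in the abstract, the following is a minimal scalar Python sketch of local alignment scoring. It is not MegaPath's SIMD-accelerated implementation. The scoring values (match 2, mismatch -1, linear gap -2) and the function name are assumptions chosen for illustration.

```python
def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-2):
    """Best local alignment score between sequences a and b
    (linear gap penalty, two-row dynamic programming)."""
    prev = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # A local alignment may restart anywhere, hence the floor at 0.
            curr[j] = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best


print(smith_waterman_score("TTACGTTT", "GGACGTGG"))  # shared core "ACGT"
```

Each cell depends only on its left, upper, and upper-left neighbours, which is the property that vectorized implementations exploit to fill many cells per instruction.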


Subjects
Metagenomics; Software; Algorithms; High-Throughput Nucleotide Sequencing; Humans; Metagenome; Sequence Alignment; Sequence Analysis, DNA
9.
Bioinformatics ; 34(21): 3744-3746, 2018 11 01.
Article in English | MEDLINE | ID: mdl-29771282

ABSTRACT

Summary: AC-DIAMOND (v1) is a DNA-protein alignment tool designed to tackle the efficiency challenge of aligning large numbers of reads or contigs to protein databases. When compared with the previously most efficient method, DIAMOND, AC-DIAMOND gains a 6- to 7-fold speed-up while retaining a similar degree of sensitivity. The improvement stems from two aspects: first, using a compressed index of adaptive-length seeds to speed up the matching between query and reference sequences; second, adopting a compact form of dynamic programming to fully utilize the parallelism of the SIMD capability. Availability and implementation: Software source code and binaries are available at https://github.com/Maihj/AC-DIAMOND/. Supplementary information: Supplementary data are available at Bioinformatics online.


Subjects
Software; DNA; Databases, Protein; Proteins; Sequence Analysis, DNA
10.
Plant Cell ; 27(6): 1595-604, 2015 Jun.
Article in English | MEDLINE | ID: mdl-26002866

ABSTRACT

Structural variations (SVs) represent a major source of genetic diversity. However, the functional impact and formation mechanisms of SVs in plant genomes remain largely unexplored. Here, we report a nucleotide-resolution SV map of cucumber (Cucumis sativus) that comprises 26,788 SVs based on deep resequencing of 115 diverse accessions. The largest proportion of cucumber SVs was formed through nonhomologous end-joining rearrangements, and the occurrence of SVs is closely associated with regions of high nucleotide diversity. These SVs affect the coding regions of 1676 genes, some of which are associated with cucumber domestication. Based on the map, we discovered a copy number variation (CNV) involving four genes that defines the Female (F) locus and gives rise to gynoecious cucumber plants, which bear only female flowers and set fruit at almost every node. The CNV arose from a recent 30.2-kb duplication at a meiotically unstable region, likely via microhomology-mediated break-induced replication. The SV set provides a snapshot of structural variations in plants and will serve as an important resource for exploring genes underlying key traits and for facilitating practical breeding in cucumber.


Subjects
Cucumis sativus/genetics; DNA Copy Number Variations/genetics; Flowers/genetics; Chromosome Mapping; Cucumis sativus/anatomy & histology; Flowers/anatomy & histology; Genome, Plant/genetics; Genome-Wide Association Study; Phylogeny
11.
Nature ; 490(7418): 49-54, 2012 Oct 04.
Article in English | MEDLINE | ID: mdl-22992520

ABSTRACT

The Pacific oyster Crassostrea gigas belongs to one of the most species-rich but genomically poorly explored phyla, the Mollusca. Here we report the sequencing and assembly of the oyster genome using short reads and a fosmid-pooling strategy, along with transcriptomes of development and stress response and the proteome of the shell. The oyster genome is highly polymorphic and rich in repetitive sequences, with some transposable elements still actively shaping variation. Transcriptome studies reveal an extensive set of genes responding to environmental stress. The expansion of genes coding for heat shock protein 70 and inhibitors of apoptosis is probably central to the oyster's adaptation to sessile life in the highly stressful intertidal zone. Our analyses also show that shell formation in molluscs is more complex than currently understood and involves extensive participation of cells and their exosomes. The oyster genome sequence fills a void in our understanding of the Lophotrochozoa.


Subjects
Adaptation, Physiological/genetics; Animal Shells/growth & development; Crassostrea/genetics; Genome/genetics; Stress, Physiological/physiology; Animal Shells/chemistry; Animals; Apoptosis Regulatory Proteins/genetics; DNA Transposable Elements/genetics; Evolution, Molecular; Female; Gene Expression Regulation, Developmental/genetics; Genes, Homeobox/genetics; Genomics; HSP70 Heat-Shock Proteins/genetics; Humans; Larva/genetics; Larva/growth & development; Mass Spectrometry; Molecular Sequence Annotation; Molecular Sequence Data; Polymorphism, Genetic/genetics; Repetitive Sequences, Nucleic Acid/genetics; Sequence Analysis, DNA; Stress, Physiological/genetics; Transcriptome/genetics
12.
BMC Bioinformatics ; 18(Suppl 12): 408, 2017 Oct 16.
Article in English | MEDLINE | ID: mdl-29072142

ABSTRACT

BACKGROUND: The recent release of the gene-targeted metagenomics assembler Xander has demonstrated that using a trained Hidden Markov Model (HMM) to guide the traversal of a de Bruijn graph gives a clear advantage over other assembly methods. Xander, as a pilot study, has a lot of room for improvement. Apart from its slow speed, Xander uses only one k-mer size for graph construction, and whatever the choice of k, it compromises either sensitivity or accuracy. Xander uses a Bloom-filter representation of the de Bruijn graph to achieve a lower memory footprint. Bloom filters bring in false positives, and it is not clear how this would impact the quality of assembly. Xander does not keep track of the multiplicity of k-mers, which would have been an effective way to differentiate between erroneous k-mers and correct k-mers. RESULTS: In this paper, we present a new gene-targeted assembler, MegaGTA, which attempts to improve on Xander in several aspects. Quality-wise, it utilizes iterative de Bruijn graphs to take full advantage of multiple k-mer sizes to make the best of both sensitivity and accuracy. Computation-wise, it employs succinct de Bruijn graphs (SdBG) to achieve a low memory footprint and high speed (the latter benefits from a highly efficient parallel algorithm for constructing SdBG). Unlike Bloom filters, an SdBG is an exact representation of a de Bruijn graph. It enables MegaGTA to avoid false-positive contigs and to easily incorporate the multiplicity of k-mers for building a better HMM. We have compared MegaGTA and Xander on an HMP-defined mock metagenomic dataset, and showed that MegaGTA excelled in both sensitivity and accuracy. On a large rhizosphere soil metagenomic sample (327 Gbp), MegaGTA produced 9.7-19.3% more contigs than Xander, and these contigs were assigned to 10-25% more gene references. In our experiments, MegaGTA, depending on the number of k-mers used, is two to ten times faster than Xander.
CONCLUSION: MegaGTA improves on the algorithm of Xander and achieves higher sensitivity, accuracy and speed. Moreover, it is capable of assembling gene sequences from ultra-large metagenomic datasets. Its source code is freely available at https://github.com/HKU-BAL/megagta .
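The abstract's point that k-mer multiplicity helps separate erroneous k-mers from correct ones can be sketched with a plain Python counter. This is not MegaGTA's succinct de Bruijn graph. The reads, the value of k, the multiplicity threshold, and the function names are all hypothetical and chosen only to show the idea.

```python
from collections import Counter


def count_kmers(reads, k):
    """Count how many times each k-mer occurs across all reads."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def solid_kmers(counts, min_multiplicity):
    """Keep k-mers seen at least min_multiplicity times; rarer k-mers are
    more likely to stem from sequencing errors."""
    return {kmer for kmer, n in counts.items() if n >= min_multiplicity}


# Toy reads: the third read differs from the others at one base.
reads = ["ACGTAC", "ACGTAC", "ACGTTC"]
counts = count_kmers(reads, k=4)
print(sorted(solid_kmers(counts, min_multiplicity=2)))
```

An exact structure keeps these counts alongside each k-mer, whereas a Bloom filter only answers approximate membership queries and so cannot supply them.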


Subjects
Algorithms; Genes; Metagenomics/methods; Software; Databases, Genetic; Humans; Pilot Projects; Reference Standards; Rhizosphere; Soil; Statistics as Topic
13.
Methods ; 102: 3-11, 2016 06 01.
Article in English | MEDLINE | ID: mdl-27012178

ABSTRACT

The study of metagenomics has benefited greatly from low-cost and high-throughput sequencing technologies, yet the tremendous amount of data generated makes analyses such as de novo assembly consume too many computational resources. In late 2014 we released MEGAHIT v0.1 (together with a brief note of Li et al. (2015) [1]), which is the first NGS metagenome assembler that can assemble genome sequences from metagenomic datasets of hundreds of giga base-pairs (bp) in a time- and memory-efficient manner on a single server. The core of MEGAHIT is an efficient parallel algorithm for constructing succinct de Bruijn graphs (SdBG), implemented on a graphics processing unit (GPU). The software has been well received by the assembly community, and there is interest in how to adapt the algorithms to integrate popular assembly practices so as to improve the assembly quality, as well as how to speed up the software using better CPU-based algorithms (instead of GPU). In this paper we first describe the details of the core algorithms in MEGAHIT v0.1, and then we show the new modules that upgrade MEGAHIT to version v1.0, which gives better assembly quality, runs faster and uses less memory. For the Iowa Prairie Soil dataset (252 Gbp after quality trimming), the assembly quality of MEGAHIT v1.0, when compared with v0.1, shows a significant improvement, namely, a 36% increase in assembly size and 23% in N50. More interestingly, MEGAHIT v1.0 is no slower than before (even running with the extra modules). This is primarily due to a new CPU-based algorithm for SdBG construction that is faster and requires less memory. Using CPU only, MEGAHIT v1.0 can assemble the Iowa Prairie Soil sample in about 43 h, reducing the running time of v0.1 by at least 25% and memory usage by up to 50%. MEGAHIT v1.0, exhibiting a smaller memory footprint, can process even larger datasets. The Kansas Prairie Soil sample (484 Gbp), the largest publicly available dataset, can now be assembled using no more than 500 GB of memory in 7.5 days. The assemblies of these datasets (and other large metagenomic datasets), as well as the software, are available at the website https://hku-bal.github.io/megabox.
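The N50 statistic cited in the abstract as an assembly quality measure can be computed as in the following minimal Python sketch. It uses the standard definition: the length of the contig at which the running total of lengths, taken from longest to shortest, first reaches half of the total assembly size. The function name and the contig lengths are illustrative.

```python
def n50(contig_lengths):
    """Length L such that contigs of length >= L cover at least half of the
    total assembly size. Returns 0 for an empty assembly."""
    lengths = sorted(contig_lengths, reverse=True)
    half_total = sum(lengths) / 2
    running = 0
    for length in lengths:
        running += length
        if running >= half_total:
            return length
    return 0


# Illustrative contig lengths (in bp); total is 20, so half is 10.
print(n50([2, 3, 4, 5, 6]))
```

Because N50 depends on the total assembly size, comparing it between assemblers is most meaningful when the assemblies are of similar size, which is why the abstract reports both figures together.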


Subjects
Metagenome; Sequence Analysis/methods; Software; Algorithms; Datasets as Topic; Metagenomics/methods; Soil
14.
Nature ; 470(7332): 59-65, 2011 Feb 03.
Article in English | MEDLINE | ID: mdl-21293372

ABSTRACT

Genomic structural variants (SVs) are abundant in humans, differing from other forms of variation in extent, origin and functional impact. Despite progress in SV characterization, the nucleotide resolution architecture of most SVs remains unknown. We constructed a map of unbalanced SVs (that is, copy number variants) based on whole genome DNA sequencing data from 185 human genomes, integrating evidence from complementary SV discovery approaches with extensive experimental validations. Our map encompassed 22,025 deletions and 6,000 additional SVs, including insertions and tandem duplications. Most SVs (53%) were mapped to nucleotide resolution, which facilitated analysing their origin and functional impact. We examined numerous whole and partial gene deletions with a genotyping approach and observed a depletion of gene disruptions amongst high frequency deletions. Furthermore, we observed differences in the size spectra of SVs originating from distinct formation mechanisms, and constructed a map of SV hotspots formed by common mechanisms. Our analytical framework and SV map serve as a resource for sequencing-based association studies.


Subjects
DNA Copy Number Variations/genetics; Genetics, Population; Genome, Human/genetics; Genomics; Gene Duplication/genetics; Genetic Predisposition to Disease/genetics; Genotype; Humans; Mutagenesis, Insertional/genetics; Reproducibility of Results; Sequence Analysis, DNA; Sequence Deletion/genetics
15.
BMC Genomics ; 17 Suppl 5: 499, 2016 08 31.
Article in English | MEDLINE | ID: mdl-27586129

ABSTRACT

BACKGROUND: De novo genome assembly using NGS data remains a computation-intensive task, especially for large genomes. In practice, efficiency is often a primary concern and favors using a more efficient assembler like SOAPdenovo2. Yet SOAPdenovo2, based on the de Bruijn graph, fails to take full advantage of longer NGS reads (say, 150 bp to 250 bp from Illumina HiSeq and MiSeq). Assemblers that are based on string graphs (e.g., SGA), though less popular and also very slow, are more favorable for longer reads. METHODS: This paper presents a new de novo assembler called BASE. It enhances the classic seed-extension approach by indexing the reads efficiently to generate adaptive seeds that have a high probability of appearing uniquely in the genome. Such seeds form the basis for BASE to build extension trees and then to use reverse validation to remove the branches based on read coverage and paired-end information, resulting in high-quality consensus sequences of reads sharing the seeds. Such consensus sequences are then extended to contigs. RESULTS: Experiments on two bacterial and four human datasets show the advantage of BASE in both contig quality and speed in dealing with longer reads. In the experiment on bacteria, two datasets with read lengths of 100 bp and 250 bp were used. Especially for the 250 bp dataset, BASE gives much better quality than SOAPdenovo2 and SGA and is similar to SPAdes. Regarding speed, BASE is consistently a few times faster than SPAdes and SGA, but still slower than SOAPdenovo2. BASE and SOAPdenovo2 are further compared using human datasets with read lengths of 100 bp, 150 bp and 250 bp. BASE shows a higher N50 for all datasets, and the improvement becomes more significant when the read length reaches 250 bp. In addition, BASE is more memory-efficient than SOAPdenovo2 when the sequencing data contain errors. CONCLUSIONS: BASE is a practically efficient tool for constructing contigs, with significant improvement in quality for long NGS reads. It is relatively easy to extend BASE to include scaffolding.


Subjects
High-Throughput Nucleotide Sequencing; Sequence Analysis, DNA/methods; Algorithms; Humans; Software; Staphylococcus aureus/genetics; Vibrio parahaemolyticus/genetics
16.
Bioinformatics ; 31(10): 1674-6, 2015 May 15.
Article in English | MEDLINE | ID: mdl-25609793

ABSTRACT

MEGAHIT is an NGS de novo assembler for assembling large and complex metagenomics data in a time- and cost-efficient manner. It finished assembling a soil metagenomics dataset with 252 Gbp in 44.1 and 99.6 h on a single computing node with and without a graphics processing unit, respectively. MEGAHIT assembles the data as a whole, i.e. no pre-processing like partitioning and normalization was needed. When compared with previous methods on assembling the soil data, MEGAHIT generated a threefold larger assembly, with longer contig N50 and average contig length; furthermore, 55.8% of the reads were aligned to the assembly, giving a fourfold improvement.


Subjects
Metagenomics/methods; Algorithms; High-Throughput Nucleotide Sequencing; Software; Soil
17.
Bioinformatics ; 31(24): 4035-7, 2015 Dec 15.
Article in English | MEDLINE | ID: mdl-26315902

ABSTRACT

Rapid advances in next-generation sequencing technology have led to the integration of genetic information with clinical care. The genetic basis of diseases and of response to drugs provides new ways of disease diagnosis and safer drug usage. This integration reveals the urgent need for effective and accurate tools to analyze genetic variants. Due to the number and diversity of sources for annotation, automating variant analysis is a challenging task. Here, we present database.bio, a web application that combines variant annotation, prioritization and visualization so as to support insight into individual genetic characteristics. It enhances annotation speed by preprocessing data on a supercomputer, and reduces database space via a unified database representation with compressed fields. AVAILABILITY AND IMPLEMENTATION: Freely available at https://database.bio.


Subjects
Databases, Nucleic Acid; Genetic Variation; Software; High-Throughput Nucleotide Sequencing; Humans; Internet; Molecular Sequence Annotation
18.
BMC Bioinformatics ; 16 Suppl 7: S10, 2015.
Article in English | MEDLINE | ID: mdl-25952019

ABSTRACT

BACKGROUND: Short-read aligners have recently gained a lot of speed by exploiting the massive parallelism of GPUs. An emerging alternative to GPUs is Intel MIC; Tianhe-2, the supercomputer currently at the top of the TOP500, is built with 48,000 MIC boards to offer ~55 PFLOPS. The CPU-like architecture of MIC allows CPU-based software to be parallelized easily; however, the performance is often inferior to GPU counterparts, as an MIC card contains only ~60 cores (while a GPU card typically has over a thousand cores). RESULTS: To better utilize MIC-enabled computers for NGS data analysis, we developed a new short-read aligner, MICA, that is optimized in view of MIC's limitation and the extra parallelism inside each MIC core. By utilizing the 512-bit vector units in the MIC and implementing a new seeding strategy, experiments on aligning 150 bp paired-end reads show that MICA using one MIC card is 4.9 times faster than BWA-MEM (using 6 cores of a top-end CPU), and slightly faster than SOAP3-dp (using a GPU). Furthermore, MICA's simplicity allows very efficient scale-up when multiple MIC cards are used in a node (3 cards give a 14.1-fold speedup over BWA-MEM). SUMMARY: MICA can be readily used by MIC-enabled supercomputers for production purposes. We have tested MICA on Tianhe-2 with 90 WGS samples (17.47 Tera-bases), which can be aligned in an hour using 400 nodes. MICA has impressive performance even though MIC is only in its initial stage of development. AVAILABILITY AND IMPLEMENTATION: MICA's source code is freely available at http://sourceforge.net/projects/mica-aligner under GPL v3. SUPPLEMENTARY INFORMATION: Supplementary information is available as "Additional File 1". Datasets are available at www.bio8.cs.hku.hk/dataset/mica.


Subjects
Computational Biology/methods , High-Throughput Nucleotide Sequencing , Sequence Alignment/methods , Sequence Analysis, DNA/methods , Software , Algorithms , Humans , Programming Languages
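
The MICA abstract above gives two speedups over BWA-MEM: 4.9-fold with one MIC card and 14.1-fold with three. A rough sketch of what "very efficient scale-up" means in numbers follows. It assumes both figures share the same BWA-MEM baseline, and the function name is hypothetical; the abstract itself does not report an efficiency figure.

```python
def scaling_efficiency(speedup_one_card, speedup_n_cards, n_cards):
    """Fraction of ideal linear scaling achieved when going from 1 to n cards."""
    relative_speedup = speedup_n_cards / speedup_one_card
    return relative_speedup / n_cards


# Speedups over BWA-MEM as quoted in the abstract (1 card and 3 cards).
one_card = 4.9
three_cards = 14.1

relative = three_cards / one_card            # about 2.88x from tripling the cards
efficiency = scaling_efficiency(one_card, three_cards, 3)
print(f"relative speedup: {relative:.2f}x, efficiency: {efficiency:.1%}")
```

Under that assumption, three cards deliver roughly 96% of ideal linear scaling.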
19.
Bioinformatics ; 30(17): 2498-500, 2014 Sep 01.
Article in English | MEDLINE | ID: mdl-24833803

ABSTRACT

UNLABELLED: Recent advances in high-throughput sequencing technologies have enabled us to sequence a large number of cancer samples to reveal novel insights into oncogenetic mechanisms. However, intratumoral heterogeneity, normal cell contamination and insufficient sequencing depth together pose a challenge for detecting somatic mutations. Here we propose a fast and accurate program for detecting somatic single-nucleotide variations (SNVs), FaSD-somatic. The performance of FaSD-somatic is extensively assessed on various types of cancer against several state-of-the-art somatic SNV detection programs. Benchmarked by somatic SNVs from either existing databases or de novo higher-depth sequencing data, FaSD-somatic has the best overall performance. Furthermore, FaSD-somatic is efficient: it finishes somatic SNV calling within 14 h on 50X whole genome sequencing data in paired samples. AVAILABILITY AND IMPLEMENTATION: The program, datasets and supplementary files are available at http://jjwanglab.org/FaSD-somatic/. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.


Subjects
Algorithms , Genetic Variation , High-Throughput Nucleotide Sequencing , Neoplasms/genetics , Databases, Nucleic Acid , Genomics , Humans
20.
Bioinformatics ; 30(12): 1660-6, 2014 Jun 15.
Article in English | MEDLINE | ID: mdl-24532719

ABSTRACT

MOTIVATION: Transcriptome sequencing has long been the favored method for quickly and inexpensively obtaining a large number of gene sequences from an organism with no reference genome. Owing to the rapid increase in throughput and decrease in cost of next-generation sequencing, RNA-Seq in particular has become the method of choice. However, the very short reads (e.g. 2 × 90 bp paired ends) from next-generation sequencing make de novo assembly to recover complete or full-length transcript sequences an algorithmic challenge. RESULTS: Here, we present SOAPdenovo-Trans, a de novo transcriptome assembler designed specifically for RNA-Seq. We evaluated its performance on transcriptome datasets from rice and mouse. Using as our benchmarks the known transcripts from these well-annotated genomes (sequenced a decade ago), we assessed how SOAPdenovo-Trans and two other popular transcriptome assemblers handled such practical issues as alternative splicing and variable expression levels. Our conclusion is that SOAPdenovo-Trans provides higher contiguity, lower redundancy and faster execution. AVAILABILITY AND IMPLEMENTATION: Source code and user manual are available at http://sourceforge.net/projects/soapdenovotrans/.


Subjects
Algorithms , Gene Expression Profiling/methods , High-Throughput Nucleotide Sequencing/methods , Sequence Analysis, RNA/methods , Alternative Splicing , Animals , Genomics/methods , Mice , Oryza/genetics