Results 1 - 20 of 60
1.
Bioinformatics ; 40(3), 2024 Mar 04.
Article in English | MEDLINE | ID: mdl-38377404

ABSTRACT

MOTIVATION: Seeding is a rate-limiting stage in sequence alignment for next-generation sequencing reads. The existing optimization algorithms typically utilize hardware and machine-learning techniques to accelerate seeding. However, an efficient solution provided by professional next-generation sequencing compressors has been largely overlooked by far. In addition to achieving remarkable compression ratios by reordering reads, these compressors provide valuable insights for downstream alignment that reveal the repetitive computations accounting for more than 50% of seeding procedure in commonly used short read aligner BWA-MEM at typical sequencing coverage. Nevertheless, the exploited redundancy information is not fully realized or utilized. RESULTS: In this study, we present a compressive seeding algorithm, named CompSeed, to fill the gap. CompSeed, in collaboration with the existing reordering-based compression tools, finishes the BWA-MEM seeding process in about half the time by caching all intermediate seeding results in compact trie structures to directly answer repetitive inquiries that frequently cause random memory accesses. Furthermore, CompSeed demonstrates better performance as sequencing coverage increases, as it focuses solely on the small informative portion of sequencing reads after compression. The innovative strategy highlights the promising potential of integrating sequence compression and alignment to tackle the ever-growing volume of sequencing data. AVAILABILITY AND IMPLEMENTATION: CompSeed is available at https://github.com/i-xiaohu/CompSeed.


Subjects
Data Compression , Software , Sequence Analysis, DNA/methods , Algorithms , Data Compression/methods , Computers , High-Throughput Nucleotide Sequencing/methods
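
The caching idea in the CompSeed abstract above (answering repeated seeding queries from a trie instead of recomputing them) can be illustrated with a minimal sketch. This is not CompSeed's implementation: `SeedCache`, `toy_lookup`, and the per-prefix query model are assumptions made for illustration, and the toy lookup merely stands in for an expensive index query.

```python
class TrieNode:
    __slots__ = ("children", "result")

    def __init__(self):
        self.children = {}
        self.result = None  # cached result for the prefix ending at this node


class SeedCache:
    """Trie that memoises one lookup result per distinct read prefix."""

    def __init__(self, lookup):
        self.root = TrieNode()
        self.lookup = lookup  # hypothetical expensive per-prefix query
        self.misses = 0       # number of times the expensive lookup ran

    def query(self, read):
        """Return lookup(prefix) for every prefix of `read`, reusing cached results."""
        node = self.root
        results = []
        for i, base in enumerate(read):
            child = node.children.get(base)
            if child is None:
                # First time this prefix is seen: compute and cache it.
                child = TrieNode()
                child.result = self.lookup(read[: i + 1])
                self.misses += 1
                node.children[base] = child
            node = child
            results.append(node.result)
        return results


def toy_lookup(prefix):
    # Stand-in for an index query; it only counts G/C bases.
    return sum(base in "GC" for base in prefix)


cache = SeedCache(toy_lookup)
first = cache.query("ACGTAC")
second = cache.query("ACGTTT")  # shares the prefix "ACGT" with the first read
```

In this toy run the second read triggers only two new lookups, because its first four prefixes are already cached. Reordering reads so that similar ones are adjacent, as the compressors mentioned in the abstract do, would plausibly raise the hit rate of such a cache.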
2.
Brief Bioinform ; 23(5), 2022 09 20.
Article in English | MEDLINE | ID: mdl-35901464

ABSTRACT

MOTIVATION: The associations between biomarkers and human diseases play a key role in understanding complex pathology and developing targeted therapies. Wet lab experiments for biomarker discovery are costly, laborious and time-consuming. Computational prediction methods can be used to greatly expedite the identification of candidate biomarkers. RESULTS: Here, we present a novel computational model named GTGenie for predicting the biomarker-disease associations based on graph and text features. In GTGenie, a graph attention network is utilized to characterize diverse similarities of biomarkers and diseases from heterogeneous information resources. Meanwhile, a pretrained BERT-based model is applied to learn the text-based representation of biomarker-disease relation from biomedical literature. The captured graph and text features are then integrated in a bimodal fusion network to model the hybrid entity representation. Finally, inductive matrix completion is adopted to infer the missing entries for reconstructing relation matrix, with which the unknown biomarker-disease associations are predicted. Experimental results on HMDD, HMDAD and LncRNADisease data sets showed that GTGenie can obtain competitive prediction performance with other state-of-the-art methods. AVAILABILITY: The source code of GTGenie and the test data are available at: https://github.com/Wolverinerine/GTGenie.


Subjects
Computational Biology , Software , Computational Biology/methods , Humans
3.
Bioinformatics ; 39(8), 2023 08 01.
Article in English | MEDLINE | ID: mdl-37527015

ABSTRACT

MOTIVATION: The interactions between T-cell receptors (TCR) and peptide-major histocompatibility complex (pMHC) are essential for the adaptive immune system. However, identifying these interactions can be challenging due to the limited availability of experimental data, sequence data heterogeneity, and high experimental validation costs. RESULTS: To address this issue, we develop a novel computational framework, named MIX-TPI, to predict TCR-pMHC interactions using amino acid sequences and physicochemical properties. Based on convolutional neural networks, MIX-TPI incorporates sequence-based and physicochemical-based extractors to refine the representations of TCR-pMHC interactions. Each modality is projected into modality-invariant and modality-specific representations to capture the uniformity and diversities between different features. A self-attention fusion layer is then adopted to form the classification module. Experimental results demonstrate the effectiveness of MIX-TPI in comparison with other state-of-the-art methods. MIX-TPI also shows good generalization capability on mutual exclusive evaluation datasets and a paired TCR dataset. AVAILABILITY AND IMPLEMENTATION: The source code of MIX-TPI and the test data are available at: https://github.com/Wolverinerine/MIX-TPI.


Subjects
Major Histocompatibility Complex , Peptides , Peptides/chemistry , Receptors, Antigen, T-Cell/genetics , Amino Acid Sequence , Software , Protein Binding
4.
Bioinformatics ; 38(12): 3294-3296, 2022 06 13.
Article in English | MEDLINE | ID: mdl-35579371

ABSTRACT

MOTIVATION: The data deluge of high-throughput sequencing (HTS) has posed great challenges to data storage and transfer. Many specific compression tools have been developed to solve this problem. However, most of the existing compressors are based on central processing unit (CPU) platform, which might be inefficient and expensive to handle large-scale HTS data. With the popularization of graphics processing units (GPUs), GPU-compatible sequencing data compressors become desirable to exploit the computing power of GPUs. RESULTS: We present a GPU-accelerated reference-free read compressor, namely CURC, for FASTQ files. Under a GPU-CPU heterogeneous parallel scheme, CURC implements highly efficient lossless compression of DNA stream based on the pseudogenome approach and CUDA library. CURC achieves 2-6-fold speedup of the compression with competitive compression rate, compared with other state-of-the-art reference-free read compressors. AVAILABILITY AND IMPLEMENTATION: CURC can be downloaded from https://github.com/BioinfoSZU/CURC. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.


Subjects
Algorithms , Data Compression , Sequence Analysis, DNA , High-Throughput Nucleotide Sequencing , Gene Library
5.
Small ; 18(30): e2202434, 2022 Jul.
Article in English | MEDLINE | ID: mdl-35775979

ABSTRACT

Pre-catalyst reconstruction in electrochemical processes has recently attracted intensive attention with mechanistic potentials to uncover really active species and catalytic mechanisms and advance targeted catalyst designs. Here, nickel-molybdenum oxysulfide is deliberately fabricated as pre-catalyst to present a comprehensive study on reconstruction dynamics for the oxygen evolution reaction (OER) and hydrogen evolution reaction (HER) in alkali water electrolysis. Operando Raman spectroscopy together with X-ray photoelectron spectroscopy and electron microscopy capture dynamic reconstruction including geometric, component and phase evolutions, revealing a chameleon-like reconstruction self-adaptive to OER and HER demands under oxidative and reductive conditions, respectively. The in situ generated active NiOOH and Ni species with ultrafine and porous textures exhibit superior OER and HER performance, respectively, and an electrolyzer with such two reconstructed electrodes demonstrates steady overall water splitting with an extraordinary 80% electricity-to-hydrogen (ETH) energy conversion efficiency. This work highlights dynamic reconstruction adaptability to electrochemical conditions and develops an automatic avenue toward the targeted design of advanced catalysts.

6.
Small ; 17(38): e2101671, 2021 Sep.
Article in English | MEDLINE | ID: mdl-34342939

ABSTRACT

Most transition metal-based catalysts for electrocatalytic oxygen evolution reaction (OER) undergo surface reconstruction to generate real active sites favorable for high OER performance. Herein, how to use self-reconstruction as an efficient strategy to develop novel and robust OER catalysts by designing pre-catalysts with flexible components susceptible to OER conditions is proposed. The NiFe-based layered double hydroxides (LDHs) intercalated with resoluble molybdate (MoO₄²⁻) anions in interlayers are constructed and then demonstrated to achieve complete electrochemical self-reconstruction (ECSR) into active NiFe-oxyhydroxides (NiFeOOH) beneficial to alkaline OER. Various ex situ and in situ techniques are used to capture structural evolution process including fast dissolution of MoO₄²⁻ and deep reconstruction to NiFeOOH upon simultaneous hydroxyl invasion and electro-oxidation. The obtained NiFeOOH exhibits an excellent OER performance with an overpotential of only 268 mV at 50 mA cm-1 and robust durability over 45 h, much superior to NiFe-LDH and commercial IrO₂ benchmark. This work suggests that the ECSR engineering in component-flexible precursors is a promising strategy to develop highly active OER catalysts for energy conversion.

7.
Bioinformatics ; 36(2): 578-585, 2020 01 15.
Article in English | MEDLINE | ID: mdl-31368481

ABSTRACT

MOTIVATION: Inferring gene regulatory networks from gene expression time series data is important for gaining insights into the complex processes of cell life. A popular approach is to infer Boolean networks. However, it is still a pressing open problem to infer accurate Boolean networks from experimental data that are typically short and noisy. RESULTS: To address the problem, we propose a Boolean network inference algorithm which is able to infer accurate Boolean network topology and dynamics from short and noisy time series data. The main idea is that, for each target gene, we use an And/Or tree ensemble algorithm to select prime implicants of which each is a conjunction of a set of input genes. The selected prime implicants are important features for predicting the states of the target gene. Using these important features we then infer the Boolean function of the target gene. Finally, the Boolean functions of all target genes are combined as a Boolean network. Using the data generated from artificial and real-world gene regulatory networks, we show that our algorithm can infer more accurate Boolean network topology and dynamics from short and noisy time series data than other algorithms. Our algorithm enables us to gain better insights into complex regulatory mechanisms of cell life. AVAILABILITY AND IMPLEMENTATION: Package ATEN is freely available at https://github.com/ningshi/ATEN. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.


Subjects
Gene Regulatory Networks , Trees , Algorithms , Gene Expression
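
The abstract above describes each gene's Boolean function as built from prime implicants, each a conjunction of input genes. A minimal sketch of how such functions combine into a Boolean network and produce dynamics follows. The three-gene rules, `make_dnf`, and the synchronous update scheme are assumptions for illustration; the ATEN inference step itself (the And/Or tree ensemble) is not shown.

```python
def make_dnf(implicants):
    """Build a Boolean function from a list of implicants.

    Each implicant is a dict {gene: required_value}. The function is true
    when at least one implicant is fully satisfied (an OR of ANDs).
    """
    def f(state):
        return any(all(state[g] == v for g, v in imp.items()) for imp in implicants)
    return f


# Hypothetical three-gene network; the rules are illustrative only.
network = {
    "A": make_dnf([{"C": False}]),                                    # A = NOT C
    "B": make_dnf([{"A": True, "C": False}]),                         # B = A AND NOT C
    "C": make_dnf([{"A": True, "B": True}, {"C": True, "A": True}]),  # C = (A AND B) OR (C AND A)
}


def step(state):
    """Synchronous update: every gene reads the same previous state."""
    return {gene: rule(state) for gene, rule in network.items()}


state = {"A": True, "B": False, "C": False}
trajectory = [state]
for _ in range(3):
    state = step(state)
    trajectory.append(state)
```

A time series such as `trajectory` is the kind of input an inference algorithm would start from. Here the direction is reversed, simulating from known rules, only to show what the inferred object looks like.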
8.
Entropy (Basel) ; 23(11), 2021 Oct 25.
Article in English | MEDLINE | ID: mdl-34828096

ABSTRACT

Convolutional Neural Networks (CNNs) have been widely used in video super-resolution (VSR). Most existing VSR methods focus on how to utilize the information of multiple frames, while neglecting the feature correlations of the intermediate features, thus limiting the feature expression of the models. To address this problem, we propose a novel SAA network, that is, Scale-and-Attention-Aware Networks, to apply different attention to different temporal-length streams, while further exploring both spatial and channel attention on separate streams with a newly proposed Criss-Cross Channel Attention Module (C3AM). Experiments on public VSR datasets demonstrate the superiority of our method over other state-of-the-art methods in terms of both quantitative and qualitative metrics.

9.
BMC Bioinformatics ; 20(Suppl 19): 657, 2019 Dec 24.
Article in English | MEDLINE | ID: mdl-31870274

ABSTRACT

BACKGROUND: Synthetic lethality has attracted a lot of attentions in cancer therapeutics due to its utility in identifying new anticancer drug targets. Identifying synthetic lethal (SL) interactions is the key step towards the exploration of synthetic lethality in cancer treatment. However, biological experiments are faced with many challenges when identifying synthetic lethal interactions. Thus, it is necessary to develop computational methods which could serve as useful complements to biological experiments. RESULTS: In this paper, we propose a novel graph regularized self-representative matrix factorization (GRSMF) algorithm for synthetic lethal interaction prediction. GRSMF first learns the self-representations from the known SL interactions and further integrates the functional similarities among genes derived from Gene Ontology (GO). It can then effectively predict potential SL interactions by leveraging the information provided by known SL interactions and functional annotations of genes. Extensive experiments on the synthetic lethal interaction data downloaded from SynLethDB database demonstrate the superiority of our GRSMF in predicting potential synthetic lethal interactions, compared with other competing methods. Moreover, case studies of novel interactions are conducted in this paper for further evaluating the effectiveness of GRSMF in synthetic lethal interaction prediction. CONCLUSIONS: In this paper, we demonstrate that by adaptively exploiting the self-representation of original SL interaction data, and utilizing functional similarities among genes to enhance the learning of self-representation matrix, our GRSMF could predict potential SL interactions more accurately than other state-of-the-art SL interaction prediction methods.


Subjects
Neoplasms/genetics , Algorithms , Antineoplastic Agents/therapeutic use , Humans , Neoplasms/drug therapy
10.
Bioinformatics ; 34(7): 1099-1107, 2018 04 01.
Article in English | MEDLINE | ID: mdl-29126180

ABSTRACT

Motivation: The identification of repetitive elements is important in genome assembly and phylogenetic analyses. The existing de novo repeat identification methods exploiting the use of short reads are impotent in identifying long repeats. Since long reads are more likely to cover repeat regions completely, using long reads is more favorable for recognizing long repeats. Results: In this study, we propose a novel de novo repeat elements identification method namely RepLong based on PacBio long reads. Given that the reads mapped to the repeat regions are highly overlapped with each other, the identification of repeat elements is equivalent to the discovery of consensus overlaps between reads, which can be further cast into a community detection problem in the network of read overlaps. In RepLong, we first construct a network of read overlaps based on pair-wise alignment of the reads, where each vertex indicates a read and an edge indicates a substantial overlap between the corresponding two reads. Secondly, the communities whose intra connectivity is greater than the inter connectivity are extracted based on network modularity optimization. Finally, representative reads in each community are extracted to form the repeat library. Comparison studies on Drosophila melanogaster and human long read sequencing data with genome-based and short-read-based methods demonstrate the efficiency of RepLong in identifying long repeats. RepLong can handle lower coverage data and serve as a complementary solution to the existing methods to promote the repeat identification performance on long-read sequencing data. Availability and implementation: The software of RepLong is freely available at https://github.com/ruiguo-bio/replong. Contact: ywsun@szu.edu.cn or zhuzx@szu.edu.cn. Supplementary information: Supplementary data are available at Bioinformatics online.


Subjects
Genome , Phylogeny , Repetitive Sequences, Nucleic Acid , Sequence Analysis, DNA/methods , Software , Algorithms , Animals , Drosophila melanogaster/genetics , Genomics/methods , High-Throughput Nucleotide Sequencing/methods , Humans
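
The RepLong abstract above casts repeat identification as community detection on a read-overlap network, scored by network modularity. As a minimal sketch, the standard Newman modularity of a given partition can be computed as below. The overlap graph and read names are hypothetical, and the optimisation procedure RepLong actually uses to search for communities is not reproduced here.

```python
def modularity(edges, communities):
    """Newman modularity Q of a partition of an undirected, unweighted graph."""
    m = len(edges)
    label = {node: i for i, group in enumerate(communities) for node in group}
    internal = [0] * len(communities)  # edges with both ends in the community
    degree = [0] * len(communities)    # summed degree of the community's nodes
    for u, v in edges:
        degree[label[u]] += 1
        degree[label[v]] += 1
        if label[u] == label[v]:
            internal[label[u]] += 1
    return sum(l / m - (d / (2 * m)) ** 2 for l, d in zip(internal, degree))


# Hypothetical overlap graph: reads r1-r3 and r4-r6 overlap heavily within
# each group, with a single overlap (r3, r4) between the groups.
overlaps = [("r1", "r2"), ("r2", "r3"), ("r1", "r3"),
            ("r4", "r5"), ("r5", "r6"), ("r4", "r6"),
            ("r3", "r4")]

q_split = modularity(overlaps, [{"r1", "r2", "r3"}, {"r4", "r5", "r6"}])
q_merged = modularity(overlaps, [{"r1", "r2", "r3", "r4", "r5", "r6"}])
```

Splitting the reads into the two dense groups scores higher than keeping them in one community, which matches the abstract's criterion of intra-community connectivity exceeding inter-community connectivity.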
11.
PLoS Comput Biol ; 13(3): e1005455, 2017 03.
Article in English | MEDLINE | ID: mdl-28339468

ABSTRACT

In the recent few years, an increasing number of studies have shown that microRNAs (miRNAs) play critical roles in many fundamental and important biological processes. As one of pathogenetic factors, the molecular mechanisms underlying human complex diseases still have not been completely understood from the perspective of miRNA. Predicting potential miRNA-disease associations makes important contributions to understanding the pathogenesis of diseases, developing new drugs, and formulating individualized diagnosis and treatment for diverse human complex diseases. Instead of only depending on expensive and time-consuming biological experiments, computational prediction models are effective by predicting potential miRNA-disease associations, prioritizing candidate miRNAs for the investigated diseases, and selecting those miRNAs with higher association probabilities for further experimental validation. In this study, Path-Based MiRNA-Disease Association (PBMDA) prediction model was proposed by integrating known human miRNA-disease associations, miRNA functional similarity, disease semantic similarity, and Gaussian interaction profile kernel similarity for miRNAs and diseases. This model constructed a heterogeneous graph consisting of three interlinked sub-graphs and further adopted depth-first search algorithm to infer potential miRNA-disease associations. As a result, PBMDA achieved reliable performance in the frameworks of both local and global LOOCV (AUCs of 0.8341 and 0.9169, respectively) and 5-fold cross validation (average AUC of 0.9172). In the cases studies of three important human diseases, 88% (Esophageal Neoplasms), 88% (Kidney Neoplasms) and 90% (Colon Neoplasms) of top-50 predicted miRNAs have been manually confirmed by previous experimental reports from literatures. 
Through the comparison performance between PBMDA and other previous models in case studies, the reliable performance also demonstrates that PBMDA could serve as a powerful computational tool to accelerate the identification of disease-miRNA associations.


Subjects
Biomarkers, Tumor/genetics , Genetic Association Studies , MicroRNAs/genetics , Models, Statistical , Neoplasms/epidemiology , Neoplasms/genetics , Computer Simulation , Genetic Predisposition to Disease/epidemiology , Genetic Predisposition to Disease/genetics , Humans , Models, Genetic , Prevalence , Prognosis , Risk Assessment/methods , Risk Factors , Signal Transduction/genetics
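
The PBMDA abstract above mentions a depth-first search over a heterogeneous graph to infer miRNA-disease associations. A minimal sketch of path-based scoring with depth-first search is given below. The graph, the node names, the `max_len` limit, and the scoring rule (summing products of edge weights over simple paths) are all assumptions for illustration and may differ from the published model's formula.

```python
def path_score(graph, source, target, max_len=3):
    """Sum the products of edge weights over all simple paths from source
    to target with at most `max_len` edges, found by depth-first search."""
    total = 0.0

    def dfs(node, visited, product, depth):
        nonlocal total
        if node == target:
            total += product  # a complete path ends at the target
            return
        if depth == max_len:
            return            # path-length limit reached
        for nxt, weight in graph.get(node, {}).items():
            if nxt not in visited:
                dfs(nxt, visited | {nxt}, product * weight, depth + 1)

    dfs(source, {source}, 1.0, 0)
    return total


# Hypothetical weighted graph: miRNA-miRNA similarity, disease-disease
# similarity, and known miRNA-disease associations (weight 1.0).
graph = {
    "miR-a": {"miR-b": 0.5},
    "miR-b": {"miR-a": 0.5, "disease-x": 1.0, "disease-y": 1.0},
    "disease-x": {"miR-b": 1.0, "disease-y": 0.4},
    "disease-y": {"miR-b": 1.0, "disease-x": 0.4},
}

score = path_score(graph, "miR-a", "disease-x")
```

Two paths contribute in this toy graph: one through the similar miRNA directly to the disease, and a longer one that also passes through a similar disease. Candidate miRNAs could then be ranked per disease by such a score.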
12.
BMC Bioinformatics ; 18(1): 179, 2017 Mar 20.
Article in English | MEDLINE | ID: mdl-28320326

ABSTRACT

BACKGROUND: The rapid progress of high-throughput DNA sequencing techniques has dramatically reduced the costs of whole genome sequencing, which leads to revolutionary advances in gene industry. The explosively increasing volume of raw data outpaces the decreasing disk cost and the storage of huge sequencing data has become a bottleneck of downstream analyses. Data compression is considered as a solution to reduce the dependency on storage. Efficient sequencing data compression methods are highly demanded. RESULTS: In this article, we present a lossless reference-based compression method namely LW-FQZip 2 targeted at FASTQ files. LW-FQZip 2 is improved from LW-FQZip 1 by introducing more efficient coding scheme and parallelism. Particularly, LW-FQZip 2 is equipped with a light-weight mapping model, bitwise prediction by partial matching model, arithmetic coding, and multi-threading parallelism. LW-FQZip 2 is evaluated on both short-read and long-read data generated from various sequencing platforms. The experimental results show that LW-FQZip 2 is able to obtain promising compression ratios at reasonable time and memory space costs. CONCLUSIONS: The competence enables LW-FQZip 2 to serve as a candidate tool for archival or space-sensitive applications of high-throughput DNA sequencing data. LW-FQZip 2 is freely available at http://csse.szu.edu.cn/staff/zhuzx/LWFQZip2 and https://github.com/Zhuzxlab/LW-FQZip2 .


Subjects
Data Compression/methods , High-Throughput Nucleotide Sequencing/methods , Sequence Alignment/methods , Sequence Analysis, DNA/methods
13.
BMC Genomics ; 18(Suppl 2): 209, 2017 03 14.
Article in English | MEDLINE | ID: mdl-28361692

ABSTRACT

BACKGROUND: Active modules are connected regions in biological network which show significant changes in expression over particular conditions. The identification of such modules is important since it may reveal the regulatory and signaling mechanisms that associate with a given cellular response. RESULTS: In this paper, we propose a novel active module identification algorithm based on a memetic algorithm. We propose a novel encoding/decoding scheme to ensure the connectedness of the identified active modules. Based on the scheme, we also design and incorporate a local search operator into the memetic algorithm to improve its performance. CONCLUSION: The effectiveness of proposed algorithm is validated on both small and large protein interaction networks.


Subjects
Algorithms , Gene Regulatory Networks , Protein Interaction Maps , Humans , Protein Interaction Mapping/statistics & numerical data , Signal Transduction
14.
Brief Bioinform ; 16(1): 1-15, 2015 Jan.
Article in English | MEDLINE | ID: mdl-24300111

ABSTRACT

The exponential growth of high-throughput DNA sequence data has posed great challenges to genomic data storage, retrieval and transmission. Compression is a critical tool to address these challenges, where many methods have been developed to reduce the storage size of the genomes and sequencing data (reads, quality scores and metadata). However, genomic data are being generated faster than they could be meaningfully analyzed, leaving a large scope for developing novel compression algorithms that could directly facilitate data analysis beyond data transfer and storage. In this article, we categorize and provide a comprehensive review of the existing compression methods specialized for genomic data and present experimental results on compression ratio, memory usage, time for compression and decompression. We further present the remaining challenges and potential directions for future research.


Subjects
Data Compression/methods , High-Throughput Nucleotide Sequencing/methods , Molecular Sequence Data
15.
Opt Express ; 25(3): 1723-1731, 2017 Feb 06.
Article in English | MEDLINE | ID: mdl-29519026

ABSTRACT

The conductivity of poly(3,4-ethylene dioxythiophene)/poly(4-styrenesulfonate) (PEDOT/PSS) is significantly enhanced on adding some organic solvent such as ethylene glycol (EG). In this paper, the optoelectronic properties of EG doped PEDOT/PSS on transmission and anti-reflection effects are investigated in detail by terahertz time domain spectroscopy (THz-TDS). The transmission line circuit theory gives us an insight into the THz transmission mechanisms of the main and second pulses. In particular, we show that the conductivities of 10% EG doped PEDOT/PSS are nearly frequency independent from 0.3 to 1.5 THz. To demonstrate applications of this property, we design and fabricate broadband terahertz neutral density filters and anti-reflection coatings based on 10% EG doped PEDOT/PSS thin films with varying thickness. Our measurements highlight the capability of THz-TDS to characterize the conductivity of EG doped PEDOT/PSS, which is essential for broadband optoelectronic devices in THz region.

16.
Bioinformatics ; 31(3): 426-8, 2015 Feb 01.
Article in English | MEDLINE | ID: mdl-25282641

ABSTRACT

SUMMARY: Exhaustive mapping of next-generation sequencing data to a set of relevant reference sequences becomes an important task in pathogen discovery and metagenomic classification. However, the runtime and memory usage increase as the number of reference sequences and the repeat content among these sequences increase. In many applications, read mapping time dominates the entire application. We developed CompMap, a reference-based compression program, to speed up this process. CompMap enables the generation of a non-redundant representative sequence for the input sequences. We have demonstrated that reads can be mapped to this representative sequence with a much reduced time and memory usage, and the mapping to the original reference sequences can be recovered with high accuracy. AVAILABILITY AND IMPLEMENTATION: CompMap is implemented in C and freely available at http://csse.szu.edu.cn/staff/zhuzx/CompMap/. CONTACT: xiaoyang@broadinstitute.org SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.


Subjects
Data Compression/methods , Genome, Human/genetics , High-Throughput Nucleotide Sequencing , Sequence Alignment/methods , Sequence Analysis, DNA/methods , Software , Computational Biology , Humans , Reference Values
17.
BMC Bioinformatics ; 16: 188, 2015 Jun 09.
Article in English | MEDLINE | ID: mdl-26051252

ABSTRACT

BACKGROUND: The exponential growth of next generation sequencing (NGS) data has posed big challenges to data storage, management and archive. Data compression is one of the effective solutions, where reference-based compression strategies can typically achieve superior compression ratios compared to the ones not relying on any reference. RESULTS: This paper presents a lossless light-weight reference-based compression algorithm namely LW-FQZip to compress FASTQ data. The three components of any given input, i.e., metadata, short reads and quality score strings, are first parsed into three data streams in which the redundancy information are identified and eliminated independently. Particularly, well-designed incremental and run-length-limited encoding schemes are utilized to compress the metadata and quality score streams, respectively. To handle the short reads, LW-FQZip uses a novel light-weight mapping model to fast map them against external reference sequence(s) and produce concise alignment results for storage. The three processed data streams are then packed together with some general purpose compression algorithms like LZMA. LW-FQZip was evaluated on eight real-world NGS data sets and achieved compression ratios in the range of 0.111-0.201. This is comparable or superior to other state-of-the-art lossless NGS data compression algorithms. CONCLUSIONS: LW-FQZip is a program that enables efficient lossless FASTQ data compression. It contributes to the state of art applications for NGS data storage and transmission. LW-FQZip is freely available online at: http://csse.szu.edu.cn/staff/zhuzx/LWFQZip.


Subjects
Algorithms , Data Compression/methods , High-Throughput Nucleotide Sequencing/methods , Sequence Analysis, DNA/methods , Software , Humans
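
The LW-FQZip abstract above says quality score strings are compressed with a run-length-limited encoding scheme. The sketch below shows one plausible reading of that phrase: plain run-length encoding in which each run is capped at a maximum length. The cap, the `(symbol, count)` representation, and the sample quality string are assumptions, not details taken from LW-FQZip.

```python
def rle_encode(quality, max_run=255):
    """Run-length encode a string as (symbol, count) pairs, capping each run."""
    runs = []
    for ch in quality:
        if runs and runs[-1][0] == ch and runs[-1][1] < max_run:
            runs[-1] = (ch, runs[-1][1] + 1)  # extend the current run
        else:
            runs.append((ch, 1))              # new symbol, or the cap was reached
    return runs


def rle_decode(runs):
    """Invert rle_encode, restoring the original string exactly (lossless)."""
    return "".join(ch * count for ch, count in runs)


quality = "IIIIIIIIHHHH##"  # hypothetical FASTQ quality string
runs = rle_encode(quality, max_run=5)
```

Capping the run length keeps every count within a fixed number of bits, at the cost of splitting long runs into several pairs. The compression ratios of 0.111-0.201 reported in the abstract are presumably compressed size divided by original size, which is also an assumption here.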
18.
Bioinformatics ; 30(4): 581-3, 2014 Feb 15.
Article in English | MEDLINE | ID: mdl-24336413

ABSTRACT

SUMMARY: Experimental MS(n) mass spectral libraries currently do not adequately cover chemical space. This limits the robust annotation of metabolites in metabolomics studies of complex biological samples. In silico fragmentation libraries would improve the identification of compounds from experimental multistage fragmentation data when experimental reference data are unavailable. Here, we present a freely available software package to automatically control Mass Frontier software to construct in silico mass spectral libraries and to perform spectral matching. Based on two case studies, we have demonstrated that high-throughput automation of Mass Frontier allows researchers to generate in silico mass spectral libraries in an automated and high-throughput fashion with little or no human intervention required. AVAILABILITY AND IMPLEMENTATION: Documentation, examples, results and source code are available at http://www.biosciences-labs.bham.ac.uk/viant/hammer/.


Subjects
Mass Spectrometry/methods , Metabolomics , Pattern Recognition, Automated , Pharmaceutical Preparations/analysis , Phenylalanine/metabolism , Software , Computer Simulation
19.
BMC Bioinformatics ; 15 Suppl 15: S10, 2014.
Article in English | MEDLINE | ID: mdl-25474747

ABSTRACT

BACKGROUND: The exponential growth of next-generation sequencing (NGS) derived DNA data poses great challenges to data storage and transmission. Although many compression algorithms have been proposed for DNA reads in NGS data, few methods are designed specifically to handle the quality scores. RESULTS: In this paper we present a memetic algorithm (MA) based NGS quality score data compressor, namely MMQSC. The algorithm extracts raw quality score sequences from FASTQ formatted files, and designs compression codebook using MA based multimodal optimization. The input data is then compressed in a substitutional manner. Experimental results on five representative NGS data sets show that MMQSC obtains higher compression ratio than the other state-of-the-art methods. Particularly, MMQSC is a lossless reference-free compression algorithm, yet obtains an average compression ratio of 22.82% on the experimental data sets. CONCLUSIONS: The proposed MMQSC compresses NGS quality score data effectively. It can be utilized to improve the overall compression ratio on FASTQ formatted files.


Subjects
Algorithms , Data Compression/methods , High-Throughput Nucleotide Sequencing/methods , Sequence Analysis, DNA/methods
20.
Nat Commun ; 15(1): 3126, 2024 Apr 11.
Article in English | MEDLINE | ID: mdl-38605047

ABSTRACT

Long reads that cover more variants per read raise opportunities for accurate haplotype construction, whereas the genotype errors of single nucleotide polymorphisms pose great computational challenges for haplotyping tools. Here we introduce KSNP, an efficient haplotype construction tool based on the de Bruijn graph (DBG). KSNP leverages the ability of DBG in handling high-throughput erroneous reads to tackle the challenges. Compared to other notable tools in this field, KSNP achieves at least 5-fold speedup while producing comparable haplotype results. The time required for assembling human haplotypes is reduced to nearly the data-in time.


Subjects
Algorithms , Polymorphism, Single Nucleotide , Humans , Haplotypes/genetics , Sequence Analysis, DNA/methods , High-Throughput Nucleotide Sequencing/methods , Software