Results 1 - 20 of 97
1.
Brief Bioinform ; 25(3)2024 Mar 27.
Article in English | MEDLINE | ID: mdl-38581420

ABSTRACT

Protein-ligand interaction prediction presents a significant challenge in drug design. Numerous machine learning and deep learning (DL) models have been developed to accurately identify docking poses of ligands and active compounds against specific targets. However, current models often suffer from inadequate accuracy or lack practical physical significance in their scoring systems. In this research paper, we introduce IGModel, a novel approach that utilizes the geometric information of protein-ligand complexes as input for predicting the root mean square deviation of docking poses and the binding strength (pKd, the negative value of the logarithm of binding affinity) within the same prediction framework. This ensures that the output scores carry intuitive meaning. We extensively evaluate the performance of IGModel on various docking power test sets, including the CASF-2016 benchmark, PDBbind-CrossDocked-Core and DISCO set, consistently achieving state-of-the-art accuracies. Furthermore, we assess IGModel's generalizability and robustness by evaluating it on unbiased test sets and sets containing target structures generated by AlphaFold2. The exceptional performance of IGModel on these sets demonstrates its efficacy. Additionally, we visualize the latent space of protein-ligand interactions encoded by IGModel and conduct interpretability analysis, providing valuable insights. This study presents a novel framework for DL-based prediction of protein-ligand interactions, contributing to the advancement of this field. The IGModel is available at GitHub repository https://github.com/zchwang/IGModel.
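
The two quantities that the abstract says IGModel predicts can be written down directly. The following is a minimal sketch of their definitions only, not of IGModel itself; the function names `rmsd` and `pkd` are illustrative, and the RMSD here assumes both poses list the same atoms in the same order with no symmetry correction.

```python
import math


def rmsd(coords_a, coords_b):
    """Root mean square deviation between two poses.

    Each pose is a list of (x, y, z) tuples; atoms are assumed to be
    listed in the same order in both poses (no symmetry correction).
    """
    if len(coords_a) != len(coords_b) or not coords_a:
        raise ValueError("poses must be non-empty and have the same atom count")
    squared = sum(
        (ax - bx) ** 2 + (ay - by) ** 2 + (az - bz) ** 2
        for (ax, ay, az), (bx, by, bz) in zip(coords_a, coords_b)
    )
    return math.sqrt(squared / len(coords_a))


def pkd(kd_molar):
    """pKd = -log10(Kd), with the dissociation constant Kd in mol/L."""
    if kd_molar <= 0:
        raise ValueError("Kd must be positive")
    return -math.log10(kd_molar)
```

Under these definitions a nanomolar binder (Kd = 1e-9 M) has a pKd of about 9, and a lower RMSD means a docking pose closer to the reference pose.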


Subject(s)
Deep Learning , Proteins , Proteins/chemistry , Protein Binding , Ligands , Drug Design
2.
Brief Bioinform ; 24(6)2023 09 22.
Article in English | MEDLINE | ID: mdl-37833842

ABSTRACT

Recent studies have shed light on the potential of circular RNA (circRNA) as a biomarker for disease diagnosis and as a nucleic acid vaccine. The exploration of these functionalities requires correct circRNA full-length sequences; however, existing assembly tools can only correctly assemble some circRNAs, and their performance can be further improved. Here, we introduce a novel feature known as the junction contig (JC), which is an extension of the back-splice junction (BSJ). Leveraging the strengths of both BSJ and JC, we present a novel method called JCcirc (https://github.com/cbbzhang/JCcirc). It enables efficient reconstruction of all types of circRNA full-length sequences and their alternative isoforms using splice graphs and fragment coverage. Our findings demonstrate the superiority of JCcirc over existing methods on human simulation datasets: its average F1 score surpasses that of CircAST by 0.40 and those of both CIRI-full and circRNAfull by 0.13. For circRNAs below 400 bp, 400-800 bp, 800-1200 bp and above 1200 bp, the correct assembly rates are 0.13, 0.09, 0.04 and 0.03 higher, respectively, than those achieved by existing methods. Moreover, JCcirc also outperforms existing assembly tools on datasets from five other model species and on real sequencing datasets. These results show that JCcirc is a robust tool for accurately assembling circRNA full-length sequences, laying the foundation for the functional analysis of circRNAs.


Subject(s)
RNA, Circular , RNA , Humans , RNA, Circular/genetics , Sequence Analysis, RNA/methods , Protein Isoforms/genetics , RNA/genetics
3.
Brief Bioinform ; 24(1)2023 01 19.
Article in English | MEDLINE | ID: mdl-36502369

ABSTRACT

Recently reported machine learning- or deep learning-based scoring functions (SFs) have shown exciting performance in predicting protein-ligand binding affinities, with fruitful application prospects. However, differentiating between highly similar ligand conformations, including the native binding pose (the global energy minimum state), remains challenging, although doing so could greatly enhance docking. In this work, we propose a fully differentiable, end-to-end framework for ligand pose optimization based on a hybrid SF called DeepRMSD+Vina, which combines a multi-layer perceptron (DeepRMSD) with the traditional AutoDock Vina SF. DeepRMSD+Vina, which combines (1) the root mean square deviation (RMSD) of the docking pose with respect to the native pose and (2) the AutoDock Vina score, is fully differentiable and is thus capable of optimizing the ligand binding pose toward the lowest-energy conformation. Evaluated on the CASF-2016 docking power dataset, DeepRMSD+Vina reaches a success rate of 94.4%, which outperforms most SFs reported to date. We evaluated the ligand conformation optimization framework in practical molecular docking scenarios (redocking and cross-docking tasks), revealing the high potential of this framework in drug design and discovery. Structural analysis shows that this framework can identify key physical interactions in protein-ligand binding, such as hydrogen bonding. Our work provides a paradigm for optimizing ligand conformations based on deep learning algorithms. The DeepRMSD+Vina model and the optimization framework are available in the GitHub repository https://github.com/zchwang/DeepRMSD-Vina_Optimization.
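
A docking-power success rate such as the one quoted above is typically computed by checking, for each complex, whether the best-scored pose lies close to the native pose. The sketch below illustrates that bookkeeping only. The 2.0 Å cutoff, the "lower score is better" convention and the function name `docking_success_rate` are assumptions for illustration and are not taken from the abstract.

```python
def docking_success_rate(complexes, rmsd_cutoff=2.0):
    """Fraction of complexes whose top-ranked pose is near-native.

    complexes: dict mapping a complex id to a list of (score, rmsd) pairs,
               one pair per docking pose. Lower score is assumed to be
               better; rmsd is measured against the native pose in angstroms.
    rmsd_cutoff: assumed near-native threshold (hypothetical default).
    """
    if not complexes:
        raise ValueError("at least one complex is required")
    hits = 0
    for poses in complexes.values():
        # Pick the pose the scoring function ranks first.
        _best_score, best_rmsd = min(poses, key=lambda pose: pose[0])
        if best_rmsd <= rmsd_cutoff:
            hits += 1
    return hits / len(complexes)
```

For example, if the top-ranked pose is near-native for one of two complexes, the function returns 0.5.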


Subject(s)
Deep Learning , Ligands , Molecular Docking Simulation , Proteins/chemistry , Drug Design , Protein Binding
4.
Methods ; 224: 35-46, 2024 Apr.
Article in English | MEDLINE | ID: mdl-38373678

ABSTRACT

Bivalent Smac mimetics have been shown to possess binding affinity and pro-apoptotic activity similar to or more potent than that of native Smac, a protein dimer able to neutralize the anti-apoptotic activity of XIAP, an inhibitor of caspase enzymes that endows cancer cells with resistance to anticancer drugs. We designed five new bivalent Smac mimetics, which are formed by various linkers tethering two diazabicyclic cores that serve as the IAP binding motifs. We built in silico models of the five mimetics with the TwistDock workflow and evaluated their conformational tendencies, which suggest that compound 3, whose linker is n-hexylene, possesses the highest binding potency among the five. After synthesis of these compounds, their tumour cell growth inhibition and apoptosis induction in experiments with SK-OV-3 and MDA-MB-231 cancer cell lines confirmed our prediction. Among the five mimetics, compound 3 displays promising pro-apoptotic activity and deserves further optimization.


Subject(s)
Antineoplastic Agents , Neoplasms , Humans , Inhibitor of Apoptosis Proteins/metabolism , Inhibitor of Apoptosis Proteins/pharmacology , X-Linked Inhibitor of Apoptosis Protein/metabolism , X-Linked Inhibitor of Apoptosis Protein/pharmacology , Antineoplastic Agents/pharmacology , Antineoplastic Agents/chemistry , Molecular Conformation , Apoptosis , Cell Line, Tumor
5.
Methods ; 2024 Jul 05.
Article in English | MEDLINE | ID: mdl-38972499

ABSTRACT

Molecular dynamics (MD) simulation is a crucial research domain within the life sciences, focusing on comprehending the mechanisms of biomolecular interactions at atomic scales. Protein simulation, as a critical subfield, often utilizes MD for implementation, with trajectory data playing a pivotal role in drug discovery. With the advancement of high-performance computing and deep learning technology, predicting protein properties from vast trajectory data has become popular and critical, posing challenges in feature extraction from complicated simulation data and in dimensionality reduction. At the same time, it is essential to provide a meaningful explanation of the biological mechanism behind the dimensionality reduction. To tackle this challenge, we propose a new unsupervised model named RevGraphVAMP to intelligently analyze the simulation trajectory. This model is based on the variational approach for Markov processes (VAMP) and integrates graph convolutional neural networks and physical constraint optimization to enhance the learning performance. Additionally, we introduce an attention mechanism to assess the importance of key interaction regions, facilitating the interpretation of molecular mechanisms. In comparison to other VAMPNets models, our model showcases competitive performance and improved accuracy in state transition prediction, as demonstrated through its application to two public datasets and the Shank3-Rap1 complex, which is associated with autism spectrum disorder. Moreover, it enhances dimensionality-reduction discrimination across different substates and provides interpretable results for protein structural characterization.

6.
Brief Bioinform ; 23(4)2022 07 18.
Article in English | MEDLINE | ID: mdl-35724626

ABSTRACT

Deep learning is an artificial intelligence technique in which models express geometric transformations over multiple levels. This method has shown great promise in various fields, including drug development. The availability of public structure databases prompted the researchers to use generative artificial intelligence models to narrow down their search of the chemical space, a novel approach to chemogenomics and de novo drug development. In this study, we developed a strategy that combined an accelerated LSTM_Chem (long short-term memory for de novo compounds generation), dense fully convolutional neural network (DFCNN), and docking to generate a large number of de novo small molecular chemical compounds for given targets. To demonstrate its efficacy and applicability, six important targets that account for various human disorders were used as test examples. Moreover, using the M protease as a proof-of-concept example, we find that iteratively training with previously selected candidates can significantly increase the chance of obtaining novel compounds with higher and higher predicted binding affinities. In addition, we also check the potential benefit of obtaining reliable final de novo compounds with the help of MD simulation and metadynamics simulation. The generation of de novo compounds and the discovery of binders against various targets proposed here would be a practical and effective approach. Assessing the efficacy of these top de novo compounds with biochemical studies is promising to promote related drug development.


Subject(s)
Deep Learning , Artificial Intelligence , Computer Simulation , Drug Design , Humans , Neural Networks, Computer
7.
Brief Bioinform ; 23(3)2022 05 13.
Article in English | MEDLINE | ID: mdl-35289359

ABSTRACT

Scoring functions are important components in molecular docking for structure-based drug discovery. Traditional scoring functions, generally empirical- or force field-based, are robust and have proven to be useful for identifying hits and lead optimizations. Although multiple highly accurate deep learning- or machine learning-based scoring functions have been developed, their direct applications for docking and screening are limited. We describe a novel strategy to develop a reliable protein-ligand scoring function by augmenting the traditional scoring function Vina score using a correction term (OnionNet-SFCT). The correction term is developed based on an AdaBoost random forest model, utilizing multiple layers of contacts formed between protein residues and ligand atoms. In addition to the Vina score, the model considerably enhances the AutoDock Vina prediction abilities for docking and screening tasks based on different benchmarks (such as cross-docking dataset, CASF-2016, DUD-E and DUD-AD). Furthermore, our model could be combined with multiple docking applications to increase pose selection accuracies and screening abilities, indicating its wide usage for structure-based drug discoveries. Furthermore, in a reverse practice, the combined scoring strategy successfully identified multiple known receptors of a plant hormone. To summarize, the results show that the combination of data-driven model (OnionNet-SFCT) and empirical scoring function (Vina score) is a good scoring strategy that could be useful for structure-based drug discoveries and potentially target fishing in future.
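
The idea of augmenting an empirical score with a learned correction term can be illustrated with a small re-ranking sketch. The abstract does not state how the two terms are combined, so the additive form, the `weight` parameter and the function names below are assumptions made purely for illustration.

```python
def combined_score(vina_score, correction, weight=0.5):
    """Empirical score plus a weighted, data-driven correction term.

    The additive form and the default weight are hypothetical; lower
    combined scores are assumed to indicate better poses.
    """
    return vina_score + weight * correction


def rank_poses(poses, weight=0.5):
    """Return pose names ordered best-first by the combined score.

    poses: list of (name, vina_score, correction) tuples.
    """
    ordered = sorted(poses, key=lambda p: combined_score(p[1], p[2], weight))
    return [name for name, _vina, _corr in ordered]
```

With `weight=0` the ranking reduces to the plain empirical score, which makes it easy to compare rankings with and without the correction term.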


Subject(s)
Drug Discovery , Proteins , Drug Discovery/methods , Ligands , Machine Learning , Molecular Docking Simulation , Protein Binding , Proteins/chemistry
8.
Methods ; 216: 39-50, 2023 08.
Article in English | MEDLINE | ID: mdl-37330158

ABSTRACT

Assessing the quality of sequencing data plays a crucial role in downstream data analysis. However, existing tools often achieve sub-optimal efficiency, especially when dealing with compressed files or performing complicated quality control operations such as over-representation analysis and error correction. We present RabbitQCPlus, an ultra-efficient quality control tool for modern multi-core systems. RabbitQCPlus uses vectorization, memory copy reduction, parallel (de)compression, and optimized data structures to achieve substantial performance gains. It is 1.1 to 5.4 times faster when performing basic quality control operations compared to state-of-the-art applications yet requires fewer compute resources. Moreover, RabbitQCPlus is at least 4 times faster than other applications when processing gzip-compressed FASTQ files and 1.3 times faster with the error correction module turned on. Furthermore, it takes less than 4 minutes to process 280 GB of plain FASTQ sequencing data, while other applications take at least 22 minutes on a 48-core server when enabling the per-read over-representation analysis. C++ sources are available at https://github.com/RabbitBio/RabbitQCPlus.


Subject(s)
Data Compression , Software , High-Throughput Nucleotide Sequencing , Quality Control , Algorithms , Sequence Analysis, DNA
9.
Nucleic Acids Res ; 50(D1): D747-D757, 2022 01 07.
Article in English | MEDLINE | ID: mdl-34554255

ABSTRACT

Many open access transcriptomic datasets for coronavirus disease 2019 (COVID-19) have been generated, but they are highly heterogeneous and difficult to analyze. To utilize these invaluable data for a better understanding of COVID-19, additional software should be developed. A user-friendly platform is especially needed for researchers without bioinformatics skills. We developed the COVID19db platform (http://hpcc.siat.ac.cn/covid19db & http://www.biomedical-web.com/covid19db) that provides 39 930 drug-target-pathway interactions and 95 COVID-19 related datasets, which include transcriptomes of 4127 human samples across 13 body sites associated with exposure to 33 microbes and 33 drugs/agents. To facilitate data application, each dataset was standardized and annotated with rich clinical information. The platform further provides 14 different analytical applications to analyze various mechanisms underlying COVID-19. Moreover, the 14 applications enable researchers to customize grouping and settings for different analyses and allow them to perform analyses using their own data. Furthermore, a Drug Discovery tool is designed to identify potential drugs and targets at the whole-transcriptome scale. As a proof of concept, we used COVID19db and identified multiple potential drugs and targets for COVID-19. In summary, COVID19db provides user-friendly web interfaces to freely analyze and download data and to submit new data for further integration, and it can accelerate the identification of effective strategies against COVID-19.


Subject(s)
Antiviral Agents/pharmacology , COVID-19 Drug Treatment , Databases, Factual , Drug Discovery/methods , COVID-19/genetics , Humans , Transcriptome
10.
Nucleic Acids Res ; 50(D1): D83-D92, 2022 01 07.
Article in English | MEDLINE | ID: mdl-34530446

ABSTRACT

Many circRNA transcriptome data were deposited in public resources, but these data show great heterogeneity. Researchers without bioinformatics skills have difficulty in investigating these invaluable data or their own data. Here, we specifically designed circMine (http://hpcc.siat.ac.cn/circmine and http://www.biomedical-web.com/circmine/) that provides 1 821 448 entries formed by 136 871 circRNAs, 87 diseases and 120 circRNA transcriptome datasets of 1107 samples across 31 human body sites. circMine further provides 13 online analytical functions to comprehensively investigate these datasets to evaluate the clinical and biological significance of circRNA. To improve the data applicability, each dataset was standardized and annotated with relevant clinical information. All of the 13 analytic functions allow users to group samples based on their clinical data and assign different parameters for different analyses, and enable them to perform these analyses using their own circRNA transcriptomes. Moreover, three additional tools were developed in circMine to systematically discover the circRNA-miRNA interaction and circRNA translatability. For example, we systematically discovered five potential translatable circRNAs associated with prostate cancer progression using circMine. In summary, circMine provides user-friendly web interfaces to browse, search, analyze and download data freely, and submit new data for further integration, and it can be an important resource to discover significant circRNA in different diseases.


Subject(s)
Computational Biology , Databases, Genetic , RNA, Circular/genetics , Transcriptome/genetics , Genetic Diseases, Inborn/genetics , Humans , Neoplasms/genetics , RNA, Circular/classification
11.
Proteins ; 91(12): 1837-1849, 2023 Dec.
Article in English | MEDLINE | ID: mdl-37606194

ABSTRACT

We introduce a deep learning-based ligand pose scoring model called zPoseScore for predicting protein-ligand complexes in the 15th Critical Assessment of Protein Structure Prediction (CASP15). Our contributions are threefold: first, we generate six training and evaluation data sets by employing advanced data augmentation and sampling methods. Second, we redesign the "zFormer" module, inspired by AlphaFold2's Evoformer, to efficiently describe protein-ligand interactions. This module enables the extraction of protein-ligand paired features that lead to accurate predictions. Finally, we develop the zPoseScore framework with zFormer for scoring and ranking ligand poses, allowing for atomic-level protein-ligand feature encoding and fusion to output refined ligand poses and ligand per-atom deviations. Our results demonstrate excellent performance on various testing data sets, achieving Pearson's correlation R = 0.783 and 0.659 for ranking docking decoys generated based on experimental and predicted protein structures of CASF-2016 protein-ligand complexes. Additionally, we obtain an averaged local distance difference test (lDDT pli = 0.558) of AIchemy LIG2 in CASP15 for de novo protein-ligand complex structure predictions. Detailed analysis shows that accurate ligand binding site prediction and side-chain orientation are crucial for achieving better prediction performance. Our proposed model is one of the most accurate protein-ligand pose prediction models and could serve as a valuable tool in small molecule drug discovery.


Subject(s)
Proteins , Ligands , Protein Binding , Proteins/chemistry , Binding Sites , Molecular Docking Simulation
12.
Bioinformatics ; 38(10): 2932-2933, 2022 05 13.
Article in English | MEDLINE | ID: mdl-35561184

ABSTRACT

MOTIVATION: Detection and identification of viruses and microorganisms in sequencing data plays an important role in pathogen diagnosis and research. However, existing tools for this problem often suffer from high runtimes and memory consumption. RESULTS: We present RabbitV, a tool for rapid detection of viruses and microorganisms in Illumina sequencing datasets based on fast identification of unique k-mers. It can exploit the power of modern multi-core CPUs by using multi-threading, vectorization and fast data parsing. Experiments show that RabbitV outperforms fastv by a factor of at least 42.5 and 14.4 in unique k-mer generation (RabbitUniq) and pathogen identification (RabbitV), respectively. Furthermore, RabbitV is able to detect COVID-19 from 40 samples of sequencing data (255 GB in FASTQ format) in only 320 s. AVAILABILITY AND IMPLEMENTATION: RabbitUniq and RabbitV are available at https://github.com/RabbitBio/RabbitUniq and https://github.com/RabbitBio/RabbitV. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
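
The unique k-mer approach described above can be sketched with plain Python sets: k-mers that occur in a target genome but in no background genome serve as markers, and a read is flagged when it contains any marker. This is a naive illustration, not the RabbitUniq or RabbitV implementation; it ignores reverse complements and sequencing errors, and all function names are hypothetical.

```python
def kmers(seq, k):
    """Set of all length-k substrings of seq (upper-cased)."""
    seq = seq.upper()
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def unique_kmers(target, background_genomes, k):
    """k-mers present in the target genome but in no background genome."""
    background = set()
    for genome in background_genomes:
        background |= kmers(genome, k)
    return kmers(target, k) - background


def count_hits(reads, markers, k):
    """Number of reads containing at least one marker k-mer."""
    return sum(1 for read in reads if kmers(read, k) & markers)
```

A real tool would store k-mers in a compact encoded form and process reads in parallel; the set-based version only shows the logic.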


Subject(s)
COVID-19 , Viruses , Algorithms , High-Throughput Nucleotide Sequencing , Humans , Sequence Analysis, DNA , Software , Viruses/genetics
13.
J Chem Inf Model ; 63(3): 835-845, 2023 02 13.
Article in English | MEDLINE | ID: mdl-36724090

ABSTRACT

Many bioactive peptides have demonstrated therapeutic effects against complicated diseases, including antiviral, antibacterial and anticancer effects. It is possible to generate a large number of potentially bioactive peptides using deep learning in a manner analogous to the generation of de novo chemical compounds using the acquired bioactive peptides as a training set. Such generative techniques would be significant for drug development since peptides are much easier and cheaper to synthesize than compounds. Despite the limited availability of deep learning-based peptide-generating models, we have built an LSTM model (called LSTM_Pep) to generate de novo peptides and fine-tuned the model to generate de novo peptides with specific prospective therapeutic benefits. Remarkably, the Antimicrobial Peptide Database has been effectively utilized to generate various kinds of potentially active de novo peptides. We proposed a pipeline for screening those generated peptides for a given target and used the main protease of SARS-CoV-2 as a proof of concept. Moreover, we have developed a deep learning-based protein-peptide prediction model (DeepPep) for rapid screening of the generated peptides for the given targets. Together with the generating model, we have demonstrated that iterative fine-tuning, generation, and screening of peptides for higher predicted binding affinity can be achieved. Our work sheds light on developing deep learning-based methods and pipelines to effectively generate and obtain bioactive peptides with a specific therapeutic effect and showcases how artificial intelligence can help discover de novo bioactive peptides that can bind to a particular target.


Subject(s)
COVID-19 , Deep Learning , Humans , Artificial Intelligence , Drug Design , SARS-CoV-2 , Peptides/pharmacology
14.
Methods ; 205: 247-262, 2022 09.
Article in English | MEDLINE | ID: mdl-35878751

ABSTRACT

Identifying native-like protein-ligand complexes (PLCs) from an abundance of docking decoys is critical for large-scale virtual drug screening in early-stage drug discovery lead searching efforts. Providing reliable predictions is still a challenge for most current affinity prediction models because of a lack of non-binding data during model training, the loss of critical physical-chemical features, and difficulties in learning abstract information with limited neural layers. In this work, we propose a deep learning model, DeepBindBC, for classifying putative ligands as binding or non-binding. Our model incorporates information on non-binding interactions, making it more suitable for real applications. The ResNet model architecture and a more detailed atom type representation allow implicit features to be learned more accurately. Here, we show that DeepBindBC outperforms Autodock Vina, Pafnucy, and DLSCORE on three DUD.E testing sets. Moreover, DeepBindBC identified a novel human pancreatic α-amylase binder validated by a fluorescence spectral experiment (Ka = 1.0 × 10^5 M). Furthermore, DeepBindBC can be used as a core component of a hybrid virtual screening pipeline that incorporates many other complementary methods, such as DFCNN, Autodock Vina docking, and pocket molecular dynamics simulation. Additionally, an online web server based on the model is available at http://cbblab.siat.ac.cn/DeepBindBC/index.php for the user's convenience. Our model and the web server provide alternative tools in the early steps of drug discovery by providing accurate identification of native-like PLCs.


Subject(s)
Deep Learning , Humans , Ligands , Molecular Docking Simulation , Molecular Dynamics Simulation , Protein Binding , Proteins/chemistry
15.
Environ Res ; 227: 115710, 2023 06 15.
Article in English | MEDLINE | ID: mdl-36933634

ABSTRACT

Vegetation restoration projects can not only improve water quality by absorbing and transferring pollutants and nutrients from non-vegetation sources, but also protect biodiversity by providing habitat for biological growth. However, the mechanisms of the protistan and bacterial assembly processes in vegetation restoration projects have rarely been explored. To address this, based on 18S rRNA and 16S rRNA high-throughput sequencing, we investigated the mechanisms of protistan and bacterial community assembly processes, environmental conditions, and microbial interactions in rivers with and without vegetation restoration. The results indicated that the deterministic process dominated the protistan and bacterial community assembly (94.29% and 92.38%, respectively), influenced by biotic and abiotic factors. For biotic factors, microbial network connectivity was higher in the vegetation zone (average degree = 20.34) than in the bare zone (average degree = 11.00). For abiotic factors, the concentration of dissolved organic carbon ([DOC]) was the most important environmental factor affecting the microbial community composition. [DOC] was significantly lower in the vegetation zone (18.65 ± 6.34 mg/L) than in the bare zone (28.22 ± 4.82 mg/L). In overlying water, vegetation restoration upregulated the protein-like fluorescence components (C1 and C2) by 1.26- and 1.01-fold and downregulated the terrestrial humic-like fluorescence components (C3 and C4) by 0.54- and 0.55-fold, respectively. The different DOM components guided bacteria and protists to select different interactive relationships. The protein-like DOM components led to bacterial competition, whereas the humus-like DOM components resulted in protistan competition. Finally, a structural equation model was established to explain that DOM components can affect protistan and bacterial diversity by providing substrates, facilitating microbial interactions, and promoting nutrient input. In general, our study provides insights into the responses of vegetation-restored ecosystems to the dynamics and interactions in an anthropogenically influenced river and evaluates the ecological restoration performance of vegetation restoration from a molecular biology perspective.


Subject(s)
Dissolved Organic Matter , Microbiota , Rivers/chemistry , Water Quality , Bacteria/genetics , Spectrometry, Fluorescence
16.
Int J Mol Sci ; 24(22)2023 Nov 07.
Article in English | MEDLINE | ID: mdl-38003217

ABSTRACT

The automatic detection of cells in microscopy image sequences is a significant task in biomedical research. However, cells in routine microscopy images, which are taken while constant division and differentiation occur, are notoriously difficult to detect due to changes in their appearance and number. Recently, convolutional neural network (CNN)-based methods have made significant progress in cell detection and tracking. However, these approaches require large amounts of manually annotated data for fully supervised training, which is time-consuming to produce and often requires professional researchers. To alleviate such tiresome and labor-intensive costs, we propose a novel weakly supervised learning cell detection and tracking framework that trains the deep neural network using incomplete initial labels. Our approach uses incomplete cell markers obtained from fluorescent images for initial training on the Induced Pluripotent Stem (iPS) cell dataset, which is rarely studied for cell detection and tracking. During training, the incomplete initial labels were updated iteratively by combining detection and tracking results to obtain a model with better robustness. Our method was evaluated using two fields of the iPS cell dataset, along with the cell detection accuracy (DET) evaluation metric from the Cell Tracking Challenge (CTC) initiative, and it achieved 0.862 and 0.924 DET, respectively. The transferability of the developed model was tested using the public dataset Fluo-N2DH-GOWT1, which was taken from CTC; this contains two datasets with reference annotations. We randomly removed parts of the annotations in each labeled dataset to simulate the initial annotations on the public dataset. After training the model on the two datasets, with labels that comprise 10% cell markers, the DET improved from 0.130 to 0.903 and 0.116 to 0.877. When trained with labels that comprise 60% cell markers, the performance was better than that of the model trained using the supervised learning method. This outcome indicates that the model's performance improved as the quality of the labels used for training increased.


Subject(s)
Neural Networks, Computer , Supervised Machine Learning , Image Processing, Computer-Assisted/methods
17.
Bioinformatics ; 37(6): 873-875, 2021 05 05.
Article in English | MEDLINE | ID: mdl-32845281

ABSTRACT

MOTIVATION: Mash is a popular hash-based genome analysis toolkit with applications to important downstream analyses tasks such as clustering and assembly. However, Mash is currently not able to fully exploit the capabilities of modern multi-core architectures, which in turn leads to high runtimes for large-scale genomic datasets. RESULTS: We present RabbitMash, an efficient highly optimized implementation of Mash which can take full advantage of modern hardware including multi-threading, vectorization and fast I/O. We show that our approach achieves speedups of at least 1.3, 9.8, 8.5 and 4.4 compared to Mash for the operations sketch, dist, triangle and screen, respectively. Furthermore, RabbitMash is able to compute the all-versus-all distances of 100 321 genomes in <5 min on a 40-core workstation while Mash requires over 40 min. AVAILABILITY AND IMPLEMENTATION: RabbitMash is available at https://github.com/ZekunYin/RabbitMash. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
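
The hash-based distance that Mash computes can be sketched as follows: each genome is reduced to a bottom-s MinHash sketch of its k-mer hashes, the Jaccard index is estimated from two sketches, and the Mash distance is derived as D = -(1/k) ln(2j / (1 + j)). This is a simplified illustration rather than the Mash or RabbitMash code; it uses BLAKE2b from the standard library as a stand-in hash and skips canonical (strand-independent) k-mers.

```python
import hashlib
import math


def kmer_hashes(seq, k):
    """64-bit hashes of all k-mers in seq (BLAKE2b used as a stand-in hash)."""
    seq = seq.upper()
    return {
        int.from_bytes(
            hashlib.blake2b(seq[i:i + k].encode(), digest_size=8).digest(), "big"
        )
        for i in range(len(seq) - k + 1)
    }


def sketch(seq, k, size):
    """Bottom-`size` MinHash sketch: the smallest k-mer hashes, sorted."""
    return sorted(kmer_hashes(seq, k))[:size]


def jaccard_estimate(sketch_a, sketch_b, size):
    """Estimate the Jaccard index from two bottom-`size` sketches."""
    merged = sorted(set(sketch_a) | set(sketch_b))[:size]
    if not merged:
        raise ValueError("sketches are empty; sequences may be shorter than k")
    shared = set(sketch_a) & set(sketch_b)
    return sum(1 for h in merged if h in shared) / len(merged)


def mash_distance(jaccard, k):
    """Mash distance D = -(1/k) * ln(2j / (1 + j)); 1.0 when nothing is shared."""
    if jaccard <= 0:
        return 1.0
    return max(0.0, -math.log(2 * jaccard / (1 + jaccard)) / k)
```

Because every pair of sketches can be compared independently, an all-versus-all distance computation like the one mentioned in the abstract parallelizes naturally across cores.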


Subject(s)
Algorithms , Software , Computers , Genome , Genomics
18.
Bioinformatics ; 37(4): 573-574, 2021 05 01.
Article in English | MEDLINE | ID: mdl-32790850

ABSTRACT

MOTIVATION: Modern sequencing technologies continue to revolutionize many areas of biology and medicine. Since the generated datasets are error-prone, downstream applications usually require quality control methods to pre-process FASTQ files. However, existing tools for this task are currently not able to fully exploit the capabilities of computing platforms leading to slow runtimes. RESULTS: We present RabbitQC, an extremely fast integrated quality control tool for FASTQ files, which can take full advantage of modern hardware. It includes a variety of operations and supports different sequencing technologies (Illumina, Oxford Nanopore and PacBio). RabbitQC achieves speedups between one and two orders-of-magnitude compared to other state-of-the-art tools. AVAILABILITY AND IMPLEMENTATION: C++ sources and binaries are available at https://github.com/ZekunYin/RabbitQC. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
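
One basic quality control operation on FASTQ files is filtering reads by mean base quality. The sketch below shows that single step in plain Python; it is not RabbitQC's implementation, and it assumes four-line FASTQ records with Phred+33 quality encoding. The function names and the default threshold of 20 are illustrative.

```python
import io


def read_fastq(handle):
    """Yield (name, sequence, quality) from a four-line-per-record FASTQ stream."""
    while True:
        header = handle.readline().rstrip("\n")
        if not header:
            return
        seq = handle.readline().rstrip("\n")
        handle.readline()  # '+' separator line, ignored
        qual = handle.readline().rstrip("\n")
        yield header[1:], seq, qual


def mean_phred(qual, offset=33):
    """Mean Phred score of a quality string (Phred+33 encoding assumed)."""
    return sum(ord(c) - offset for c in qual) / len(qual)


def filter_reads(handle, min_mean_q=20.0):
    """Keep reads whose mean quality is at least min_mean_q."""
    return [
        (name, seq)
        for name, seq, qual in read_fastq(handle)
        if qual and mean_phred(qual) >= min_mean_q
    ]
```

The function accepts any text stream, so it can be tried on an in-memory record via `io.StringIO` before pointing it at a file opened with `open(path)`.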


Subject(s)
Nanopores , Software , High-Throughput Nucleotide Sequencing , Quality Control , Sequence Analysis, DNA
19.
PLoS Comput Biol ; 17(5): e1009027, 2021 05.
Article in English | MEDLINE | ID: mdl-34029314

ABSTRACT

Sequence-based residue contact prediction plays a crucial role in protein structure reconstruction. In recent years, the combination of evolutionary coupling analysis (ECA) and deep learning (DL) techniques has made tremendous progress in residue contact prediction; thus, a comprehensive assessment of current methods based on a large-scale benchmark data set is much needed. In this study, we evaluate 18 contact predictors on 610 non-redundant proteins and 32 CASP13 targets from a wide range of perspectives. The results show that different methods have different application scenarios: (1) DL methods based on multi-categories of inputs and large training sets are the best choices for low-contact-density proteins such as the intrinsically disordered ones and proteins with shallow multi-sequence alignments (MSAs). (2) With at least 5L (L is sequence length) effective sequences in the MSA, all the methods show their best performance, and methods that rely only on the MSA as input can reach results comparable to those of methods that adopt multi-source inputs. (3) For top L/5 and L/2 predictions, DL methods can predict more hydrophobic interactions while ECA methods predict more salt bridges and disulfide bonds. (4) ECA methods can detect more secondary structure interactions, while DL methods can accurately excavate more contact patterns and prune isolated false positives. In general, multi-input DL methods with large training sets dominate current approaches with the best overall performance. Despite the great success of current DL methods, it must be stated that there is still much room left for further improvement: (1) With shallow MSAs, the performance will be greatly affected. (2) Current methods show lower precisions for inter-domain compared with intra-domain contact predictions, as well as very high imbalances in precisions between intra-domains. (3) Strong prediction similarities between DL methods indicate that more feature types and diversified models need to be developed. (4) The runtime of most methods can be further optimized.
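
The "top L/5" and "top L/2" evaluations mentioned above rank predicted contacts by confidence and measure the precision of the first L/k predictions, where L is the sequence length. The following is a minimal sketch of that metric; the minimum sequence separation of 6 residues and the function name `top_precision` are assumptions for illustration, since the abstract does not give the exact evaluation settings.

```python
def top_precision(predictions, true_contacts, seq_len, divisor=5, min_sep=6):
    """Precision of the top L/divisor predicted contacts.

    predictions:   list of (i, j, probability) residue-pair predictions.
    true_contacts: set of (i, j) pairs with i < j that are real contacts.
    seq_len:       sequence length L.
    min_sep:       assumed minimum |i - j| for a pair to be evaluated.
    """
    n_top = max(1, seq_len // divisor)
    candidates = [
        (min(i, j), max(i, j), prob)
        for i, j, prob in predictions
        if abs(i - j) >= min_sep
    ]
    # Highest-confidence predictions first, truncated to the top L/divisor.
    ranked = sorted(candidates, key=lambda c: c[2], reverse=True)[:n_top]
    if not ranked:
        return 0.0
    correct = sum(1 for i, j, _prob in ranked if (i, j) in true_contacts)
    return correct / len(ranked)
```

Calling the function with `divisor=2` gives the top L/2 precision from the same prediction list.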


Subject(s)
Computational Biology/methods , Proteins/chemistry , Amino Acid Sequence , Datasets as Topic , Deep Learning , Prospective Studies , Retrospective Studies
20.
Int J Mol Sci ; 23(12)2022 Jun 17.
Article in English | MEDLINE | ID: mdl-35743218

ABSTRACT

Circular RNAs (circRNAs) are RNA molecules formed by joining a downstream 3′ splice donor site and an upstream 5′ splice acceptor site. Several recent studies have identified circRNAs as potential biomarkers for different diseases. A number of methods are available for the identification of circRNAs, but these identification methods cannot provide full-length sequences. Reconstruction of the full-length sequences is crucial for downstream analyses in circRNA research, including differential expression analysis, circRNA-miRNA interaction analysis and other functional studies of circRNAs. However, only a limited number of methods are available in the literature for the reconstruction of full-length circRNA sequences. We developed a new method, circRNA-full, for full-length circRNA sequence reconstruction utilizing chimeric alignment information from the STAR aligner. To evaluate our method, we used full-length circRNA sequences produced by isocirc and ciri-long using long-read RNA-seq data. Our method achieved a better reconstruction rate, precision, sensitivity and F1 score than the existing full-length circRNA sequence reconstruction tool ciri-full for both human and mouse data.
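
The precision, sensitivity and F1 score used to compare reconstruction tools can be computed from two sets of sequences. The sketch below treats a reconstructed sequence as correct only when it exactly matches a reference sequence; that matching criterion and the function name are assumptions, since the abstract does not describe how matches were defined.

```python
def reconstruction_metrics(predicted, reference):
    """Return (precision, sensitivity, F1) for reconstructed sequences.

    predicted: iterable of reconstructed full-length sequences (or ids).
    reference: iterable of reference full-length sequences (or ids).
    A prediction counts as correct only on an exact match (assumed criterion).
    """
    predicted, reference = set(predicted), set(reference)
    true_pos = len(predicted & reference)
    precision = true_pos / len(predicted) if predicted else 0.0
    sensitivity = true_pos / len(reference) if reference else 0.0
    total = precision + sensitivity
    f1 = 2 * precision * sensitivity / total if total else 0.0
    return precision, sensitivity, f1
```

Because F1 is the harmonic mean of precision and sensitivity, a tool cannot score well by over-predicting or under-predicting alone.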


Subject(s)
RNA Splice Sites , RNA, Circular , Animals , Mice , RNA/genetics , RNA/metabolism , RNA, Circular/genetics , RNA-Seq