RESUMO
Protein-ligand interaction prediction presents a significant challenge in drug design. Numerous machine learning and deep learning (DL) models have been developed to accurately identify docking poses of ligands and active compounds against specific targets. However, current models often suffer from inadequate accuracy or lack practical physical significance in their scoring systems. In this research paper, we introduce IGModel, a novel approach that utilizes the geometric information of protein-ligand complexes as input for predicting the root mean square deviation of docking poses and the binding strength (pKd, the negative value of the logarithm of binding affinity) within the same prediction framework. This ensures that the output scores carry intuitive meaning. We extensively evaluate the performance of IGModel on various docking power test sets, including the CASF-2016 benchmark, PDBbind-CrossDocked-Core and DISCO set, consistently achieving state-of-the-art accuracies. Furthermore, we assess IGModel's generalizability and robustness by evaluating it on unbiased test sets and sets containing target structures generated by AlphaFold2. The exceptional performance of IGModel on these sets demonstrates its efficacy. Additionally, we visualize the latent space of protein-ligand interactions encoded by IGModel and conduct interpretability analysis, providing valuable insights. This study presents a novel framework for DL-based prediction of protein-ligand interactions, contributing to the advancement of this field. The IGModel is available at GitHub repository https://github.com/zchwang/IGModel.
Assuntos
Aprendizado Profundo , Proteínas , Proteínas/química , Ligação Proteica , Ligantes , Desenho de FármacosRESUMO
Recent studies have shed light on the potential of circular RNA (circRNA) as a biomarker for disease diagnosis and as a nucleic acid vaccine. The exploration of these functionalities requires correct circRNA full-length sequences; however, existing assembly tools can only correctly assemble some circRNAs, and their performance can be further improved. Here, we introduce a novel feature known as the junction contig (JC), which is an extension of the back-splice junction (BSJ). Leveraging the strengths of both BSJ and JC, we present a novel method called JCcirc (https://github.com/cbbzhang/JCcirc). It enables efficient reconstruction of all types of circRNA full-length sequences and their alternative isoforms using splice graphs and fragment coverage. Our findings demonstrate the superiority of JCcirc over existing methods on human simulation datasets, and its average F1 score surpasses CircAST by 0.40 and outperforms both CIRI-full and circRNAfull by 0.13. For circRNAs below 400 bp, 400-800 bp, 800 bp-1200 bp and above 1200 bp, the correct assembly rates are 0.13, 0.09, 0.04 and 0.03 higher, respectively, than those achieved by existing methods. Moreover, JCcirc also outperforms existing assembly tools on other five model species datasets and real sequencing datasets. These results show that JCcirc is a robust tool for accurately assembling circRNA full-length sequences, laying the foundation for the functional analysis of circRNAs.
Assuntos
RNA Circular , RNA , Humanos , RNA Circular/genética , Análise de Sequência de RNA/métodos , Isoformas de Proteínas/genética , RNA/genéticaRESUMO
The recently reported machine learning- or deep learning-based scoring functions (SFs) have shown exciting performance in predicting protein-ligand binding affinities with fruitful application prospects. However, the differentiation between highly similar ligand conformations, including the native binding pose (the global energy minimum state), remains challenging that could greatly enhance the docking. In this work, we propose a fully differentiable, end-to-end framework for ligand pose optimization based on a hybrid SF called DeepRMSD+Vina combined with a multi-layer perceptron (DeepRMSD) and the traditional AutoDock Vina SF. The DeepRMSD+Vina, which combines (1) the root mean square deviation (RMSD) of the docking pose with respect to the native pose and (2) the AutoDock Vina score, is fully differentiable; thus is capable of optimizing the ligand binding pose to the energy-lowest conformation. Evaluated by the CASF-2016 docking power dataset, the DeepRMSD+Vina reaches a success rate of 94.4%, which outperforms most reported SFs to date. We evaluated the ligand conformation optimization framework in practical molecular docking scenarios (redocking and cross-docking tasks), revealing the high potentialities of this framework in drug design and discovery. Structural analysis shows that this framework has the ability to identify key physical interactions in protein-ligand binding, such as hydrogen-bonding. Our work provides a paradigm for optimizing ligand conformations based on deep learning algorithms. The DeepRMSD+Vina model and the optimization framework are available at GitHub repository https://github.com/zchwang/DeepRMSD-Vina_Optimization.
Assuntos
Aprendizado Profundo , Ligantes , Simulação de Acoplamento Molecular , Proteínas/química , Desenho de Fármacos , Ligação ProteicaRESUMO
MOTIVATION: Molecular docking is an invaluable computational tool with broad applications in computer-aided drug design and enzyme engineering. However, current molecular docking tools are typically implemented in languages such as C ++ for calculation speed, which lack flexibility and user-friendliness for further development. Moreover, validating the effectiveness of external scoring functions for molecular docking and screening within these frameworks is challenging, and implementing more efficient sampling strategies is not straightforward. RESULTS: To address these limitations, we have developed an open-source molecular docking framework, OpenDock, based on Python and PyTorch. This framework supports the integration of multiple scoring functions; some can be utilized during molecular docking and pose optimization, while others can be employed for post-processing scoring. In terms of sampling, the current version of this framework supports simulated annealing and Monte Carlo optimization. Additionally, it can be extended to include methods such as genetic algorithms and particle swarm optimization for sampling docking poses and protein side chain orientations. Distance constraints are also implemented to enable covalent docking, restricted docking or distance map constraints guided pose sampling. Overall, this framework serves as a valuable tool in drug design and enzyme engineering, offering significant flexibility for most protein-ligand modelling tasks. AVAILABILITY AND IMPLEMENTATION: OpenDock is publicly available at: Https://github.com/guyuehuo/opendock.
RESUMO
Bivalent Smac mimetics have been shown to possess binding affinity and pro-apoptotic activity similar to or more potent than that of native Smac, a protein dimer able to neutralize the anti-apoptotic activity of an inhibitor of caspase enzymes, XIAP, which endows cancer cells with resistance to anticancer drugs. We design five new bivalent Smac mimetics, which are formed by various linkers tethering two diazabicyclic cores being the IAP binding motifs. We built in silico models of the five mimetics by the TwistDock workflow and evaluated their conformational tendency, which suggests that compound 3, whose linker is n-hexylene, possess the highest binding potency among the five. After synthesis of these compounds, their ability in tumour cell growth inhibition and apoptosis induction displayed in experiments with SK-OV-3 and MDA-MB-231 cancer cell lines confirms our prediction. Among the five mimetics, compound 3 displays promising pro-apoptotic activity and deserves further optimization.
Assuntos
Antineoplásicos , Neoplasias , Humanos , Proteínas Inibidoras de Apoptose/metabolismo , Proteínas Inibidoras de Apoptose/farmacologia , Proteínas Inibidoras de Apoptose Ligadas ao Cromossomo X/metabolismo , Proteínas Inibidoras de Apoptose Ligadas ao Cromossomo X/farmacologia , Antineoplásicos/farmacologia , Antineoplásicos/química , Conformação Molecular , Apoptose , Linhagem Celular TumoralRESUMO
Molecular dynamics simulation is a crucial research domain within the life sciences, focusing on comprehending the mechanisms of biomolecular interactions at atomic scales. Protein simulation, as a critical subfield, often utilizes MD for implementation, with trajectory data play a pivotal role in drug discovery. The advancement of high-performance computing and deep learning technology becomes popular and critical to predict protein properties from vast trajectory data, posing challenges regarding data features extraction from the complicated simulation data and dimensionality reduction. Simultaneously, it is essential to provide a meaningful explanation of the biological mechanism behind dimensionality. To tackle this challenge, we propose a new unsupervised model named RevGraphVAMP to intelligently analyze the simulation trajectory. This model is based on the variational approach for Markov processes (VAMP) and integrates graph convolutional neural networks and physical constraint optimization to enhance the learning performance. Additionally, we introduce attention mechanism to assess the importance of key interaction region, facilitating the interpretation of molecular mechanism. In comparison to other VAMPNets models, our model showcases competitive performance, improved accuracy in state transition prediction, as demonstrated through its application to two public datasets and the Shank3-Rap1 complex, which is associated with autism spectrum disorder. Moreover, it enhanced dimensionality reduction discrimination across different substates and provides interpretable results for protein structural characterization.
Assuntos
Cadeias de Markov , Simulação de Dinâmica Molecular , Redes Neurais de Computação , Proteínas , Proteínas/química , Humanos , Aprendizado ProfundoRESUMO
Deep learning is an artificial intelligence technique in which models express geometric transformations over multiple levels. This method has shown great promise in various fields, including drug development. The availability of public structure databases prompted the researchers to use generative artificial intelligence models to narrow down their search of the chemical space, a novel approach to chemogenomics and de novo drug development. In this study, we developed a strategy that combined an accelerated LSTM_Chem (long short-term memory for de novo compounds generation), dense fully convolutional neural network (DFCNN), and docking to generate a large number of de novo small molecular chemical compounds for given targets. To demonstrate its efficacy and applicability, six important targets that account for various human disorders were used as test examples. Moreover, using the M protease as a proof-of-concept example, we find that iteratively training with previously selected candidates can significantly increase the chance of obtaining novel compounds with higher and higher predicted binding affinities. In addition, we also check the potential benefit of obtaining reliable final de novo compounds with the help of MD simulation and metadynamics simulation. The generation of de novo compounds and the discovery of binders against various targets proposed here would be a practical and effective approach. Assessing the efficacy of these top de novo compounds with biochemical studies is promising to promote related drug development.
Assuntos
Aprendizado Profundo , Inteligência Artificial , Simulação por Computador , Desenho de Fármacos , Humanos , Redes Neurais de ComputaçãoRESUMO
Scoring functions are important components in molecular docking for structure-based drug discovery. Traditional scoring functions, generally empirical- or force field-based, are robust and have proven to be useful for identifying hits and lead optimizations. Although multiple highly accurate deep learning- or machine learning-based scoring functions have been developed, their direct applications for docking and screening are limited. We describe a novel strategy to develop a reliable protein-ligand scoring function by augmenting the traditional scoring function Vina score using a correction term (OnionNet-SFCT). The correction term is developed based on an AdaBoost random forest model, utilizing multiple layers of contacts formed between protein residues and ligand atoms. In addition to the Vina score, the model considerably enhances the AutoDock Vina prediction abilities for docking and screening tasks based on different benchmarks (such as cross-docking dataset, CASF-2016, DUD-E and DUD-AD). Furthermore, our model could be combined with multiple docking applications to increase pose selection accuracies and screening abilities, indicating its wide usage for structure-based drug discoveries. Furthermore, in a reverse practice, the combined scoring strategy successfully identified multiple known receptors of a plant hormone. To summarize, the results show that the combination of data-driven model (OnionNet-SFCT) and empirical scoring function (Vina score) is a good scoring strategy that could be useful for structure-based drug discoveries and potentially target fishing in future.
Assuntos
Descoberta de Drogas , Proteínas , Descoberta de Drogas/métodos , Ligantes , Aprendizado de Máquina , Simulação de Acoplamento Molecular , Ligação Proteica , Proteínas/químicaRESUMO
Accurate protein-ligand binding poses are the prerequisites of structure-based binding affinity prediction and provide the structural basis for in-depth lead optimization in small molecule drug design. However, it is challenging to provide reasonable predictions of binding poses for different molecules due to the complexity and diversity of the chemical space of small molecules. Similarity-based molecular alignment techniques can effectively narrow the search range, as structurally similar molecules are likely to have similar binding modes, with higher similarity usually correlated to higher success rates. However, molecular similarity is not consistently high because molecules often require changes to achieve specific purposes, leading to reduced alignment precision. To address this issue, we propose a new alignment methodâZ-align. This method uses topological structural information as a criterion for evaluating similarity, reducing the reliance on molecular fingerprint similarity. Our method has achieved success rates significantly higher than those of other methods at moderate levels of similarity. Additionally, our approach can comprehensively and flexibly optimize bond lengths and angles of molecules, maintaining a high accuracy even when dealing with larger molecules. Consequently, our proposed solution helps in achieving more accurate binding poses in protein-ligand docking problems, facilitating the development of small molecule drugs. Z-align is freely available as a web server at https://cloud.zelixir.com/zalign/home.
Assuntos
Simulação de Acoplamento Molecular , Proteínas , Ligantes , Proteínas/química , Proteínas/metabolismo , Ligação Proteica , Desenho de Fármacos , Conformação Proteica , Sítios de LigaçãoRESUMO
Assessing the quality of sequencing data plays a crucial role in downstream data analysis. However, existing tools often achieve sub-optimal efficiency, especially when dealing with compressed files or performing complicated quality control operations such as over-representation analysis and error correction. We present RabbitQCPlus, an ultra-efficient quality control tool for modern multi-core systems. RabbitQCPlus uses vectorization, memory copy reduction, parallel (de)compression, and optimized data structures to achieve substantial performance gains. It is 1.1 to 5.4 times faster when performing basic quality control operations compared to state-of-the-art applications yet requires fewer compute resources. Moreover, RabbitQCPlus is at least 4 times faster than other applications when processing gzip-compressed FASTQ files and 1.3 times faster with the error correction module turned on. Furthermore, it takes less than 4 minutes to process 280 GB of plain FASTQ sequencing data, while other applications take at least 22 minutes on a 48-core server when enabling the per-read over-representation analysis. C++ sources are available at https://github.com/RabbitBio/RabbitQCPlus.
Assuntos
Compressão de Dados , Software , Sequenciamento de Nucleotídeos em Larga Escala , Controle de Qualidade , Algoritmos , Análise de Sequência de DNARESUMO
Many open access transcriptomic data of coronavirus disease 2019 (COVID-19) were generated, they have great heterogeneity and are difficult to analyze. To utilize these invaluable data for better understanding of COVID-19, additional software should be developed. Especially for researchers without bioinformatic skills, a user-friendly platform is mandatory. We developed the COVID19db platform (http://hpcc.siat.ac.cn/covid19db & http://www.biomedical-web.com/covid19db) that provides 39 930 drug-target-pathway interactions and 95 COVID-19 related datasets, which include transcriptomes of 4127 human samples across 13 body sites associated with the exposure of 33 microbes and 33 drugs/agents. To facilitate data application, each dataset was standardized and annotated with rich clinical information. The platform further provides 14 different analytical applications to analyze various mechanisms underlying COVID-19. Moreover, the 14 applications enable researchers to customize grouping and setting for different analyses and allow them to perform analyses using their own data. Furthermore, a Drug Discovery tool is designed to identify potential drugs and targets at whole transcriptomic scale. For proof of concept, we used COVID19db and identified multiple potential drugs and targets for COVID-19. In summary, COVID19db provides user-friendly web interfaces to freely analyze, download data, and submit new data for further integration, it can accelerate the identification of effective strategies against COVID-19.
Assuntos
Antivirais/farmacologia , Tratamento Farmacológico da COVID-19 , Bases de Dados Factuais , Descoberta de Drogas/métodos , COVID-19/genética , Humanos , TranscriptomaRESUMO
Many circRNA transcriptome data were deposited in public resources, but these data show great heterogeneity. Researchers without bioinformatics skills have difficulty in investigating these invaluable data or their own data. Here, we specifically designed circMine (http://hpcc.siat.ac.cn/circmine and http://www.biomedical-web.com/circmine/) that provides 1 821 448 entries formed by 136 871 circRNAs, 87 diseases and 120 circRNA transcriptome datasets of 1107 samples across 31 human body sites. circMine further provides 13 online analytical functions to comprehensively investigate these datasets to evaluate the clinical and biological significance of circRNA. To improve the data applicability, each dataset was standardized and annotated with relevant clinical information. All of the 13 analytic functions allow users to group samples based on their clinical data and assign different parameters for different analyses, and enable them to perform these analyses using their own circRNA transcriptomes. Moreover, three additional tools were developed in circMine to systematically discover the circRNA-miRNA interaction and circRNA translatability. For example, we systematically discovered five potential translatable circRNAs associated with prostate cancer progression using circMine. In summary, circMine provides user-friendly web interfaces to browse, search, analyze and download data freely, and submit new data for further integration, and it can be an important resource to discover significant circRNA in different diseases.
Assuntos
Biologia Computacional , Bases de Dados Genéticas , RNA Circular/genética , Transcriptoma/genética , Doenças Genéticas Inatas/genética , Humanos , Neoplasias/genética , RNA Circular/classificaçãoRESUMO
We introduce a deep learning-based ligand pose scoring model called zPoseScore for predicting protein-ligand complexes in the 15th Critical Assessment of Protein Structure Prediction (CASP15). Our contributions are threefold: first, we generate six training and evaluation data sets by employing advanced data augmentation and sampling methods. Second, we redesign the "zFormer" module, inspired by AlphaFold2's Evoformer, to efficiently describe protein-ligand interactions. This module enables the extraction of protein-ligand paired features that lead to accurate predictions. Finally, we develop the zPoseScore framework with zFormer for scoring and ranking ligand poses, allowing for atomic-level protein-ligand feature encoding and fusion to output refined ligand poses and ligand per-atom deviations. Our results demonstrate excellent performance on various testing data sets, achieving Pearson's correlation R = 0.783 and 0.659 for ranking docking decoys generated based on experimental and predicted protein structures of CASF-2016 protein-ligand complexes. Additionally, we obtain an averaged local distance difference test (lDDT pli = 0.558) of AIchemy LIG2 in CASP15 for de novo protein-ligand complex structure predictions. Detailed analysis shows that accurate ligand binding site prediction and side-chain orientation are crucial for achieving better prediction performance. Our proposed model is one of the most accurate protein-ligand pose prediction models and could serve as a valuable tool in small molecule drug discovery.
Assuntos
Proteínas , Ligantes , Ligação Proteica , Proteínas/química , Sítios de Ligação , Simulação de Acoplamento MolecularRESUMO
MOTIVATION: Detection and identification of viruses and microorganisms in sequencing data plays an important role in pathogen diagnosis and research. However, existing tools for this problem often suffer from high runtimes and memory consumption. RESULTS: We present RabbitV, a tool for rapid detection of viruses and microorganisms in Illumina sequencing datasets based on fast identification of unique k-mers. It can exploit the power of modern multi-core CPUs by using multi-threading, vectorization and fast data parsing. Experiments show that RabbitV outperforms fastv by a factor of at least 42.5 and 14.4 in unique k-mer generation (RabbitUniq) and pathogen identification (RabbitV), respectively. Furthermore, RabbitV is able to detect COVID-19 from 40 samples of sequencing data (255 GB in FASTQ format) in only 320 s. AVAILABILITY AND IMPLEMENTATION: RabbitUniq and RabbitV are available at https://github.com/RabbitBio/RabbitUniq and https://github.com/RabbitBio/RabbitV. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
Assuntos
COVID-19 , Vírus , Algoritmos , Sequenciamento de Nucleotídeos em Larga Escala , Humanos , Análise de Sequência de DNA , Software , Vírus/genéticaRESUMO
Many bioactive peptides demonstrated therapeutic effects over complicated diseases, such as antiviral, antibacterial, anticancer, etc. It is possible to generate a large number of potentially bioactive peptides using deep learning in a manner analogous to the generation of de novo chemical compounds using the acquired bioactive peptides as a training set. Such generative techniques would be significant for drug development since peptides are much easier and cheaper to synthesize than compounds. Despite the limited availability of deep learning-based peptide-generating models, we have built an LSTM model (called LSTM_Pep) to generate de novo peptides and fine-tuned the model to generate de novo peptides with specific prospective therapeutic benefits. Remarkably, the Antimicrobial Peptide Database has been effectively utilized to generate various kinds of potential active de novo peptides. We proposed a pipeline for screening those generated peptides for a given target and used the main protease of SARS-COV-2 as a proof-of-concept. Moreover, we have developed a deep learning-based protein-peptide prediction model (DeepPep) for rapid screening of the generated peptides for the given targets. Together with the generating model, we have demonstrated that iteratively fine-tuning training, generating, and screening peptides for higher-predicted binding affinity peptides can be achieved. Our work sheds light on developing deep learning-based methods and pipelines to effectively generate and obtain bioactive peptides with a specific therapeutic effect and showcases how artificial intelligence can help discover de novo bioactive peptides that can bind to a particular target.
Assuntos
COVID-19 , Aprendizado Profundo , Humanos , Inteligência Artificial , Desenho de Fármacos , SARS-CoV-2 , Peptídeos/farmacologiaRESUMO
Identifying native-like protein-ligand complexes (PLCs) from an abundance of docking decoys is critical for large-scale virtual drug screening in early-stage drug discovery lead searching efforts. Providing reliable prediction is still a challenge for most current affinity predicting models because of a lack of non-binding data during model training, lost critical physical-chemical features, and difficulties in learning abstract information with limited neural layers. In this work, we proposed a deep learning model, DeepBindBC, for classifying putative ligands as binding or non-binding. Our model incorporates information on non-binding interactions, making it more suitable for real applications. ResNet model architecture and more detailed atom type representation guarantee implicit features can be learned more accurately. Here, we show that DeepBindBC outperforms Autodock Vina, Pafnucy, and DLSCORE for three DUD.E testing sets. Moreover, DeepBindBC identified a novel human pancreatic α-amylase binder validated by a fluorescence spectral experiment (Ka = 1.0 × 105 M). Furthermore, DeepBindBC can be used as a core component of a hybrid virtual screening pipeline that incorporating many other complementary methods, such as DFCNN, Autodock Vina docking, and pocket molecular dynamics simulation. Additionally, an online web server based on the model is available at http://cbblab.siat.ac.cn/DeepBindBC/index.php for the user's convenience. Our model and the web server provide alternative tools in the early steps of drug discovery by providing accurate identification of native-like PLCs.
Assuntos
Aprendizado Profundo , Humanos , Ligantes , Simulação de Acoplamento Molecular , Simulação de Dinâmica Molecular , Ligação Proteica , Proteínas/químicaRESUMO
Vegetation restoration projects can not only improve water quality by absorbing and transferring pollutants and nutrients from non-vegetation sources, but also protect biodiversity by providing habitat for biological growth. However, the mechanism of the protistan and bacterial assembly processes in the vegetation restoration project were rarely explored. To address this, based on 18 S rRNA and 16 S rRNA high-throughput sequencing, we investigated the mechanism of protistan and bacterial community assembly processes, environmental conditions, and microbial interactions in the rivers with (out) vegetation restoration. The results indicated that the deterministic process dominated the protistan and bacterial community assembly (94.29% and 92.38%), influenced by biotic and abiotic factors. For biotic factors, microbial network connectivity was higher in the vegetation zone (average degree = 20.34) than in the bare zone (average degree = 11.00). For abiotic factors, the concentration of dissolved organic carbon ([DOC]) was the most important environmental factor affecting the microbial community composition. [DOC] was lower significantly in vegetation zone (18.65 ± 6.34 mg/L) than in the bare zone (28.22 ± 4.82 mg/L). In overlying water, vegetation restoration upregulated the protein-like fluorescence components (C1 and C2) by 1.26 and 1.01-folds and downregulated the terrestrial humic-like fluorescence components (C3 and C4) by 0.54 and 0.55-folds, respectively. The different DOM components guided bacteria and protists to select different interactive relationships. The protein-like DOM components led to bacterial competition, whereas the humus-like DOM components resulted in protistan competition. Finally, the structural equation model was established to explain that DOM components can affect protistan and bacterial diversity by providing substrates, facilitating microbial interactions, and promoting nutrient input. In general, our study provides insights into the responses of vegetation restored ecosystems to the dynamics and interactives in the anthropogenically influenced river and evaluates the ecological restoration performance of vegetation restoration from a molecular biology perspective.
Assuntos
Matéria Orgânica Dissolvida , Microbiota , Rios/química , Qualidade da Água , Bactérias/genética , Espectrometria de FluorescênciaRESUMO
The automatic detection of cells in microscopy image sequences is a significant task in biomedical research. However, routine microscopy images with cells, which are taken during the process whereby constant division and differentiation occur, are notoriously difficult to detect due to changes in their appearance and number. Recently, convolutional neural network (CNN)-based methods have made significant progress in cell detection and tracking. However, these approaches require many manually annotated data for fully supervised training, which is time-consuming and often requires professional researchers. To alleviate such tiresome and labor-intensive costs, we propose a novel weakly supervised learning cell detection and tracking framework that trains the deep neural network using incomplete initial labels. Our approach uses incomplete cell markers obtained from fluorescent images for initial training on the Induced Pluripotent Stem (iPS) cell dataset, which is rarely studied for cell detection and tracking. During training, the incomplete initial labels were updated iteratively by combining detection and tracking results to obtain a model with better robustness. Our method was evaluated using two fields of the iPS cell dataset, along with the cell detection accuracy (DET) evaluation metric from the Cell Tracking Challenge (CTC) initiative, and it achieved 0.862 and 0.924 DET, respectively. The transferability of the developed model was tested using the public dataset FluoN2DH-GOWT1, which was taken from CTC; this contains two datasets with reference annotations. We randomly removed parts of the annotations in each labeled data to simulate the initial annotations on the public dataset. After training the model on the two datasets, with labels that comprise 10% cell markers, the DET improved from 0.130 to 0.903 and 0.116 to 0.877. When trained with labels that comprise 60% cell markers, the performance was better than the model trained using the supervised learning method. This outcome indicates that the model's performance improved as the quality of the labels used for training increased.
Assuntos
Redes Neurais de Computação , Aprendizado de Máquina Supervisionado , Processamento de Imagem Assistida por Computador/métodosRESUMO
MOTIVATION: Mash is a popular hash-based genome analysis toolkit with applications to important downstream analyses tasks such as clustering and assembly. However, Mash is currently not able to fully exploit the capabilities of modern multi-core architectures, which in turn leads to high runtimes for large-scale genomic datasets. RESULTS: We present RabbitMash, an efficient highly optimized implementation of Mash which can take full advantage of modern hardware including multi-threading, vectorization and fast I/O. We show that our approach achieves speedups of at least 1.3, 9.8, 8.5 and 4.4 compared to Mash for the operations sketch, dist, triangle and screen, respectively. Furthermore, RabbitMash is able to compute the all-versus-all distances of 100 321 genomes in <5 min on a 40-core workstation while Mash requires over 40 min. AVAILABILITY AND IMPLEMENTATION: RabbitMash is available at https://github.com/ZekunYin/RabbitMash. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
Assuntos
Algoritmos , Software , Computadores , Genoma , GenômicaRESUMO
MOTIVATION: Modern sequencing technologies continue to revolutionize many areas of biology and medicine. Since the generated datasets are error-prone, downstream applications usually require quality control methods to pre-process FASTQ files. However, existing tools for this task are currently not able to fully exploit the capabilities of computing platforms leading to slow runtimes. RESULTS: We present RabbitQC, an extremely fast integrated quality control tool for FASTQ files, which can take full advantage of modern hardware. It includes a variety of operations and supports different sequencing technologies (Illumina, Oxford Nanopore and PacBio). RabbitQC achieves speedups between one and two orders-of-magnitude compared to other state-of-the-art tools. AVAILABILITY AND IMPLEMENTATION: C++ sources and binaries are available at https://github.com/ZekunYin/RabbitQC. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.