Results 1 - 11 of 11
1.
Brief Bioinform ; 25(4)2024 May 23.
Article in English | MEDLINE | ID: mdl-39007599

ABSTRACT

The interaction between T-cell receptors (TCRs) and peptides (epitopes) presented by major histocompatibility complex molecules (MHC) is fundamental to the immune response. Accurate prediction of TCR-epitope interactions is crucial for advancing the understanding of various diseases and their prevention and treatment. Existing methods primarily rely on sequence-based approaches, overlooking the inherent topological structure of TCR-epitope interaction networks. In this study, we present GTE, a novel heterogeneous Graph neural network model based on inductive learning to capture the topological structure between TCRs and Epitopes. Furthermore, we address the challenge of constructing negative samples within the graph by proposing a dynamic edge update strategy, enhancing model learning with the nonbinding TCR-epitope pairs. Additionally, to overcome data imbalance, we adapt the Deep AUC Maximization strategy to the graph domain. Extensive experiments are conducted on four public datasets to demonstrate the superiority of exploring underlying topological structures in predicting TCR-epitope interactions, illustrating the benefits of delving into complex molecular networks. The implementation code and data are available at https://github.com/uta-smile/GTE.
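As a rough illustration of why an AUC-based objective suits imbalanced binding data, the sketch below computes AUC as the fraction of (binding, nonbinding) pairs that a scorer ranks correctly. It is not the GTE implementation and not the Deep AUC Maximization objective itself; the function name `pairwise_auc` and the toy scores are assumptions made for this example only.

```python
def pairwise_auc(scores, labels):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties count as half a correct pair. Illustrative helper, not from the paper.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    correct = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                correct += 1.0
            elif p == n:
                correct += 0.5
    return correct / (len(pos) * len(neg))


# Toy data: one binding pair among four nonbinding pairs (hypothetical scores).
scores = [0.4, 0.8, 0.3, 0.2, 0.1]
labels = [1, 0, 0, 0, 0]
auc = pairwise_auc(scores, labels)
print(auc)
```

Because the metric depends only on how positives rank against negatives, it is not inflated by the sheer number of nonbinding pairs, unlike plain accuracy.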


Subject(s)
Receptors, Antigen, T-Cell , Receptors, Antigen, T-Cell/chemistry , Receptors, Antigen, T-Cell/immunology , Receptors, Antigen, T-Cell/metabolism , Humans , Epitopes, T-Lymphocyte/immunology , Epitopes, T-Lymphocyte/chemistry , Neural Networks, Computer , Computational Biology/methods , Protein Binding , Epitopes/chemistry , Epitopes/immunology , Algorithms , Software
2.
J Comput Biol ; 31(3): 213-228, 2024 03.
Article in English | MEDLINE | ID: mdl-38531049

ABSTRACT

Molecular prediction tasks normally demand a series of professional experiments to label the target molecule, so labeled data are limited. One of the semisupervised learning paradigms, known as self-training, utilizes both labeled and unlabeled data. Specifically, a teacher model is trained using labeled data and produces pseudo labels for unlabeled data. These labeled and pseudo-labeled data are then jointly used to train a student model. However, the pseudo labels generated from the teacher model are generally not sufficiently accurate. Thus, we propose a robust self-training strategy by exploring robust loss functions to handle such noisy labels in two paradigms, that is, generic and adaptive. We have conducted experiments on three molecular biology prediction tasks with four backbone models to gradually evaluate the performance of the proposed robust self-training strategy. The results demonstrate that the proposed method enhances prediction performance across all tasks, notably within molecular regression tasks, where there has been an average enhancement of 41.5%. Furthermore, the visualization analysis confirms the superiority of our method. Our proposed robust self-training is a simple yet effective strategy that efficiently improves molecular biology prediction performance. It tackles the shortage of labeled data in molecular biology by taking advantage of both labeled and unlabeled data. Moreover, it can be easily embedded in any prediction task, serving as a universal approach for the bioinformatics community.
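The abstract does not name the robust loss it uses. As a minimal sketch of the general idea, assuming the Huber loss as a stand-in, the example below shows how a robust loss limits the influence of an inaccurate pseudo label in a regression setting. The helper `student_loss` and the toy values are hypothetical.

```python
def squared_loss(pred, target):
    """Standard squared error, used here for trusted (experimentally labeled) data."""
    return 0.5 * (pred - target) ** 2


def huber_loss(pred, target, delta=1.0):
    """Huber loss: quadratic for small errors, linear beyond `delta`.

    Assumed stand-in for the paper's robust loss, which the abstract does not specify.
    """
    err = abs(pred - target)
    if err <= delta:
        return 0.5 * err ** 2
    return delta * (err - 0.5 * delta)


def student_loss(predict, labeled, pseudo_labeled, delta=1.0):
    """Mean loss over labeled pairs (squared) and pseudo-labeled pairs (Huber)."""
    clean = sum(squared_loss(predict(x), y) for x, y in labeled)
    noisy = sum(huber_loss(predict(x), y, delta) for x, y in pseudo_labeled)
    return (clean + noisy) / (len(labeled) + len(pseudo_labeled))


# A badly wrong pseudo label (target 5.0) costs 12.5 under squared loss
# but only 4.5 under Huber loss with delta = 1.0.
print(squared_loss(0.0, 5.0), huber_loss(0.0, 5.0))
```

The linear tail means a single wrong pseudo label contributes a bounded gradient, so the student is pulled less strongly toward the teacher's mistakes.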


Subject(s)
Computational Biology , Molecular Biology , Humans , Supervised Machine Learning
3.
Chem Res Toxicol ; 36(8): 1206-1226, 2023 08 21.
Article in English | MEDLINE | ID: mdl-37562046

ABSTRACT

The development of new drugs is time-consuming and expensive, and as such, accurately predicting the potential toxicity of a drug candidate is crucial in ensuring its safety and efficacy. Recently, deep graph learning has become prevalent in this field due to its computational power and cost efficiency. Many novel deep graph learning methods aid toxicity prediction and further prompt drug development. This review aims to connect fundamental knowledge with burgeoning deep graph learning methods. We first summarize the essential components of deep graph learning models for toxicity prediction, including molecular descriptors, molecular representations, evaluation metrics, validation methods, and data sets. Furthermore, based on various graph-related representations of molecules, we introduce several representative studies and methods for toxicity prediction from the perspective of GNN architectures and graph pretrained models. Compared to other types of models, deep graph models not only offer higher accuracy and efficiency but also provide more intuitive insights, which is significant for the development of model interpretation and generalization ability. The graph pretrained models are emerging as they can extract prominent features from large-scale unlabeled molecular graph data and improve the performance of downstream toxicity prediction tasks. We hope this survey can serve as a handbook for individuals interested in exploring deep graph learning for toxicity prediction.


Subject(s)
Drug Development , Pharmaceutical Preparations , Drug-Related Side Effects and Adverse Reactions
4.
Front Big Data ; 6: 1108659, 2023.
Article in English | MEDLINE | ID: mdl-36936996

ABSTRACT

The accurate segmentation of nuclei is crucial for cancer diagnosis and further clinical treatments. To successfully train a nuclei segmentation network in a fully-supervised manner for a particular type of organ or cancer, we need a dataset with ground-truth annotations. However, such well-annotated nuclei segmentation datasets are rare, and manually labeling an unannotated dataset is an expensive, time-consuming, and tedious process. Consequently, we need a way to train the nuclei segmentation network with an unlabeled dataset. In this paper, we propose a model named NuSegUDA for nuclei segmentation on the unlabeled dataset (target domain). It is achieved by applying an Unsupervised Domain Adaptation (UDA) technique with the help of another labeled dataset (source domain) that may come from a different type of organ, cancer, or source. We apply the UDA technique in both the feature space and the output space. We additionally utilize a reconstruction network and incorporate adversarial learning into it so that the source-domain images can be accurately translated to the target domain for further training of the segmentation network. We validate our proposed NuSegUDA on two public nuclei segmentation datasets and obtain significant improvement compared with the baseline methods. Extensive experiments also verify the contribution of the newly proposed image reconstruction adversarial loss and target-translated source supervised loss to the performance boost of NuSegUDA. Finally, considering the scenario in which a small number of annotations are available from the target domain, we extend our work and propose NuSegSSDA, a Semi-Supervised Domain Adaptation (SSDA) based approach.

5.
J Comput Biol ; 30(1): 82-94, 2023 01.
Article in English | MEDLINE | ID: mdl-35972373

ABSTRACT

Molecule generation is the procedure to generate initial novel molecule proposals for molecule design. Molecules are first projected into continuous vectors in chemical latent space, and then, these embedding vectors are decoded into molecules under the variational autoencoder (VAE) framework. The continuous latent space of VAE can be utilized to generate novel molecules with desired chemical properties and further optimize the desired chemical properties of molecules. However, there is a posterior collapse problem with the conventional recurrent neural network-based VAEs for the molecule sequence generation, which deteriorates the generation performance. We investigate the posterior collapse problem and find that the underestimated reconstruction loss is the main factor in the posterior collapse problem in molecule sequence generation. To support our conclusion, we present both analytical and experimental evidence. What is more, we propose an efficient and effective solution to fix the problem and prevent posterior collapse. As a result, our method achieves competitive reconstruction accuracy and validity score on the benchmark data sets.
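For readers unfamiliar with the terms, the sketch below writes out the standard VAE training objective with a diagonal Gaussian posterior: a reconstruction term plus a KL term. Posterior collapse corresponds to the KL term shrinking toward zero so that the decoder ignores the latent vector. This is the textbook formulation, assumed here for illustration; it is not the fix proposed in the paper, and the `beta` weight is a hypothetical parameter.

```python
import math


def kl_to_standard_normal(mu, log_var):
    """KL( N(mu, diag(exp(log_var))) || N(0, I) ) for one latent vector."""
    return 0.5 * sum(m * m + math.exp(lv) - lv - 1.0 for m, lv in zip(mu, log_var))


def negative_elbo(reconstruction_nll, mu, log_var, beta=1.0):
    """Loss = reconstruction negative log-likelihood + beta * KL term.

    `beta` is an illustrative weight; beta = 1.0 gives the usual ELBO.
    """
    return reconstruction_nll + beta * kl_to_standard_normal(mu, log_var)


# A collapsed posterior matches the prior exactly, so its KL term is zero.
print(kl_to_standard_normal([0.0, 0.0], [0.0, 0.0]))
# An informative posterior (mean shifted away from zero) has a positive KL term.
print(kl_to_standard_normal([1.0], [0.0]))
```

The relative scale of the two terms matters: if the reconstruction term is too small compared with the KL term, minimizing the KL term dominates, which is consistent with the abstract's point about an underestimated reconstruction loss.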


Subject(s)
Benchmarking , Neural Networks, Computer , Sulfadiazine
6.
Biomolecules ; 12(6)2022 06 02.
Article in English | MEDLINE | ID: mdl-35740899

ABSTRACT

The secondary structure of proteins is significant for studying the three-dimensional structure and functions of proteins. Several models from image understanding and natural language modeling have been successfully adapted in the protein sequence study area, such as Long Short-term Memory (LSTM) network and Convolutional Neural Network (CNN). Recently, Gated Convolutional Neural Network (GCNN) has been proposed for natural language processing. It has achieved high levels of sentence scoring while reducing latency. Conditionally Parameterized Convolution (CondConv) is another novel study which has gained great success in the image processing area. Compared with vanilla CNN, CondConv uses extra sample-dependent modules to conditionally adjust the convolutional network. In this paper, we propose a novel Conditionally Parameterized Convolutional network (CondGCNN) which utilizes the power of both CondConv and GCNN. CondGCNN leverages an ensemble encoder to combine the capabilities of both LSTM and CondGCNN to encode protein sequences by better capturing protein sequential features. In addition, we explore the similarity between the secondary structure prediction problem and the image segmentation problem, and propose an ASP network (Atrous Spatial Pyramid Pooling (ASPP) based network) to capture fine boundary details in secondary structure. Extensive experiments show that the proposed method can achieve higher performance on the protein secondary structure prediction task than existing methods on the CB513, CASP11, CASP12, CASP13, and CASP14 datasets. We also conducted ablation studies over each component to verify its effectiveness. Our method is expected to be useful for any protein-related prediction task, not only protein secondary structure prediction.
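The gating idea behind GCNN can be sketched in a few lines: one convolution produces content values and a second convolution, passed through a sigmoid, decides how much of each value to let through. The code below is a single-channel, pure-Python illustration of that gated linear unit; it is not the CondGCNN architecture, and the kernels are arbitrary toy values.

```python
import math


def conv1d(seq, kernel, bias=0.0):
    """Valid (no padding) 1D convolution of a numeric sequence with a kernel."""
    k = len(kernel)
    return [
        sum(seq[i + j] * kernel[j] for j in range(k)) + bias
        for i in range(len(seq) - k + 1)
    ]


def gated_conv1d(seq, content_kernel, gate_kernel):
    """Gated linear unit: content * sigmoid(gate), computed position by position."""
    content = conv1d(seq, content_kernel)
    gate = conv1d(seq, gate_kernel)
    return [c / (1.0 + math.exp(-g)) for c, g in zip(content, gate)]


# With an all-zero gate kernel the sigmoid is 0.5 everywhere,
# so every content value is simply halved.
print(gated_conv1d([1.0, 2.0, 3.0], [1.0, 1.0], [0.0, 0.0]))
```

In a trained network the gate kernel is learned, so the model can suppress or pass features depending on the local sequence context.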


Subject(s)
Deep Learning , Image Processing, Computer-Assisted/methods , Neural Networks, Computer , Protein Structure, Secondary , Proteins/chemistry
7.
Bioinformatics ; 38(7): 2003-2009, 2022 03 28.
Article in English | MEDLINE | ID: mdl-35094072

ABSTRACT

MOTIVATION: The crux of molecular property prediction is to generate meaningful representations of the molecules. One promising route is to exploit the molecular graph structure through graph neural networks (GNNs). Both atoms and bonds significantly affect the chemical properties of a molecule, so an expressive model ought to exploit both node (atom) and edge (bond) information simultaneously. Inspired by this observation, we explore multi-view modeling with GNN (MVGNN) to form a novel paralleled framework, which considers both atoms and bonds equally important when learning molecular representations. Specifically, one view is atom-central and the other view is bond-central; the two views are then circulated via specifically designed components to enable more accurate predictions. To further enhance the expressive power of MVGNN, we propose a cross-dependent message-passing scheme to enhance information communication between the views. The overall framework is termed CD-MVGNN. RESULTS: We theoretically justify the expressiveness of the proposed model in terms of distinguishing non-isomorphic graphs. Extensive experiments demonstrate that CD-MVGNN achieves remarkably superior performance over the state-of-the-art models on various challenging benchmarks. Meanwhile, visualization results of the node importance are consistent with prior knowledge, which confirms the interpretability of CD-MVGNN. AVAILABILITY AND IMPLEMENTATION: The code and data underlying this work are available in GitHub at https://github.com/uta-smile/CD-MVGNN. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.


Subject(s)
Benchmarking , Neural Networks, Computer
8.
J Comput Biol ; 28(4): 346-361, 2021 04.
Article in English | MEDLINE | ID: mdl-33617347

ABSTRACT

Accurate predictions of protein structure properties, for example, secondary structure and solvent accessibility, are essential in analyzing the structure and function of a protein. Position-specific scoring matrix (PSSM) features are widely used in structure property prediction. However, some proteins may have low-quality PSSM features due to insufficient homologous sequences, leading to limited prediction accuracy. To address this limitation, we propose an enhancing scheme for PSSM features. We introduce the "Bagging MSA" (multiple sequence alignment) method to calculate PSSM features used to train our model, adopt a convolutional network to capture local context features and bidirectional long short-term memory for long-term dependencies, and integrate them under an unsupervised framework. Structure property prediction models are then built upon such enhanced PSSM features for more accurate predictions. Moreover, we develop two frameworks to evaluate the effectiveness of the enhanced PSSM features, which also bring the proposed method into real-world scenarios. Empirical evaluation on the CB513, CASP11, and CASP12 data sets indicates that our unsupervised enhancing scheme indeed generates more informative PSSM features for structure property prediction.
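To make the PSSM notion concrete, the sketch below derives a simple log-odds matrix from a toy multiple sequence alignment. It is a simplified textbook calculation, not the Bagging MSA method: it assumes a uniform background frequency and a flat pseudocount, and it omits the sequence weighting that production tools apply. The function name `pssm_from_msa` and the toy alignment are hypothetical.

```python
import math

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def pssm_from_msa(msa, pseudocount=1.0):
    """Return one {amino_acid: log2-odds score} dict per alignment column.

    Assumes a uniform background (1/20) and a flat pseudocount per residue.
    """
    length = len(msa[0])
    if any(len(seq) != length for seq in msa):
        raise ValueError("all aligned sequences must have the same length")
    background = 1.0 / len(AMINO_ACIDS)
    pssm = []
    for col in range(length):
        counts = {aa: pseudocount for aa in AMINO_ACIDS}
        for seq in msa:
            if seq[col] in counts:  # skip gaps and unknown residues
                counts[seq[col]] += 1.0
        total = sum(counts.values())
        pssm.append(
            {aa: math.log2(counts[aa] / total / background) for aa in AMINO_ACIDS}
        )
    return pssm


# Toy alignment of three homologous sequences ("-" marks a gap).
msa = ["ACD", "ACE", "A-D"]
pssm = pssm_from_msa(msa)
print(round(pssm[0]["A"], 3), round(pssm[0]["C"], 3))
```

With only three sequences the pseudocounts make up most of each column total, so the scores stay close to zero. This is the low-quality regime the abstract describes for proteins with few homologs.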


Subject(s)
Computational Biology , Deep Learning , Protein Conformation , Proteins/ultrastructure , Algorithms , Neural Networks, Computer , Position-Specific Scoring Matrices , Protein Structure, Secondary/genetics , Proteins/genetics , Sequence Alignment
9.
Chem Res Toxicol ; 34(2): 495-506, 2021 02 15.
Article in English | MEDLINE | ID: mdl-33347312

ABSTRACT

Drug-induced liver injury (DILI) is a crucial factor in determining the qualification of potential drugs. However, the DILI property is excessively difficult to obtain due to the complex testing process. Consequently, an in silico screening in the early stage of drug discovery would help to reduce the total development cost by filtering out drug candidates with a high risk of causing DILI. To serve the screening goal, we apply several computational techniques to predict the DILI property, including traditional machine learning methods and graph-based deep learning techniques. While deep learning models require large training data to tune huge numbers of model parameters, the DILI data set only contains a few hundred annotated molecules. To alleviate the data scarcity problem, we propose a property augmentation strategy to include massive training data with other property information. Extensive experiments demonstrate that our proposed method significantly outperforms all existing baselines on the DILI data set, obtaining an accuracy of 81.4% using cross-validation with random splitting, 78.7% using leave-one-out cross-validation, and 76.5% using cross-validation with scaffold splitting.
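The three evaluation protocols differ in how molecules are assigned to held-out sets. The sketch below illustrates two of them under simple assumptions: scaffold splitting keeps every molecule that shares a scaffold in the same fold, and leave-one-out holds out one molecule at a time. The scaffold keys are plain strings supplied by the caller (in practice they would come from a cheminformatics toolkit), and the greedy fold assignment is one common heuristic, not necessarily the one used in the paper.

```python
from collections import defaultdict


def scaffold_folds(scaffolds, n_folds):
    """Assign sample indices to folds so that no scaffold spans two folds.

    `scaffolds` holds one precomputed scaffold key per molecule (hypothetical input).
    """
    groups = defaultdict(list)
    for idx, key in enumerate(scaffolds):
        groups[key].append(idx)
    folds = [[] for _ in range(n_folds)]
    # Largest scaffold groups first, each placed in the currently smallest fold.
    for key in sorted(groups, key=lambda k: (-len(groups[k]), k)):
        smallest = min(range(n_folds), key=lambda f: len(folds[f]))
        folds[smallest].extend(groups[key])
    return folds


def leave_one_out(n_samples):
    """Yield (train_indices, test_indices) with exactly one held-out sample."""
    for held_out in range(n_samples):
        train = [i for i in range(n_samples) if i != held_out]
        yield train, [held_out]


scaffolds = ["benzene", "benzene", "pyridine", "indole", "indole", "indole"]
print(scaffold_folds(scaffolds, 2))
print(list(leave_one_out(3)))
```

Scaffold splitting is usually the hardest of the three because the test molecules are structurally unlike the training molecules, which is consistent with it giving the lowest of the reported accuracies.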


Subject(s)
Chemical and Drug Induced Liver Injury , Deep Learning , Models, Chemical , Pharmaceutical Preparations/chemistry , Humans , Molecular Structure
10.
J Comput Biol ; 28(4): 362-364, 2021 04.
Article in English | MEDLINE | ID: mdl-33259717

ABSTRACT

Recently, a deep learning-based method for enhancing Position-Specific Scoring Matrix (PSSM) features, Bagging Multiple Sequence Alignment (MSA) Learning (Guo et al.), has been proposed, and its effectiveness has been demonstrated empirically. The program EPTool is the implementation of Bagging MSA Learning, which provides a complete training and evaluation workflow for the enhancing PSSM model. It can handle different input data sets and various computing algorithms to train the enhancing model, and eventually improves the PSSM quality for proteins with insufficient homologous sequences. In addition, EPTool provides several convenient applications, such as a PSSM feature calculator and PSSM feature visualization. In this article, we present EPTool and briefly introduce its functionalities and applications. Detailed instructions for access are also provided.


Subject(s)
Protein Conformation , Protein Structure, Secondary/genetics , Proteins/ultrastructure , Software , Algorithms , Computational Biology , Databases, Protein , Position-Specific Scoring Matrices , Proteins/genetics , Sequence Alignment
11.
Front Plant Sci ; 9: 1023, 2018.
Article in English | MEDLINE | ID: mdl-30073008

ABSTRACT

The improvement of fiber quality is an essential goal in cotton breeding. In our previous studies, several quantitative trait loci (QTLs) contributing to improved fiber quality were identified in different introgressed chromosomal regions from Sea Island cotton (Gossypium barbadense L.) in a primary introgression population (Pop. A) of upland cotton (G. hirsutum L.). In the present study, to finely map introgressed major QTLs and accurately dissect the genetic contribution of the target introgressed chromosomal segments, we backcrossed two selected recombinant inbred lines (RILs) that presented desirable high fiber quality with their high lint-yielding recurrent parent to ultimately develop two secondary mapping populations (Pop. B and Pop. C). Totals of 20 and 27 QTLs for fiber quality were detected in Pop. B and Pop. C, respectively, including four and five for fiber length, four and eight for fiber micronaire, two and four for fiber uniformity, five and four for fiber elongation, and six and four for fiber strength, respectively. Two QTLs for lint percentage were detected only in Pop. C. In addition, seven stable QTLs were identified, including two for both fiber length and fiber strength and three for fiber elongation. Five QTL clusters for fiber quality were identified in the introgressed chromosomal regions, and negative effects of these chromosomal regions on lint percentage (a major lint yield parameter) were not observed. Candidate genes with a QTL-cluster associated with fiber strength and fiber length in the introgressed region of Chr.7 were further identified. The results may be helpful for revealing the genetic basis of superior fiber quality contributed by introgressed alleles from G. barbadense. Possible strategies involving marker-assisted selection (MAS) for simultaneously improving upland cotton fiber quality and lint yield in breeding programs were also discussed.
