Results 1 - 20 of 249
1.
Brief Bioinform ; 25(2), 2024 Jan 22.
Article in English | MEDLINE | ID: mdl-38517696

ABSTRACT

With the rapid development of single-molecule sequencing (SMS) technologies, output read lengths are continuously increasing. Mapping such reads onto a reference genome is one of the most fundamental tasks in sequence analysis. Mapping sensitivity is becoming a major concern, since a more sensitive mapper detects more aligned regions on the reference and obtains more aligned bases, both of which are useful for downstream analysis. In this study, we present pathMap, a novel k-mer graph-based mapper that is specifically designed for mapping SMS reads with high sensitivity. By viewing the alignment chain as a path containing as many anchors as possible in the matched k-mer graph, pathMap treats chaining as a path selection problem in a directed graph. Because pathMap iteratively searches for the longest path among the remaining nodes, more high-quality candidate chains can be detected and aligned. Experimental results on simulated and real-life datasets demonstrate that, compared with other state-of-the-art mapping methods such as minimap2 and Winnowmap2, pathMap obtains at least 11.50% more mapped chains than its closest competitor and increases the mapping sensitivity by 17.28% and 13.84% of bases over the next-best mapper for Pacific Biosciences and Oxford Nanopore sequencing data, respectively. In addition, pathMap is more robust to sequencing errors and more sensitive in species- and strain-specific identification of pathogens using MinION reads.


Subjects
High-Throughput Nucleotide Sequencing , Nanopore Sequencing , Sequence Analysis, DNA/methods , High-Throughput Nucleotide Sequencing/methods , Genome , Software , Algorithms
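
The chaining idea in the pathMap abstract above (treat chaining as repeated longest-path selection in a directed graph of anchors) can be illustrated with a minimal sketch. It is not the authors' implementation: the edge rule (two anchors are connected when both read and reference positions increase), the unit weight per anchor, and the names `longest_chain`, `iterative_chains` and `min_anchors` are assumptions made for illustration only.

```python
from typing import List, Tuple

Anchor = Tuple[int, int]  # (read_pos, ref_pos) of one matched k-mer


def longest_chain(anchors: List[Anchor]) -> List[Anchor]:
    """Longest path in the DAG whose edges join colinear anchors.

    Assumption: an edge a -> b exists when b lies strictly after a
    on both the read and the reference; every anchor has weight 1.
    """
    nodes = sorted(anchors)
    if not nodes:
        return []
    best = [1] * len(nodes)   # best[j]: anchors on the longest path ending at j
    prev = [-1] * len(nodes)  # back-pointers for path recovery
    for j, (read_j, ref_j) in enumerate(nodes):
        for i in range(j):
            read_i, ref_i = nodes[i]
            if read_i < read_j and ref_i < ref_j and best[i] + 1 > best[j]:
                best[j] = best[i] + 1
                prev[j] = i
    end = max(range(len(nodes)), key=best.__getitem__)
    chain = []
    while end != -1:
        chain.append(nodes[end])
        end = prev[end]
    return chain[::-1]


def iterative_chains(anchors: List[Anchor], min_anchors: int = 2) -> List[List[Anchor]]:
    """Repeatedly extract the longest path from the remaining nodes."""
    remaining = list(anchors)
    chains = []
    while remaining:
        chain = longest_chain(remaining)
        if len(chain) < min_anchors:  # hypothetical stopping rule
            break
        chains.append(chain)
        used = set(chain)
        remaining = [a for a in remaining if a not in used]
    return chains


if __name__ == "__main__":
    demo = [(0, 100), (10, 110), (20, 120), (30, 130), (5, 500), (15, 510), (40, 50)]
    for chain in iterative_chains(demo):
        print(chain)
```

In this toy input the first pass recovers the four-anchor chain, and the second pass still finds the two-anchor chain at a different reference locus, which a single-best-chain strategy would discard.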
2.
Brief Bioinform ; 23(3), 2022 05 13.
Article in English | MEDLINE | ID: mdl-35323901

ABSTRACT

MOTIVATION: MicroRNAs (miRNAs), as critical regulators, are involved in various fundamental and vital biological processes, and their abnormalities are closely related to human diseases. Predicting disease-related miRNAs is beneficial to uncovering new biomarkers for the prevention, detection, prognosis, diagnosis and treatment of complex diseases. RESULTS: In this study, we propose a multi-view Laplacian regularized deep factorization machine (DeepFM) model, MLRDFM, to predict novel miRNA-disease associations while improving the standard DeepFM. Specifically, MLRDFM improves DeepFM from two aspects: first, MLRDFM takes the relationships among items into consideration by regularizing their embedding features via their similarity-based Laplacians. In this study, miRNA Laplacian regularization integrates four types of miRNA similarity, while disease Laplacian regularization integrates two types of disease similarity. Second, to judiciously train our model, Laplacian eigenmaps are utilized to initialize the weights in the dense embedding layer. The experimental results on the latest HMDD v3.2 dataset show that MLRDFM improves the performance and reduces the overfitting phenomenon of DeepFM. Besides, MLRDFM is greatly superior to the state-of-the-art models in miRNA-disease association prediction in terms of different evaluation metrics with the 5-fold cross-validation. Furthermore, case studies further demonstrate the effectiveness of MLRDFM.


Subjects
MicroRNAs , Algorithms , Computational Biology/methods , Genetic Predisposition to Disease , Humans , MicroRNAs/genetics
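
The Laplacian regularization mentioned in the MLRDFM abstract above can be made concrete with a small sketch. It shows only the generic graph-Laplacian penalty on embeddings, tr(EᵀLE) with L = D − S, which equals half the similarity-weighted sum of squared distances between embedding pairs when S is symmetric. How MLRDFM combines its several similarity types and weights the penalty is not stated in the abstract, so none of that is modelled here; the function names are hypothetical.

```python
from typing import List

Matrix = List[List[float]]


def laplacian(sim: Matrix) -> Matrix:
    """Graph Laplacian L = D - S of a symmetric similarity matrix S."""
    n = len(sim)
    return [
        [(sum(sim[i]) if i == j else 0.0) - sim[i][j] for j in range(n)]
        for i in range(n)
    ]


def laplacian_penalty(emb: Matrix, sim: Matrix) -> float:
    """tr(E^T L E): small when similar items have nearby embeddings."""
    lap = laplacian(sim)
    n, d = len(emb), len(emb[0])
    return sum(
        emb[i][k] * lap[i][j] * emb[j][k]
        for i in range(n)
        for j in range(n)
        for k in range(d)
    )


def pairwise_penalty(emb: Matrix, sim: Matrix) -> float:
    """Equivalent form: 0.5 * sum_ij S_ij * ||e_i - e_j||^2."""
    n = len(emb)
    return 0.5 * sum(
        sim[i][j] * sum((a - b) ** 2 for a, b in zip(emb[i], emb[j]))
        for i in range(n)
        for j in range(n)
    )


if __name__ == "__main__":
    similarity = [[1.0, 0.5, 0.0], [0.5, 1.0, 0.2], [0.0, 0.2, 1.0]]
    embeddings = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
    print(laplacian_penalty(embeddings, similarity))
    print(pairwise_penalty(embeddings, similarity))
```

Adding such a term to a training loss pulls the embeddings of similar items together, which is the sense in which the relationships among items are "taken into consideration".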
3.
Brief Bioinform ; 23(1), 2022 01 17.
Article in English | MEDLINE | ID: mdl-34864856

ABSTRACT

Drug repositioning aims to find novel uses for existing drugs. Among the many types of drug repositioning approaches, predicting drug-drug interactions (DDIs) helps explore the pharmacological functions of drugs and identify candidate drugs for novel treatments. A number of models have been applied to predict DDIs. The DDI network, which is constructed from the known DDIs, is a common component of many existing methods. However, DDIs differ in function, and thus integrating them in a single DDI graph may overlook some useful information. We propose a graph convolutional network with multi-kernel (GCNMK) to predict potential DDIs. GCNMK adopts two DDI graph kernels for the graph convolutional layers, namely, an increased DDI graph consisting of 'increase'-related DDIs and a decreased DDI graph consisting of 'decrease'-related DDIs. The learned drug features are fed into a block with three fully connected layers for the DDI prediction. We compare various types of drug features and find that the target feature of drugs outperforms all other types of features and their concatenated features. In comparison with three different DDI prediction methods, our proposed GCNMK achieves the best performance in terms of area under the receiver operating characteristic curve and area under the precision-recall curve. In case studies, we identify the top 20 potential DDIs from all unknown DDIs, and the top 10 potential DDIs from the unknown DDIs among breast, colorectal and lung neoplasms-related drugs. Most of them have evidence to support the existence of their interactions.


Subjects
Algorithms , Drug Repositioning , Drug Interactions , ROC Curve
4.
Brief Bioinform ; 23(3), 2022 05 13.
Article in English | MEDLINE | ID: mdl-35275996

ABSTRACT

MOTIVATION: Identifying disease-related genes is an important issue in computational biology. Module structure widely exists in biomolecule networks, and complex diseases are usually thought to be caused by perturbations of local neighborhoods in the networks, which can provide useful insights for the study of disease-related genes. However, mining and effectively utilizing the module structure is still challenging in tasks such as disease-gene prediction. RESULTS: We propose a hybrid disease-gene prediction method integrating multiscale module structure (HyMM), which can utilize multiscale information from local to global structure to predict disease-related genes more effectively. HyMM extracts module partitions from local to global scales by multiscale modularity optimization with exponential sampling, and estimates the disease relatedness of genes in partitions by the abundance of disease-related genes within modules. Then, a probabilistic model for integration of gene rankings is designed in order to integrate multiple predictions derived from multiscale module partitions and network propagation, and a parameter estimation strategy based on functional information is proposed to further enhance HyMM's predictive power. Through a series of experiments, we reveal the importance of module partitions at different scales, and verify the stable and good performance of HyMM compared with eight other state-of-the-art methods, as well as its further performance improvement derived from the parameter estimation. CONCLUSIONS: The results confirm that HyMM is an effective framework for integrating multiscale module structure to enhance the ability to predict disease-related genes, which may provide useful insights for the study of the multiscale module structure and its application in tasks such as disease-gene prediction.


Subjects
Algorithms , Computational Biology , Computational Biology/methods , Models, Statistical , Proteins
5.
Brief Bioinform ; 23(2), 2022 03 10.
Article in English | MEDLINE | ID: mdl-35136949

ABSTRACT

In recent decades, exploring potential relationships between diseases has been an active research field. With the rapid accumulation of disease-related biomedical data, a lot of computational methods and tools/platforms have been developed to reveal intrinsic relationship between diseases, which can provide useful insights to the study of complex diseases, e.g. understanding molecular mechanisms of diseases and discovering new treatment of diseases. Human complex diseases involve both external phenotypic abnormalities and complex internal molecular mechanisms in organisms. Computational methods with different types of biomedical data from phenotype to genotype can evaluate disease-disease associations at different levels, providing a comprehensive perspective for understanding diseases. In this review, available biomedical data and databases for evaluating disease-disease associations are first summarized. Then, existing computational methods for disease-disease associations are reviewed and classified into five groups in terms of the usages of biomedical data, including disease semantic-based, phenotype-based, function-based, representation learning-based and text mining-based methods. Further, we summarize software tools/platforms for computation and analysis of disease-disease associations. Finally, we give a discussion and summary on the research of disease-disease associations. This review provides a systematic overview for current disease association research, which could promote the development and applications of computational methods and tools/platforms for disease-disease associations.


Subjects
Computational Biology , Data Mining , Computational Biology/methods , Data Mining/methods , Databases, Factual , Phenotype , Software
6.
Brief Bioinform ; 23(1), 2022 01 17.
Article in English | MEDLINE | ID: mdl-34498677

ABSTRACT

Long non-coding RNAs (lncRNAs) are a class of RNA molecules with more than 200 nucleotides. A growing amount of evidence reveals that subcellular localization of lncRNAs can provide valuable insights into their biological functions. Existing computational methods for predicting lncRNA subcellular localization use k-mer features to encode lncRNA sequences. However, the sequence order information is lost when only k-mer features are used. We proposed a deep learning framework, DeepLncLoc, to predict lncRNA subcellular localization. In DeepLncLoc, we introduced a new subsequence embedding method that keeps the order information of lncRNA sequences. The subsequence embedding method first divides a sequence into consecutive subsequences, then extracts the patterns of each subsequence, and finally combines these patterns to obtain a complete representation of the lncRNA sequence. After that, a text convolutional neural network is employed to learn high-level features and perform the prediction task. Compared with traditional machine learning models, popular representation methods and existing predictors, DeepLncLoc achieved better performance, which shows that DeepLncLoc could effectively predict lncRNA subcellular localization. Our study not only presented a novel computational model for predicting lncRNA subcellular localization but also introduced a new subsequence embedding method, which is expected to be applicable to other sequence-based prediction tasks. The DeepLncLoc web server is freely accessible at http://bioinformatics.csu.edu.cn/DeepLncLoc/, and source code and datasets can be downloaded from https://github.com/CSUBioGroup/DeepLncLoc.


Subjects
Deep Learning , RNA, Long Noncoding , Computational Biology/methods , Neural Networks, Computer , RNA, Long Noncoding/genetics , Software
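
The subsequence embedding steps described in the DeepLncLoc abstract above (split into consecutive subsequences, extract a pattern per subsequence, combine in order) can be sketched as follows. This is an illustration under stated assumptions rather than the published method: a normalized k-mer frequency profile stands in for whatever per-subsequence pattern DeepLncLoc actually extracts, and `n_parts`, `k` and all function names are hypothetical.

```python
from itertools import product
from typing import List


def split_consecutive(seq: str, n_parts: int) -> List[str]:
    """Cut seq into n_parts consecutive, non-overlapping, near-equal pieces."""
    size, extra = divmod(len(seq), n_parts)
    parts, start = [], 0
    for i in range(n_parts):
        end = start + size + (1 if i < extra else 0)
        parts.append(seq[start:end])
        start = end
    return parts


def kmer_profile(subseq: str, k: int, alphabet: str = "ACGT") -> List[float]:
    """Normalized k-mer frequencies (a stand-in for the per-subsequence pattern)."""
    kmers = ["".join(p) for p in product(alphabet, repeat=k)]
    counts = dict.fromkeys(kmers, 0)
    for i in range(len(subseq) - k + 1):
        word = subseq[i:i + k]
        if word in counts:  # skip k-mers with characters outside the alphabet
            counts[word] += 1
    total = sum(counts.values())
    return [counts[w] / total if total else 0.0 for w in kmers]


def subsequence_embedding(seq: str, n_parts: int = 4, k: int = 2) -> List[List[float]]:
    """One profile per consecutive subsequence, kept in sequence order."""
    return [kmer_profile(part, k) for part in split_consecutive(seq, n_parts)]


if __name__ == "__main__":
    # Same global composition, different order: the rows appear in swapped order.
    print(subsequence_embedding("AAAACCCC", n_parts=2, k=1))
    print(subsequence_embedding("CCCCAAAA", n_parts=2, k=1))
```

The two demo sequences have identical whole-sequence k-mer profiles, yet their row-ordered representations differ, which is the order information that a single global k-mer vector loses.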
7.
Brief Bioinform ; 23(1), 2022 01 17.
Article in English | MEDLINE | ID: mdl-34953465

ABSTRACT

Alzheimer's disease (AD) has a strong genetic predisposition. However, its risk genes remain incompletely identified. We developed an Alzheimer's brain gene network-based approach to predict AD-associated genes by leveraging the functional pattern of known AD-associated genes. Our constructed network outperformed existing networks in predicting AD genes. We then systematically validated the predictions using independent genetic, transcriptomic, proteomic data, neuropathological and clinical data. First, top-ranked genes were enriched in AD-associated pathways. Second, using external gene expression data from the Mount Sinai Brain Bank study, we found that the top-ranked genes were significantly associated with neuropathological and clinical traits, including the Consortium to Establish a Registry for Alzheimer's Disease score, Braak stage score and clinical dementia rating. The analysis of Alzheimer's brain single-cell RNA-seq data revealed cell-type-specific association of predicted genes with early pathology of AD. Third, by interrogating proteomic data in the Religious Orders Study and Memory and Aging Project and Baltimore Longitudinal Study of Aging studies, we observed a significant association of protein expression level with cognitive function and AD clinical severity. The network, method and predictions could become a valuable resource to advance the identification of risk genes for AD.


Subjects
Alzheimer Disease/genetics , Alzheimer Disease/metabolism , Brain/metabolism , Gene Regulatory Networks , Genetic Predisposition to Disease , Aging/genetics , Gene Expression Profiling , Humans , Longitudinal Studies , Memory , Proteomics , RNA-Seq , Transcriptome
8.
Bioinformatics ; 39(1), 2023 01 01.
Article in English | MEDLINE | ID: mdl-36458923

ABSTRACT

MOTIVATION: Protein essentiality is usually accepted to be a conditional trait and strongly affected by cellular environments. However, existing computational methods often do not take such characteristics into account, preferring to incorporate all available data and train a general model for all cell lines. In addition, the lack of model interpretability limits further exploration and analysis of essential protein predictions. RESULTS: In this study, we proposed DeepCellEss, a sequence-based interpretable deep learning framework for cell line-specific essential protein predictions. DeepCellEss utilizes a convolutional neural network and bidirectional long short-term memory to learn short- and long-range latent information from protein sequences. Further, a multi-head self-attention mechanism is used to provide residue-level model interpretability. For model construction, we collected extremely large-scale benchmark datasets across 323 cell lines. Extensive computational experiments demonstrate that DeepCellEss yields effective prediction performance for different cell lines and outperforms existing sequence-based methods as well as network-based centrality measures. Finally, we conducted some case studies to illustrate the necessity of considering specific cell lines and the superiority of DeepCellEss. We believe that DeepCellEss can serve as a useful tool for predicting essential proteins across different cell lines. AVAILABILITY AND IMPLEMENTATION: The DeepCellEss web server is available at http://csuligroup.com:8000/DeepCellEss. The source code and data underlying this study can be obtained from https://github.com/CSUBioGroup/DeepCellEss. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.


Subjects
Deep Learning , Proteins/metabolism , Amino Acid Sequence , Software , Cell Line , Computational Biology/methods
9.
Bioinformatics ; 39(12), 2023 12 01.
Article in English | MEDLINE | ID: mdl-38058196

ABSTRACT

MOTIVATION: Longer reads produced by PacBio or Oxford Nanopore sequencers span the breakpoints of structural variations (SVs) more frequently than shorter reads. As a result, existing long-read mapping methods often generate wrong alignments and variant calls. Compared to deletions and insertions, inversion events are more difficult to detect, since the anchors in inversion regions are nonlinear to those in SV-free regions. To address this issue, this study presents a novel long-read mapping algorithm (named invMap). RESULTS: For each long noisy read, invMap first locates the aligned region with a specifically designed scoring method for chaining, then checks the remaining anchors in the aligned region to discover potential inversions. We benchmark invMap on simulated datasets across different genomes and sequencing coverages; the experimental results demonstrate that invMap locates aligned regions and calls inversion SVs more accurately than the competing methods. The real human genome sequencing dataset of NA12878 illustrates that invMap can effectively find more candidate variant calls for inversions than the competing methods. AVAILABILITY AND IMPLEMENTATION: The invMap software is available at https://github.com/zhang134/invMap.git.


Subjects
Genomics , High-Throughput Nucleotide Sequencing , Humans , Genomics/methods , High-Throughput Nucleotide Sequencing/methods , Software , Algorithms , Genome, Human , Chromosome Inversion , Sequence Analysis, DNA/methods
10.
Methods ; 216: 21-38, 2023 08.
Article in English | MEDLINE | ID: mdl-37315825

ABSTRACT

Single-cell RNA-sequencing (scRNA-seq) data suffer from a lot of zeros. Such dropout events impede the downstream data analyses. We propose BayesImpute to infer and impute dropouts from the scRNA-seq data. Using the expression rate and coefficient of variation of the genes within the cell subpopulation, BayesImpute first determines likely dropouts, and then constructs the posterior distribution for each gene and uses the posterior mean to impute dropout values. Some simulated and real experiments show that BayesImpute can effectively identify dropout events and reduce the introduction of false positive signals. Additionally, BayesImpute successfully recovers the true expression levels of missing values, restores the gene-to-gene and cell-to-cell correlation coefficient, and maintains the biological information in bulk RNA-seq data. Furthermore, BayesImpute boosts the clustering and visualization of cell subpopulations and improves the identification of differentially expressed genes. We further demonstrate that, in comparison to other statistical-based imputation methods, BayesImpute is scalable and fast with minimal memory usage.


Subjects
Single-Cell Gene Expression Analysis , Software , Sequence Analysis, RNA/methods , Bayes Theorem , Single-Cell Analysis/methods , Probability , Gene Expression Profiling
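
The first step named in the BayesImpute abstract above (using a gene's expression rate and coefficient of variation within a cell subpopulation to decide which zeros are likely dropouts) can be sketched as below. The abstract gives no thresholds or exact rule, so the decision rule, the cut-offs `min_rate` and `max_cv`, the choice to compute the coefficient of variation over nonzero values only, and all function names are assumptions made for illustration. The posterior-mean imputation step is not modelled.

```python
from statistics import mean, pstdev
from typing import List, Sequence


def expression_rate(values: Sequence[float]) -> float:
    """Fraction of cells in the subpopulation that express the gene."""
    return sum(v > 0 for v in values) / len(values)


def coefficient_of_variation(values: Sequence[float]) -> float:
    """Standard deviation divided by the mean."""
    m = mean(values)
    return pstdev(values) / m if m > 0 else float("inf")


def likely_dropouts(
    gene_values: Sequence[float],
    min_rate: float = 0.5,  # hypothetical threshold
    max_cv: float = 1.0,    # hypothetical threshold
) -> List[int]:
    """Indices of zero entries treated as probable dropouts.

    Assumed rule: a zero is suspicious when most cells of the same
    subpopulation express the gene at a fairly stable level.
    """
    nonzero = [v for v in gene_values if v > 0]
    if not nonzero:
        return []
    stable = coefficient_of_variation(nonzero) <= max_cv
    if expression_rate(gene_values) >= min_rate and stable:
        return [i for i, v in enumerate(gene_values) if v == 0]
    return []


if __name__ == "__main__":
    print(likely_dropouts([5, 6, 0, 5, 4]))  # widely expressed gene with one zero
    print(likely_dropouts([0, 0, 0, 7, 0]))  # rarely expressed gene
```

Under this assumed rule, the single zero in a widely and stably expressed gene is flagged, while the zeros of a rarely expressed gene are left alone as plausible true non-expression.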
11.
Brief Bioinform ; 22(3), 2021 05 20.
Article in English | MEDLINE | ID: mdl-32427285

ABSTRACT

Advances in sequencing technologies facilitate personalized disease-risk profiling and clinical diagnosis. In recent years, great progress has been made in noninvasive diagnosis based on cell-free DNAs (cfDNAs). This approach exploits the fact that dead cells release DNA fragments into the circulation, and some DNA fragments carry information that indicates their tissues-of-origin (TOOs). Based on the signals used for identifying the TOOs of cfDNAs, the existing methods can be classified into three categories: cfDNA mutation-based methods, methylation pattern-based methods and cfDNA fragmentation pattern-based methods. In cfDNA mutation-based methods, the SNP information or the detected mutations in driver genes of certain diseases are employed to identify the TOOs of cfDNAs. Methylation pattern-based methods identify the TOOs of cfDNAs based on tissue-specific methylation patterns. In cfDNA fragmentation pattern-based methods, cfDNA fragmentation patterns, such as nucleosome positioning or preferred end coordinates of cfDNAs, are used to predict the TOOs of cfDNAs. In this paper, the strategies and challenges in each category are reviewed. Furthermore, the representative applications based on the TOOs of cfDNAs, including noninvasive prenatal testing, noninvasive cancer screening, transplantation rejection monitoring and parasitic infection detection, are also reviewed. Moreover, the challenges and future work in identifying the TOOs of cfDNAs are discussed. Our review provides a comprehensive picture of the development and challenges in identifying the TOOs of cfDNAs, which may help bioinformatics researchers develop new methods to improve the identification of the TOOs of cfDNAs.


Subjects
Cell-Free Nucleic Acids/genetics , Neoplasms/diagnosis , Biomarkers, Tumor/genetics , Cell-Free Nucleic Acids/blood , DNA Methylation , High-Throughput Nucleotide Sequencing/methods , Humans , Mutation , Neoplasms/genetics
12.
Brief Bioinform ; 22(3), 2021 05 20.
Article in English | MEDLINE | ID: mdl-34020541

ABSTRACT

Various microbes have proved to be closely related to the pathogenesis of human diseases. While many computational methods for predicting human microbe-disease associations (MDAs) have been developed, few systematic reviews on these methods have been reported. In this study, we provide a comprehensive overview of the existing methods. Firstly, we introduce the data used in existing MDA prediction methods. Secondly, we classify those methods into different categories by their nature and describe their algorithms and strategies in detail. Next, experimental evaluations are conducted on representative methods using different similarity data and calculation methods to compare their prediction performances. Based on the principles of computational methods and experimental results, we discuss the advantages and disadvantages of those methods and propose suggestions for the improvement of prediction performances. Considering the problems of the MDA prediction at present stage, we discuss future work from three perspectives including data, methods and formulations at the end.


Subjects
Algorithms , Computer Simulation , Databases, Factual , Disease , Microbiota , Models, Biological , Computational Biology , Humans
13.
Brief Bioinform ; 22(2): 1604-1619, 2021 03 22.
Article in English | MEDLINE | ID: mdl-32043521

ABSTRACT

Drug repositioning can drastically decrease the cost and duration taken by traditional drug research and development while avoiding the occurrence of unforeseen adverse events. With the rapid advancement of high-throughput technologies and the explosion of various biological data and medical data, computational drug repositioning methods have been appealing and powerful techniques to systematically identify potential drug-target interactions and drug-disease interactions. In this review, we first summarize the available biomedical data and public databases related to drugs, diseases and targets. Then, we discuss existing drug repositioning approaches and group them based on their underlying computational models consisting of classical machine learning, network propagation, matrix factorization and completion, and deep learning based models. We also comprehensively analyze common standard data sets and evaluation metrics used in drug repositioning, and give a brief comparison of various prediction methods on the gold standard data sets. Finally, we conclude our review with a brief discussion on challenges in computational drug repositioning, which includes the problem of reducing the noise and incompleteness of biomedical data, the ensemble of various computation drug repositioning methods, the importance of designing reliable negative samples selection methods, new techniques dealing with the data sparseness problem, the construction of large-scale and comprehensive benchmark data sets and the analysis and explanation of the underlying mechanisms of predicted interactions.


Subjects
Computer Simulation , Drug Repositioning , Algorithms , Bayes Theorem , Cluster Analysis , Computational Biology/methods , Data Interpretation, Statistical , Deep Learning , Reproducibility of Results , Support Vector Machine
14.
Brief Bioinform ; 22(5), 2021 09 02.
Article in English | MEDLINE | ID: mdl-33822883

ABSTRACT

The rapid increase of genome data brought by gene sequencing technologies poses a massive challenge to data processing. To solve the problems caused by enormous data and complex computing requirements, researchers have proposed many methods and tools which can be divided into three types: big data storage, efficient algorithm design and parallel computing. The purpose of this review is to investigate popular parallel programming technologies for genome sequence processing. Three common parallel computing models are introduced according to their hardware architectures, and each of which is classified into two or three types and is further analyzed with their features. Then, the parallel computing for genome sequence processing is discussed with four common applications: genome sequence alignment, single nucleotide polymorphism calling, genome sequence preprocessing, and pattern detection and searching. For each kind of application, its background is firstly introduced, and then a list of tools or algorithms are summarized in the aspects of principle, hardware platform and computing efficiency. The programming model of each hardware and application provides a reference for researchers to choose high-performance computing tools. Finally, we discuss the limitations and future trends of parallel computing technologies.


Subjects
Electronic Data Processing/methods , Genome, Human , Genomics/methods , Polymorphism, Single Nucleotide , Sequence Alignment/methods , Algorithms , Base Sequence/genetics , Chromosome Mapping/methods , High-Throughput Nucleotide Sequencing/methods , Humans , Information Storage and Retrieval , Software , Whole Genome Sequencing/methods
15.
Brief Bioinform ; 22(4), 2021 07 20.
Article in English | MEDLINE | ID: mdl-33152756

ABSTRACT

Drug similarities play an important role in modern biology and medicine, as they help scientists gain deep insights into drugs' therapeutic mechanisms and conduct wet labs that may significantly improve the efficiency of drug research and development. Nowadays, a number of drug-related databases have been constructed, with which many methods have been developed for computing similarities between drugs for studying associations between drugs, human diseases, proteins (drug targets) and more. In this review, firstly, we briefly introduce the publicly available drug-related databases. Secondly, based on different drug features, interaction relationships and multimodal data, we summarize similarity calculation methods in details. Then, we discuss the applications of drug similarities in various biological and medical areas. Finally, we evaluate drug similarity calculation methods with common evaluation metrics to illustrate the important roles of drug similarity measures on different applications.


Subjects
Computational Biology , Databases, Pharmaceutical , Drug Discovery , Drug Repositioning , Pharmaceutical Preparations
16.
Brief Bioinform ; 22(2): 1729-1750, 2021 03 22.
Article in English | MEDLINE | ID: mdl-32118252

ABSTRACT

Proteins are dominant executors of living processes. Compared to genetic variations, changes in the molecular structure and state of a protein (i.e. proteoforms) are more directly related to pathological changes in diseases. Characterizing proteoforms involves identifying and locating primary structure alterations (PSAs) in proteoforms, which is of practical importance for the advancement of the medical profession. With the development of mass spectrometry (MS) technology, the characterization of proteoforms based on top-down MS technology has become possible. This type of method is relatively new and faces many challenges. Since the proteoform identification is the most important process in characterizing proteoforms, we comprehensively review the existing proteoform identification methods in this study. Before identifying proteoforms, the spectra need to be preprocessed, and protein sequence databases can be filtered to speed up the identification. Therefore, we also summarize some popular deconvolution algorithms, various filtering algorithms for improving the proteoform identification performance and various scoring methods for localizing proteoforms. Moreover, commonly used methods were evaluated and compared in this review. We believe our review could help researchers better understand the current state of the development in this field and design new efficient algorithms for the proteoform characterization.


Subjects
Mass Spectrometry/methods , Proteins/chemistry , Algorithms , Amino Acid Sequence , Databases, Protein
17.
Bioinformatics ; 38(8): 2226-2234, 2022 04 12.
Article in English | MEDLINE | ID: mdl-35150255

ABSTRACT

MOTIVATION: Many studies have shown that microRNAs (miRNAs) play a key role in human diseases. Meanwhile, traditional experimental methods for miRNA-disease association identification are extremely costly, time-consuming and challenging. Therefore, many computational methods have been developed to predict potential associations between miRNAs and diseases. However, those methods mainly predict the existence of miRNA-disease associations, and they cannot predict the deep-level miRNA-disease association types. RESULTS: In this study, we propose a new end-to-end deep learning method (called PDMDA) to predict deep-level miRNA-disease associations with graph neural networks (GNNs) and miRNA sequence features. Based on the sequence and structural features of miRNAs, PDMDA extracts the miRNA feature representations by a fully connected network (FCN). The disease feature representations are extracted from the disease-gene network and gene-gene interaction network by a GNN model. Finally, a multilayer network with three fully connected layers and a softmax layer is designed to predict the final miRNA-disease association scores based on the concatenated feature representations of miRNAs and diseases. Note that PDMDA does not take the miRNA-disease association matrix as input to compute the Gaussian interaction profile similarity. We conduct three experiments based on six types of association samples (circulation, epigenetics, target, genetics, known associations whose types are unknown, and unknown associations). We conduct fivefold cross-validation to assess the prediction performance of PDMDA. The area under the receiver operating characteristic curve is used as the metric. The experimental results show that PDMDA can accurately predict the deep-level miRNA-disease associations. AVAILABILITY AND IMPLEMENTATION: Data and source codes are available at https://github.com/27167199/PDMDA.


Subjects
MicroRNAs , Humans , MicroRNAs/genetics , Algorithms , Computational Biology/methods , Neural Networks, Computer , Software
18.
Methods ; 198: 56-64, 2022 02.
Article in English | MEDLINE | ID: mdl-34364986

ABSTRACT

Complex diseases are caused by a variety of factors, and their diagnosis, treatment and prognosis are usually difficult. Proteins play an indispensable role in living organisms and perform specific biological functions by interacting with other proteins or biomolecules. Because their dysfunction may lead to diseases, it is natural to mine disease-related biomarkers from the protein-protein interaction network. AUC, the area under the receiver operating characteristic (ROC) curve, is regarded as a gold standard for evaluating the effectiveness of a binary classifier, since it measures the classification ability of an algorithm under an arbitrary class distribution or any misclassification cost. In this study, we propose a network-based multi-biomarker identification method by AUC optimization (NetAUC), which integrates gene expression and network information to identify biomarkers for complex disease analysis. The main purpose is to optimize two objectives simultaneously: maximizing AUC and minimizing the number of selected features. We have applied NetAUC to two types of disease analysis: 1) prognosis of breast cancer, 2) classification of similar diseases. The results show that NetAUC can identify a small panel of disease-related biomarkers that have strong classification ability and functional interpretability.


Subjects
Algorithms , Breast Neoplasms , Area Under Curve , Biomarkers , Breast Neoplasms/diagnosis , Breast Neoplasms/genetics , Female , Humans , ROC Curve
19.
Nucleic Acids Res ; 49(17): e100, 2021 09 27.
Article in English | MEDLINE | ID: mdl-34214175

ABSTRACT

Numerous studies have shown that repetitive regions in genomes play indispensable roles in the evolution, inheritance and variation of living organisms. However, most existing methods cannot identify repeats satisfactorily in terms of both accuracy and size, since NGS reads are too short to identify long repeats, whereas SMS (Single Molecule Sequencing) long reads have high error rates. In this study, we present a novel identification framework, LongRepMarker, based on global de novo assembly and k-mer based multiple sequence alignment, for precisely marking long repeats in genomes. The major characteristics of LongRepMarker are as follows: (i) by introducing barcode linked reads and SMS long reads to assist the assembly of all short paired-end reads, it can identify the repeats to a greater extent; (ii) by finding the overlap sequences between assemblies or chromosomes, it locates the repeats faster and more accurately; (iii) by using the multi-alignment unique k-mers rather than the high-frequency k-mers to identify repeats in overlap sequences, it can obtain the repeats more comprehensively and stably; (iv) by applying the parallel alignment model based on the multi-alignment unique k-mers, the efficiency of data processing can be greatly improved; and (v) by applying the corresponding identification strategies, structural variations that occur between repeats can be identified. Comprehensive experimental results show that LongRepMarker achieves more satisfactory results than the existing de novo detection methods (https://github.com/BioinformaticsCSU/LongRepMarker).


Subjects
Algorithms , Computational Biology/methods , Genome/genetics , High-Throughput Nucleotide Sequencing/methods , Repetitive Sequences, Nucleic Acid/genetics , Sequence Analysis, DNA/methods , Animals , Base Sequence , Chromosome Mapping/methods , Computer Simulation , Databases, Genetic , Humans , Internet , Reproducibility of Results , Sequence Alignment/methods
20.
Bioinformatics ; 36(24): 5656-5664, 2021 Apr 05.
Article in English | MEDLINE | ID: mdl-33367690

ABSTRACT

MOTIVATION: Emerging studies indicate that circular RNAs (circRNAs) are widely involved in the progression of human diseases. Owing to their stable structure, circRNAs are promising diagnostic and prognostic biomarkers for diseases. However, the experimental verification of circRNA-disease associations is expensive and limited to small scale. Effective computational methods for predicting potential circRNA-disease associations are therefore urgently needed. Although several models have been proposed, over-reliance on known associations and the absence of biological function characteristics make precise prediction still challenging. RESULTS: In this study, we propose a method for predicting circRNA-disease associations based on sequence and ontology representations, named CDASOR, with convolutional and recurrent neural networks. For sequences of circRNAs, we encode them with continuous k-mers, obtain low-dimensional vectors of the k-mers, extract their local feature vectors with a 1D CNN and learn their long-term dependencies with bi-directional long short-term memory. For diseases, we serialize the disease ontology into sentences containing the hierarchy of the ontology, obtain low-dimensional vectors for disease ontology terms and learn the dependencies between terms. Furthermore, we learn association patterns of circRNAs and diseases from known circRNA-disease associations with neural networks. After the above steps, we obtain high-level representations of circRNAs and diseases, which are informative for improving the prediction. The experimental results show that CDASOR provides accurate predictions. By incorporating the characteristics of biological functions, CDASOR achieves impressive predictions in the de novo test. In addition, 6 of the top-10 predicted results are verified by the published literature in the case studies. AVAILABILITY AND IMPLEMENTATION: The code and data of CDASOR are freely available at https://github.com/BioinformaticsCSU/CDASOR.
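The first step described above, encoding a circRNA sequence with continuous k-mers before embedding, can be sketched as follows. This is a hedged illustration rather than the CDASOR code: it assumes "continuous k-mers" means overlapping windows of length k moved one base at a time, and the function names, the choice of k = 3, and the integer vocabulary are hypothetical.

```python
def kmer_tokens(seq, k=3):
    """Split a sequence into overlapping k-mers with a stride of one base."""
    seq = seq.upper()
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def build_vocab(token_lists):
    """Assign each distinct k-mer an integer id, in order of first appearance.

    Id 0 is reserved for padding and for k-mers not seen when building the
    vocabulary (a simplification made for this sketch).
    """
    vocab = {}
    for tokens in token_lists:
        for token in tokens:
            if token not in vocab:
                vocab[token] = len(vocab) + 1
    return vocab


def encode(seq, vocab, k=3):
    """Map a sequence to the list of integer ids of its k-mers."""
    return [vocab.get(token, 0) for token in kmer_tokens(seq, k)]


# Toy sequence, not taken from any dataset.
tokens = kmer_tokens("AUGGCUA", k=3)
vocab = build_vocab([tokens])
ids = encode("AUGGCUA", vocab)
```

For the toy sequence the tokens are `AUG`, `UGG`, `GGC`, `GCU`, `CUA`. The resulting integer ids are the kind of input an embedding layer would turn into low-dimensional vectors before the convolutional and recurrent layers.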
