Your browser doesn't support javascript.
loading
Mostrar: 20 | 50 | 100
Resultados 1 - 20 de 252
Filtrar
Más filtros

Bases de datos
País/Región como asunto
Tipo del documento
Intervalo de año de publicación
1.
Brief Bioinform ; 25(2)2024 Jan 22.
Artículo en Inglés | MEDLINE | ID: mdl-38517696

RESUMEN

With the rapid development of single-molecule sequencing (SMS) technologies, the output read length is continuously increasing. Mapping such reads onto a reference genome is one of the most fundamental tasks in sequence analysis. Mapping sensitivity is becoming a major concern since high sensitivity can detect more aligned regions on the reference and obtain more aligned bases, which are useful for downstream analysis. In this study, we present pathMap, a novel k-mer graph-based mapper that is specifically designed for mapping SMS reads with high sensitivity. By viewing the alignment chain as a path containing as many anchors as possible in the matched k-mer graph, pathMap treats chaining as a path selection problem in the directed graph. pathMap iteratively searches the longest path in the remaining nodes; more candidate chains with high quality can be effectively detected and aligned. Compared to other state-of-the-art mapping methods such as minimap2 and Winnowmap2, experiment results on simulated and real-life datasets demonstrate that pathMap obtains the number of mapped chains at least 11.50% more than its closest competitor and increases the mapping sensitivity by 17.28% and 13.84% of bases over the next-best mapper for Pacific Biosciences and Oxford Nanopore sequencing data, respectively. In addition, pathMap is more robust to sequence errors and more sensitive to species- and strain-specific identification of pathogens using MinION reads.


Asunto(s)
Secuenciación de Nucleótidos de Alto Rendimiento , Secuenciación de Nanoporos , Análisis de Secuencia de ADN/métodos , Secuenciación de Nucleótidos de Alto Rendimiento/métodos , Genoma , Programas Informáticos , Algoritmos
2.
J Cell Mol Med ; 28(17): e70046, 2024 Sep.
Artículo en Inglés | MEDLINE | ID: mdl-39228010

RESUMEN

PIWI-interacting RNAs (piRNAs) are a typical class of small non-coding RNAs, which are essential for gene regulation, genome stability and so on. Accumulating studies have revealed that piRNAs have significant potential as biomarkers and therapeutic targets for a variety of diseases. However current computational methods face the challenge in effectively capturing piRNA-disease associations (PDAs) from limited data. In this study, we propose a novel method, MRDPDA, for predicting PDAs based on limited data from multiple sources. Specifically, MRDPDA integrates a deep factorization machine (deepFM) model with regularizations derived from multiple yet limited datasets, utilizing separate Laplacians instead of a simple average similarity network. Moreover, a unified objective function to combine embedding loss about similarities is proposed to ensure that the embedding is suitable for the prediction task. In addition, a balanced benchmark dataset based on piRPheno is constructed and a deep autoencoder is applied for creating reliable negative set from the unlabeled dataset. Compared with three latest methods, MRDPDA achieves the best performance on the pirpheno dataset in terms of the five-fold cross validation test and independent test set, and case studies further demonstrate the effectiveness of MRDPDA.


Asunto(s)
Biología Computacional , ARN Interferente Pequeño , ARN Interferente Pequeño/genética , ARN Interferente Pequeño/metabolismo , Humanos , Biología Computacional/métodos , Algoritmos , Predisposición Genética a la Enfermedad , Aprendizaje Profundo , ARN de Interacción con Piwi
3.
Brief Bioinform ; 23(3)2022 05 13.
Artículo en Inglés | MEDLINE | ID: mdl-35323901

RESUMEN

MOTIVATION: MicroRNAs (miRNAs), as critical regulators, are involved in various fundamental and vital biological processes, and their abnormalities are closely related to human diseases. Predicting disease-related miRNAs is beneficial to uncovering new biomarkers for the prevention, detection, prognosis, diagnosis and treatment of complex diseases. RESULTS: In this study, we propose a multi-view Laplacian regularized deep factorization machine (DeepFM) model, MLRDFM, to predict novel miRNA-disease associations while improving the standard DeepFM. Specifically, MLRDFM improves DeepFM from two aspects: first, MLRDFM takes the relationships among items into consideration by regularizing their embedding features via their similarity-based Laplacians. In this study, miRNA Laplacian regularization integrates four types of miRNA similarity, while disease Laplacian regularization integrates two types of disease similarity. Second, to judiciously train our model, Laplacian eigenmaps are utilized to initialize the weights in the dense embedding layer. The experimental results on the latest HMDD v3.2 dataset show that MLRDFM improves the performance and reduces the overfitting phenomenon of DeepFM. Besides, MLRDFM is greatly superior to the state-of-the-art models in miRNA-disease association prediction in terms of different evaluation metrics with the 5-fold cross-validation. Furthermore, case studies further demonstrate the effectiveness of MLRDFM.


Asunto(s)
MicroARNs , Algoritmos , Biología Computacional/métodos , Predisposición Genética a la Enfermedad , Humanos , MicroARNs/genética
4.
Brief Bioinform ; 23(1)2022 01 17.
Artículo en Inglés | MEDLINE | ID: mdl-34864856

RESUMEN

Drug repositioning is proposed to find novel usages for existing drugs. Among many types of drug repositioning approaches, predicting drug-drug interactions (DDIs) helps explore the pharmacological functions of drugs and achieves potential drugs for novel treatments. A number of models have been applied to predict DDIs. The DDI network, which is constructed from the known DDIs, is a common part in many of the existing methods. However, the functions of DDIs are different, and thus integrating them in a single DDI graph may overlook some useful information. We propose a graph convolutional network with multi-kernel (GCNMK) to predict potential DDIs. GCNMK adopts two DDI graph kernels for the graph convolutional layers, namely, increased DDI graph consisting of 'increase'-related DDIs and decreased DDI graph consisting of 'decrease'-related DDIs. The learned drug features are fed into a block with three fully connected layers for the DDI prediction. We compare various types of drug features, whereas the target feature of drugs outperforms all other types of features and their concatenated features. In comparison with three different DDI prediction methods, our proposed GCNMK achieves the best performance in terms of area under receiver operating characteristic curve and area under precision-recall curve. In case studies, we identify the top 20 potential DDIs from all unknown DDIs, and the top 10 potential DDIs from the unknown DDIs among breast, colorectal and lung neoplasms-related drugs. Most of them have evidence to support the existence of their interactions. fangxiang.wu@usask.ca.


Asunto(s)
Algoritmos , Reposicionamiento de Medicamentos , Interacciones Farmacológicas , Curva ROC
5.
Brief Bioinform ; 23(3)2022 05 13.
Artículo en Inglés | MEDLINE | ID: mdl-35275996

RESUMEN

MOTIVATION: Identifying disease-related genes is an important issue in computational biology. Module structure widely exists in biomolecule networks, and complex diseases are usually thought to be caused by perturbations of local neighborhoods in the networks, which can provide useful insights for the study of disease-related genes. However, the mining and effective utilization of the module structure is still challenging in such issues as a disease gene prediction. RESULTS: We propose a hybrid disease-gene prediction method integrating multiscale module structure (HyMM), which can utilize multiscale information from local to global structure to more effectively predict disease-related genes. HyMM extracts module partitions from local to global scales by multiscale modularity optimization with exponential sampling, and estimates the disease relatedness of genes in partitions by the abundance of disease-related genes within modules. Then, a probabilistic model for integration of gene rankings is designed in order to integrate multiple predictions derived from multiscale module partitions and network propagation, and a parameter estimation strategy based on functional information is proposed to further enhance HyMM's predictive power. By a series of experiments, we reveal the importance of module partitions at different scales, and verify the stable and good performance of HyMM compared with eight other state-of-the-arts and its further performance improvement derived from the parameter estimation. CONCLUSIONS: The results confirm that HyMM is an effective framework for integrating multiscale module structure to enhance the ability to predict disease-related genes, which may provide useful insights for the study of the multiscale module structure and its application in such issues as a disease-gene prediction.


Asunto(s)
Algoritmos , Biología Computacional , Biología Computacional/métodos , Modelos Estadísticos , Proteínas
6.
Brief Bioinform ; 23(2)2022 03 10.
Artículo en Inglés | MEDLINE | ID: mdl-35136949

RESUMEN

In recent decades, exploring potential relationships between diseases has been an active research field. With the rapid accumulation of disease-related biomedical data, a lot of computational methods and tools/platforms have been developed to reveal intrinsic relationship between diseases, which can provide useful insights to the study of complex diseases, e.g. understanding molecular mechanisms of diseases and discovering new treatment of diseases. Human complex diseases involve both external phenotypic abnormalities and complex internal molecular mechanisms in organisms. Computational methods with different types of biomedical data from phenotype to genotype can evaluate disease-disease associations at different levels, providing a comprehensive perspective for understanding diseases. In this review, available biomedical data and databases for evaluating disease-disease associations are first summarized. Then, existing computational methods for disease-disease associations are reviewed and classified into five groups in terms of the usages of biomedical data, including disease semantic-based, phenotype-based, function-based, representation learning-based and text mining-based methods. Further, we summarize software tools/platforms for computation and analysis of disease-disease associations. Finally, we give a discussion and summary on the research of disease-disease associations. This review provides a systematic overview for current disease association research, which could promote the development and applications of computational methods and tools/platforms for disease-disease associations.


Asunto(s)
Biología Computacional , Minería de Datos , Biología Computacional/métodos , Minería de Datos/métodos , Bases de Datos Factuales , Fenotipo , Programas Informáticos
7.
Brief Bioinform ; 23(1)2022 01 17.
Artículo en Inglés | MEDLINE | ID: mdl-34498677

RESUMEN

Long non-coding RNAs (lncRNAs) are a class of RNA molecules with more than 200 nucleotides. A growing amount of evidence reveals that subcellular localization of lncRNAs can provide valuable insights into their biological functions. Existing computational methods for predicting lncRNA subcellular localization use k-mer features to encode lncRNA sequences. However, the sequence order information is lost by using only k-mer features. We proposed a deep learning framework, DeepLncLoc, to predict lncRNA subcellular localization. In DeepLncLoc, we introduced a new subsequence embedding method that keeps the order information of lncRNA sequences. The subsequence embedding method first divides a sequence into some consecutive subsequences and then extracts the patterns of each subsequence, last combines these patterns to obtain a complete representation of the lncRNA sequence. After that, a text convolutional neural network is employed to learn high-level features and perform the prediction task. Compared with traditional machine learning models, popular representation methods and existing predictors, DeepLncLoc achieved better performance, which shows that DeepLncLoc could effectively predict lncRNA subcellular localization. Our study not only presented a novel computational model for predicting lncRNA subcellular localization but also introduced a new subsequence embedding method which is expected to be applied in other sequence-based prediction tasks. The DeepLncLoc web server is freely accessible at http://bioinformatics.csu.edu.cn/DeepLncLoc/, and source code and datasets can be downloaded from https://github.com/CSUBioGroup/DeepLncLoc.


Asunto(s)
Aprendizaje Profundo , ARN Largo no Codificante , Biología Computacional/métodos , Redes Neurales de la Computación , ARN Largo no Codificante/genética , Programas Informáticos
8.
Brief Bioinform ; 23(1)2022 01 17.
Artículo en Inglés | MEDLINE | ID: mdl-34953465

RESUMEN

Alzheimer's disease (AD) has a strong genetic predisposition. However, its risk genes remain incompletely identified. We developed an Alzheimer's brain gene network-based approach to predict AD-associated genes by leveraging the functional pattern of known AD-associated genes. Our constructed network outperformed existing networks in predicting AD genes. We then systematically validated the predictions using independent genetic, transcriptomic, proteomic data, neuropathological and clinical data. First, top-ranked genes were enriched in AD-associated pathways. Second, using external gene expression data from the Mount Sinai Brain Bank study, we found that the top-ranked genes were significantly associated with neuropathological and clinical traits, including the Consortium to Establish a Registry for Alzheimer's Disease score, Braak stage score and clinical dementia rating. The analysis of Alzheimer's brain single-cell RNA-seq data revealed cell-type-specific association of predicted genes with early pathology of AD. Third, by interrogating proteomic data in the Religious Orders Study and Memory and Aging Project and Baltimore Longitudinal Study of Aging studies, we observed a significant association of protein expression level with cognitive function and AD clinical severity. The network, method and predictions could become a valuable resource to advance the identification of risk genes for AD.


Asunto(s)
Enfermedad de Alzheimer/genética , Enfermedad de Alzheimer/metabolismo , Encéfalo/metabolismo , Redes Reguladoras de Genes , Predisposición Genética a la Enfermedad , Envejecimiento/genética , Perfilación de la Expresión Génica , Humanos , Estudios Longitudinales , Memoria , Proteómica , RNA-Seq , Transcriptoma
9.
Bioinformatics ; 39(1)2023 01 01.
Artículo en Inglés | MEDLINE | ID: mdl-36458923

RESUMEN

MOTIVATION: Protein essentiality is usually accepted to be a conditional trait and strongly affected by cellular environments. However, existing computational methods often do not take such characteristics into account, preferring to incorporate all available data and train a general model for all cell lines. In addition, the lack of model interpretability limits further exploration and analysis of essential protein predictions. RESULTS: In this study, we proposed DeepCellEss, a sequence-based interpretable deep learning framework for cell line-specific essential protein predictions. DeepCellEss utilizes a convolutional neural network and bidirectional long short-term memory to learn short- and long-range latent information from protein sequences. Further, a multi-head self-attention mechanism is used to provide residue-level model interpretability. For model construction, we collected extremely large-scale benchmark datasets across 323 cell lines. Extensive computational experiments demonstrate that DeepCellEss yields effective prediction performance for different cell lines and outperforms existing sequence-based methods as well as network-based centrality measures. Finally, we conducted some case studies to illustrate the necessity of considering specific cell lines and the superiority of DeepCellEss. We believe that DeepCellEss can serve as a useful tool for predicting essential proteins across different cell lines. AVAILABILITY AND IMPLEMENTATION: The DeepCellEss web server is available at http://csuligroup.com:8000/DeepCellEss. The source code and data underlying this study can be obtained from https://github.com/CSUBioGroup/DeepCellEss. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.


Asunto(s)
Aprendizaje Profundo , Proteínas/metabolismo , Secuencia de Aminoácidos , Programas Informáticos , Línea Celular , Biología Computacional/métodos
10.
Bioinformatics ; 39(12)2023 12 01.
Artículo en Inglés | MEDLINE | ID: mdl-38058196

RESUMEN

MOTIVATION: Longer reads produced by PacBio or Oxford Nanopore sequencers could more frequently span the breakpoints of structural variations (SVs) than shorter reads. Therefore, existing long-read mapping methods often generate wrong alignments and variant calls. Compared to deletions and insertions, inversion events are more difficult to be detected since the anchors in inversion regions are nonlinear to those in SV-free regions. To address this issue, this study presents a novel long-read mapping algorithm (named as invMap). RESULTS: For each long noisy read, invMap first locates the aligned region with a specifically designed scoring method for chaining, then checks the remaining anchors in the aligned region to discover potential inversions. We benchmark invMap on simulated datasets across different genomes and sequencing coverages, experimental results demonstrate that invMap is more accurate to locate aligned regions and call SVs for inversions than the competing methods. The real human genome sequencing dataset of NA12878 illustrates that invMap can effectively find more candidate variant calls for inversions than the competing methods. AVAILABILITY AND IMPLEMENTATION: The invMap software is available at https://github.com/zhang134/invMap.git.


Asunto(s)
Genómica , Secuenciación de Nucleótidos de Alto Rendimiento , Humanos , Genómica/métodos , Secuenciación de Nucleótidos de Alto Rendimiento/métodos , Programas Informáticos , Algoritmos , Genoma Humano , Inversión Cromosómica , Análisis de Secuencia de ADN/métodos
11.
Methods ; 216: 21-38, 2023 08.
Artículo en Inglés | MEDLINE | ID: mdl-37315825

RESUMEN

Single-cell RNA-sequencing (scRNA-seq) data suffer from a lot of zeros. Such dropout events impede the downstream data analyses. We propose BayesImpute to infer and impute dropouts from the scRNA-seq data. Using the expression rate and coefficient of variation of the genes within the cell subpopulation, BayesImpute first determines likely dropouts, and then constructs the posterior distribution for each gene and uses the posterior mean to impute dropout values. Some simulated and real experiments show that BayesImpute can effectively identify dropout events and reduce the introduction of false positive signals. Additionally, BayesImpute successfully recovers the true expression levels of missing values, restores the gene-to-gene and cell-to-cell correlation coefficient, and maintains the biological information in bulk RNA-seq data. Furthermore, BayesImpute boosts the clustering and visualization of cell subpopulations and improves the identification of differentially expressed genes. We further demonstrate that, in comparison to other statistical-based imputation methods, BayesImpute is scalable and fast with minimal memory usage.


Asunto(s)
Análisis de Expresión Génica de una Sola Célula , Programas Informáticos , Análisis de Secuencia de ARN/métodos , Teorema de Bayes , Análisis de la Célula Individual/métodos , Probabilidad , Perfilación de la Expresión Génica
12.
Brief Bioinform ; 22(2): 1604-1619, 2021 03 22.
Artículo en Inglés | MEDLINE | ID: mdl-32043521

RESUMEN

Drug repositioning can drastically decrease the cost and duration taken by traditional drug research and development while avoiding the occurrence of unforeseen adverse events. With the rapid advancement of high-throughput technologies and the explosion of various biological data and medical data, computational drug repositioning methods have been appealing and powerful techniques to systematically identify potential drug-target interactions and drug-disease interactions. In this review, we first summarize the available biomedical data and public databases related to drugs, diseases and targets. Then, we discuss existing drug repositioning approaches and group them based on their underlying computational models consisting of classical machine learning, network propagation, matrix factorization and completion, and deep learning based models. We also comprehensively analyze common standard data sets and evaluation metrics used in drug repositioning, and give a brief comparison of various prediction methods on the gold standard data sets. Finally, we conclude our review with a brief discussion on challenges in computational drug repositioning, which includes the problem of reducing the noise and incompleteness of biomedical data, the ensemble of various computation drug repositioning methods, the importance of designing reliable negative samples selection methods, new techniques dealing with the data sparseness problem, the construction of large-scale and comprehensive benchmark data sets and the analysis and explanation of the underlying mechanisms of predicted interactions.


Asunto(s)
Simulación por Computador , Reposicionamiento de Medicamentos , Algoritmos , Teorema de Bayes , Análisis por Conglomerados , Biología Computacional/métodos , Interpretación Estadística de Datos , Aprendizaje Profundo , Reproducibilidad de los Resultados , Máquina de Vectores de Soporte
13.
Brief Bioinform ; 22(3)2021 05 20.
Artículo en Inglés | MEDLINE | ID: mdl-32427285

RESUMEN

Advances in sequencing technologies facilitate personalized disease-risk profiling and clinical diagnosis. In recent years, some great progress has been made in noninvasive diagnoses based on cell-free DNAs (cfDNAs). It exploits the fact that dead cells release DNA fragments into the circulation, and some DNA fragments carry information that indicates their tissues-of-origin (TOOs). Based on the signals used for identifying the TOOs of cfDNAs, the existing methods can be classified into three categories: cfDNA mutation-based methods, methylation pattern-based methods and cfDNA fragmentation pattern-based methods. In cfDNA mutation-based methods, the SNP information or the detected mutations in driven genes of certain diseases are employed to identify the TOOs of cfDNAs. Methylation pattern-based methods are developed to identify the TOOs of cfDNAs based on the tissue-specific methylation patterns. In cfDNA fragmentation pattern-based methods, cfDNA fragmentation patterns, such as nucleosome positioning or preferred end coordinates of cfDNAs, are used to predict the TOOs of cfDNAs. In this paper, the strategies and challenges in each category are reviewed. Furthermore, the representative applications based on the TOOs of cfDNAs, including noninvasive prenatal testing, noninvasive cancer screening, transplantation rejection monitoring and parasitic infection detection, are also reviewed. Moreover, the challenges and future work in identifying the TOOs of cfDNAs are discussed. Our research provides a comprehensive picture of the development and challenges in identifying the TOOs of cfDNAs, which may benefit bioinformatics researchers to develop new methods to improve the identification of the TOOs of cfDNAs.


Asunto(s)
Ácidos Nucleicos Libres de Células/genética , Neoplasias/diagnóstico , Biomarcadores de Tumor/genética , Ácidos Nucleicos Libres de Células/sangre , Metilación de ADN , Secuenciación de Nucleótidos de Alto Rendimiento/métodos , Humanos , Mutación , Neoplasias/genética
14.
Brief Bioinform ; 22(5)2021 09 02.
Artículo en Inglés | MEDLINE | ID: mdl-33822883

RESUMEN

The rapid increase of genome data brought by gene sequencing technologies poses a massive challenge to data processing. To solve the problems caused by enormous data and complex computing requirements, researchers have proposed many methods and tools which can be divided into three types: big data storage, efficient algorithm design and parallel computing. The purpose of this review is to investigate popular parallel programming technologies for genome sequence processing. Three common parallel computing models are introduced according to their hardware architectures, and each of which is classified into two or three types and is further analyzed with their features. Then, the parallel computing for genome sequence processing is discussed with four common applications: genome sequence alignment, single nucleotide polymorphism calling, genome sequence preprocessing, and pattern detection and searching. For each kind of application, its background is firstly introduced, and then a list of tools or algorithms are summarized in the aspects of principle, hardware platform and computing efficiency. The programming model of each hardware and application provides a reference for researchers to choose high-performance computing tools. Finally, we discuss the limitations and future trends of parallel computing technologies.


Asunto(s)
Procesamiento Automatizado de Datos/métodos , Genoma Humano , Genómica/métodos , Polimorfismo de Nucleótido Simple , Alineación de Secuencia/métodos , Algoritmos , Secuencia de Bases/genética , Mapeo Cromosómico/métodos , Secuenciación de Nucleótidos de Alto Rendimiento/métodos , Humanos , Almacenamiento y Recuperación de la Información , Programas Informáticos , Secuenciación Completa del Genoma/métodos
15.
Brief Bioinform ; 22(3)2021 05 20.
Artículo en Inglés | MEDLINE | ID: mdl-34020541

RESUMEN

Various microbes have proved to be closely related to the pathogenesis of human diseases. While many computational methods for predicting human microbe-disease associations (MDAs) have been developed, few systematic reviews on these methods have been reported. In this study, we provide a comprehensive overview of the existing methods. Firstly, we introduce the data used in existing MDA prediction methods. Secondly, we classify those methods into different categories by their nature and describe their algorithms and strategies in detail. Next, experimental evaluations are conducted on representative methods using different similarity data and calculation methods to compare their prediction performances. Based on the principles of computational methods and experimental results, we discuss the advantages and disadvantages of those methods and propose suggestions for the improvement of prediction performances. Considering the problems of the MDA prediction at present stage, we discuss future work from three perspectives including data, methods and formulations at the end.


Asunto(s)
Algoritmos , Simulación por Computador , Bases de Datos Factuales , Enfermedad , Microbiota , Modelos Biológicos , Biología Computacional , Humanos
16.
Brief Bioinform ; 22(4)2021 07 20.
Artículo en Inglés | MEDLINE | ID: mdl-33152756

RESUMEN

Drug similarities play an important role in modern biology and medicine, as they help scientists gain deep insights into drugs' therapeutic mechanisms and conduct wet labs that may significantly improve the efficiency of drug research and development. Nowadays, a number of drug-related databases have been constructed, with which many methods have been developed for computing similarities between drugs for studying associations between drugs, human diseases, proteins (drug targets) and more. In this review, firstly, we briefly introduce the publicly available drug-related databases. Secondly, based on different drug features, interaction relationships and multimodal data, we summarize similarity calculation methods in details. Then, we discuss the applications of drug similarities in various biological and medical areas. Finally, we evaluate drug similarity calculation methods with common evaluation metrics to illustrate the important roles of drug similarity measures on different applications.


Asunto(s)
Biología Computacional , Bases de Datos Farmacéuticas , Descubrimiento de Drogas , Reposicionamiento de Medicamentos , Preparaciones Farmacéuticas
17.
Brief Bioinform ; 22(2): 1729-1750, 2021 03 22.
Artículo en Inglés | MEDLINE | ID: mdl-32118252

RESUMEN

Proteins are dominant executors of living processes. Compared to genetic variations, changes in the molecular structure and state of a protein (i.e. proteoforms) are more directly related to pathological changes in diseases. Characterizing proteoforms involves identifying and locating primary structure alterations (PSAs) in proteoforms, which is of practical importance for the advancement of the medical profession. With the development of mass spectrometry (MS) technology, the characterization of proteoforms based on top-down MS technology has become possible. This type of method is relatively new and faces many challenges. Since the proteoform identification is the most important process in characterizing proteoforms, we comprehensively review the existing proteoform identification methods in this study. Before identifying proteoforms, the spectra need to be preprocessed, and protein sequence databases can be filtered to speed up the identification. Therefore, we also summarize some popular deconvolution algorithms, various filtering algorithms for improving the proteoform identification performance and various scoring methods for localizing proteoforms. Moreover, commonly used methods were evaluated and compared in this review. We believe our review could help researchers better understand the current state of the development in this field and design new efficient algorithms for the proteoform characterization.


Asunto(s)
Espectrometría de Masas/métodos , Proteínas/química , Algoritmos , Secuencia de Aminoácidos , Bases de Datos de Proteínas
18.
Bioinformatics ; 38(8): 2226-2234, 2022 04 12.
Artículo en Inglés | MEDLINE | ID: mdl-35150255

RESUMEN

MOTIVATION: Many studies have shown that microRNAs (miRNAs) play a key role in human diseases. Meanwhile, traditional experimental methods for miRNA-disease association identification are extremely costly, time-consuming and challenging. Therefore, many computational methods have been developed to predict potential associations between miRNAs and diseases. However, those methods mainly predict the existence of miRNA-disease associations, and they cannot predict the deep-level miRNA-disease association types. RESULTS: In this study, we propose a new end-to-end deep learning method (called PDMDA) to predict deep-level miRNA-disease associations with graph neural networks (GNNs) and miRNA sequence features. Based on the sequence and structural features of miRNAs, PDMDA extracts the miRNA feature representations by a fully connected network (FCN). The disease feature representations are extracted from the disease-gene network and gene-gene interaction network by GNN model. Finally, a multilayer with three fully connected layers and a softmax layer is designed to predict the final miRNA-disease association scores based on the concatenated feature representations of miRNAs and diseases. Note that PDMDA does not take the miRNA-disease association matrix as input to compute the Gaussian interaction profile similarity. We conduct three experiments based on six association type samples (including circulations, epigenetics, target, genetics, known association of which their types are unknown and unknown association samples). We conduct fivefold cross-validation validation to assess the prediction performance of PDMDA. The area under the receiver operating characteristic curve scores is used as metric. The experiment results show that PDMDA can accurately predict the deep-level miRNA-disease associations. AVAILABILITY AND IMPLEMENTATION: Data and source codes are available at https://github.com/27167199/PDMDA.


Asunto(s)
MicroARNs , Humanos , MicroARNs/genética , Algoritmos , Biología Computacional/métodos , Redes Neurales de la Computación , Programas Informáticos
19.
Methods ; 198: 56-64, 2022 02.
Artículo en Inglés | MEDLINE | ID: mdl-34364986

RESUMEN

Complex diseases are caused by a variety of factors, and their diagnosis, treatment and prognosis are usually difficult. Proteins play an indispensable role in living organisms and perform specific biological functions by interacting with other proteins or biomolecules, their dysfunction may lead to diseases, it is a natural way to mine disease-related biomarkers from protein-protein interaction network. AUC, the area under the receiver operating characteristics (ROC) curve, is regarded as a gold standard to evaluate the effectiveness of a binary classifier, which measures the classification ability of an algorithm under arbitrary distribution or any misclassification cost. In this study, we have proposed a network-based multi-biomarker identification method by AUC optimization (NetAUC), which integrates gene expression and the network information to identify biomarkers for the complex disease analysis. The main purpose is to optimize two objectives simultaneously: maximizing AUC and minimizing the number of selected features. We have applied NetAUC to two types of disease analysis: 1) prognosis of breast cancer, 2) classification of similar diseases. The results show that NetAUC can identify a small panel of disease-related biomarkers which have the powerful classification ability and the functional interpretability.


Asunto(s)
Algoritmos , Neoplasias de la Mama , Área Bajo la Curva , Biomarcadores , Neoplasias de la Mama/diagnóstico , Neoplasias de la Mama/genética , Femenino , Humanos , Curva ROC
20.
Nucleic Acids Res ; 49(17): e100, 2021 09 27.
Artículo en Inglés | MEDLINE | ID: mdl-34214175

RESUMEN

Numerous studies have shown that repetitive regions in genomes play indispensable roles in the evolution, inheritance and variation of living organisms. However, most existing methods cannot achieve satisfactory performance on identifying repeats in terms of both accuracy and size, since NGS reads are too short to identify long repeats whereas SMS (Single Molecule Sequencing) long reads are with high error rates. In this study, we present a novel identification framework, LongRepMarker, based on the global de novo assembly and k-mer based multiple sequence alignment for precisely marking long repeats in genomes. The major characteristics of LongRepMarker are as follows: (i) by introducing barcode linked reads and SMS long reads to assist the assembly of all short paired-end reads, it can identify the repeats to a greater extent; (ii) by finding the overlap sequences between assemblies or chomosomes, it locates the repeats faster and more accurately; (iii) by using the multi-alignment unique k-mers rather than the high frequency k-mers to identify repeats in overlap sequences, it can obtain the repeats more comprehensively and stably; (iv) by applying the parallel alignment model based on the multi-alignment unique k-mers, the efficiency of data processing can be greatly optimized and (v) by taking the corresponding identification strategies, structural variations that occur between repeats can be identified. Comprehensive experimental results show that LongRepMarker can achieve more satisfactory results than the existing de novo detection methods (https://github.com/BioinformaticsCSU/LongRepMarker).


Asunto(s)
Algoritmos , Biología Computacional/métodos , Genoma/genética , Secuenciación de Nucleótidos de Alto Rendimiento/métodos , Secuencias Repetitivas de Ácidos Nucleicos/genética , Análisis de Secuencia de ADN/métodos , Animales , Secuencia de Bases , Mapeo Cromosómico/métodos , Simulación por Computador , Bases de Datos Genéticas , Humanos , Internet , Reproducibilidad de los Resultados , Alineación de Secuencia/métodos
SELECCIÓN DE REFERENCIAS
DETALLE DE LA BÚSQUEDA