Your browser doesn't support javascript.
loading
Mostrar: 20 | 50 | 100
Resultados 1 - 10 de 10
Filtrar
Mais filtros

Base de dados
Tipo de documento
Intervalo de ano de publicação
1.
BMC Bioinformatics ; 19(Suppl 13): 554, 2019 Feb 04.
Artigo em Inglês | MEDLINE | ID: mdl-30717666

RESUMO

BACKGROUND: In silico prediction of potential drug side-effects is of crucial importance for drug development, since wet experimental identification of drug side-effects is expensive and time-consuming. Existing computational methods mainly focus on leveraging validated drug side-effect relations for the prediction. The performance is severely impeded by the lack of reliable negative training data. Thus, a method to select reliable negative samples becomes vital in the performance improvement. METHODS: Most of the existing computational prediction methods are essentially based on the assumption that similar drugs are inclined to share the same side-effects, which has given rise to remarkable performance. It is also rational to assume an inverse proposition that dissimilar drugs are less likely to share the same side-effects. Based on this inverse similarity hypothesis, we proposed a novel method to select highly-reliable negative samples for side-effect prediction. The first step of our method is to build a drug similarity integration framework to measure the similarity between drugs from different perspectives. This step integrates drug chemical structures, drug target proteins, drug substituents, and drug therapeutic information as features into a unified framework. Then, a similarity score between each candidate negative drug and validated positive drugs is calculated using the similarity integration framework. Those candidate negative drugs with lower similarity scores are preferentially selected as negative samples. Finally, both the validated positive drugs and the selected highly-reliable negative samples are used for predictions. RESULTS: The performance of the proposed method was evaluated on simulative side-effect prediction of 917 DrugBank drugs, comparing with four machine-learning algorithms. Extensive experiments show that the drug similarity integration framework has superior capability in capturing drug features, achieving much better performance than those based on a single type of drug property. Besides, the four machine-learning algorithms achieved significant improvement in macro-averaging F1-score (e.g., SVM from 0.655 to 0.898), macro-averaging precision (e.g., RBF from 0.592 to 0.828) and macro-averaging recall (e.g., KNN from 0.651 to 0.772) complimentarily attributed to the highly-reliable negative samples selected by the proposed method. CONCLUSIONS: The results suggest that the inverse similarity hypothesis and the integration of different drug properties are valuable for side-effect prediction. The selection of highly-reliable negative samples can also make significant contributions to the performance improvement.


Assuntos
Biologia Computacional/métodos , Efeitos Colaterais e Reações Adversas Relacionados a Medicamentos/diagnóstico , Algoritmos , Bases de Dados como Assunto , Humanos , Aprendizado de Máquina , Proteínas/química
2.
BMC Genomics ; 20(Suppl 9): 943, 2019 Dec 24.
Artigo em Inglês | MEDLINE | ID: mdl-31874629

RESUMO

BACKGROUND: A long noncoding RNA (lncRNA) can act as a competing endogenous RNA (ceRNA) to compete with an mRNA for binding to the same miRNA. Such an interplay between the lncRNA, miRNA, and mRNA is called a ceRNA crosstalk. As an miRNA may have multiple lncRNA targets and multiple mRNA targets, connecting all the ceRNA crosstalks mediated by the same miRNA forms a ceRNA network. Methods have been developed to construct ceRNA networks in the literature. However, these methods have limits because they have not explored the expression characteristics of total RNAs. RESULTS: We proposed a novel method for constructing ceRNA networks and applied it to a paired RNA-seq data set. The first step of the method takes a competition regulation mechanism to derive candidate ceRNA crosstalks. Second, the method combines a competition rule and pointwise mutual information to compute a competition score for each candidate ceRNA crosstalk. Then, ceRNA crosstalks which have significant competition scores are selected to construct the ceRNA network. The key idea, pointwise mutual information, is ideally suitable for measuring the complex point-to-point relationships embedded in the ceRNA networks. CONCLUSION: Computational experiments and results demonstrate that the ceRNA networks can capture important regulatory mechanism of breast cancer, and have also revealed new insights into the treatment of breast cancer. The proposed method can be directly applied to other RNA-seq data sets for deeper disease understanding.


Assuntos
MicroRNAs/metabolismo , RNA Longo não Codificante/metabolismo , RNA Mensageiro/metabolismo , RNA-Seq , Neoplasias da Mama/genética , Neoplasias da Mama/metabolismo , Neoplasias da Mama/terapia , Feminino , Humanos
3.
BMC Genomics ; 19(1): 574, 2018 Aug 01.
Artigo em Inglês | MEDLINE | ID: mdl-30068294

RESUMO

BACKGROUND: N6-methyladenosine (m6A) is an important epigenetic modification which plays various roles in mRNA metabolism and embryogenesis directly related to human diseases. To identify m6A in a large scale, machine learning methods have been developed to make predictions on m6A sites. However, there are two main drawbacks of these methods. The first is the inadequate learning of the imbalanced m6A samples which are much less than the non-m6A samples, by their balanced learning approaches. Second, the features used by these methods are not outstanding to represent m6A sequence characteristics. RESULTS: We propose to use cost-sensitive learning ideas to resolve the imbalance data issues in the human mRNA m6A prediction problem. This cost-sensitive approach applies to the entire imbalanced dataset, without random equal-size selection of negative samples, for an adequate learning. Along with site location and entropy features, top-ranked positions with the highest single nucleotide polymorphism specificity in the window sequences are taken as new features in our imbalance learning. On an independent dataset, our overall prediction performance is much superior to the existing predictors. Our method shows stronger robustness against the imbalance changes in the tests on 9 datasets whose imbalance ratios range from 1:1 to 9:1. Our method also outperforms the existing predictors on 1226 individual transcripts. It is found that the new types of features are indeed of high significance in the m6A prediction. The case studies on gene c-Jun and CBFB demonstrate the detailed prediction capacity to improve the prediction performance. CONCLUSION: The proposed cost-sensitive model and the new features are useful in human mRNA m6A prediction. Our method achieves better correctness and robustness than the existing predictors in independent test and case studies. The results suggest that imbalance learning is promising to improve the performance of m6A prediction.


Assuntos
Adenosina/análogos & derivados , Biologia Computacional/métodos , RNA Mensageiro/química , Adenosina/análise , Algoritmos , Biologia Computacional/economia , Humanos , Aprendizado de Máquina
4.
BMC Bioinformatics ; 18(1): 193, 2017 Mar 24.
Artigo em Inglês | MEDLINE | ID: mdl-28340554

RESUMO

BACKGROUND: MicroRNAs always function cooperatively in their regulation of gene expression. Dysfunctions of these co-functional microRNAs can play significant roles in disease development. We are interested in those multi-disease associated co-functional microRNAs that regulate their common dysfunctional target genes cooperatively in the development of multiple diseases. The research is potentially useful for human disease studies at the transcriptional level and for the study of multi-purpose microRNA therapeutics. METHODS AND RESULTS: We designed a computational method to detect multi-disease associated co-functional microRNA pairs and conducted cross disease analysis on a reconstructed disease-gene-microRNA (DGR) tripartite network. The construction of the DGR tripartite network is by the integration of newly predicted disease-microRNA associations with those relationships of diseases, microRNAs and genes maintained by existing databases. The prediction method uses a set of reliable negative samples of disease-microRNA association and a pre-computed kernel matrix instead of kernel functions. From this reconstructed DGR tripartite network, multi-disease associated co-functional microRNA pairs are detected together with their common dysfunctional target genes and ranked by a novel scoring method. We also conducted proof-of-concept case studies on cancer-related co-functional microRNA pairs as well as on non-cancer disease-related microRNA pairs. CONCLUSIONS: With the prioritization of the co-functional microRNAs that relate to a series of diseases, we found that the co-function phenomenon is not unusual. We also confirmed that the regulation of the microRNAs for the development of cancers is more complex and have more unique properties than those of non-cancer diseases.


Assuntos
Biologia Computacional/métodos , MicroRNAs/genética , Humanos
5.
BMC Bioinformatics ; 17(Suppl 19): 507, 2016 Dec 22.
Artigo em Inglês | MEDLINE | ID: mdl-28155659

RESUMO

BACKGROUND: Regulation mechanisms between miRNAs and genes are complicated. To accomplish a biological function, a miRNA may regulate multiple target genes, and similarly a target gene may be regulated by multiple miRNAs. Wet-lab knowledge of co-regulating miRNAs is limited. This work introduces a computational method to group miRNAs of similar functions to identify co-regulating miRNAsfrom a similarity matrix of miRNAs. RESULTS: We define a novel information content of gene ontology (GO) to measure similarity between two sets of GO graphs corresponding to the two sets of target genes of two miRNAs. This between-graph similarity is then transferred as a functional similarity between the two miRNAs. Our definition of the information content is based on the size of a GO term's descendants, but adjusted by a weight derived from its depth level and the GO relationships at its path to the root node or to the most informative common ancestor (MICA). Further, a self-tuning technique and the eigenvalues of the normalized Laplacian matrix are applied to determine the optimal parameters for the spectral clustering of the similarity matrix of the miRNAs. CONCLUSIONS: Experimental results demonstrate that our method has better clustering performance than the existing edge-based, node-based or hybrid methods. Our method has also demonstrated a novel usefulness for the function annotation of new miRNAs, as reported in the detailed case studies.


Assuntos
Biologia Computacional/métodos , Perfilação da Expressão Gênica , Ontologia Genética , MicroRNAs/genética , Modelos Estatísticos , Algoritmos , Análise por Conglomerados , Humanos
6.
BMC Med Genomics ; 11(Suppl 6): 118, 2018 Dec 31.
Artigo em Inglês | MEDLINE | ID: mdl-30598116

RESUMO

BACKGROUND: Gene expression-based profiling has been used to identify biomarkers for different breast cancer subtypes. However, this technique has many limitations. IsomiRs are isoforms of miRNAs that have critical roles in many biological processes and have been successfully used to distinguish various cancer types. Biomarker isomiRs for identifying different breast cancer subtypes has not been investigated. For the first time, we aim to show that isomiRs are better performing biomarkers and use them to explain molecular differences between breast cancer subtypes. RESULTS: In this study, a novel method is proposed to identify specific isomiRs that faithfully classify breast cancer subtypes. First, as a null hypothesis method we removed the lowly expressed isomiRs from small sequencing data generated from diverse breast cancers types. Second, we developed an improved mutual information-based feature selection method to calculate the weight of each isomiR expression. The weight of isomiR measures the importance of a given isomiR in classifying breast cancer subtypes. The improved mutual information enables to apply the dataset in which the feature is continuous data and label is discrete data; whereby, the traditional mutual information cannot be applied in this dataset. Finally, the support vector machine (SVM) classifier is applied to find isomiR biomarkers for subtyping. CONCLUSIONS: Here we demonstrate that isomiRs can be used as biomarkers in the identification of different breast cancer subtypes, and in addition, they may provide new insights into the diverse molecular mechanisms of breast cancers. We have also shown that the classification of different subtypes of breast cancer based on isomiRs expression is more effective than using published gene expression profiling. The proposed method provides a better performance outcome than Fisher method and Hellinger method for discovering biomarkers to distinguish different breast cancer subtypes. This novel technique could be directly applied to identify biomarkers in other diseases.


Assuntos
Neoplasias da Mama/classificação , MicroRNAs , RNA Neoplásico , Biomarcadores Tumorais , Neoplasias da Mama/genética , Conjuntos de Dados como Assunto , Humanos , MicroRNAs/genética , Isoformas de RNA
7.
Artigo em Inglês | MEDLINE | ID: mdl-29990239

RESUMO

The latest sequencing technologies such as the Pacific Biosciences (PacBio) and Oxford Nanopore machines can generate long reads at the length of thousands of nucleic bases which is much longer than the reads at the length of hundreds generated by Illumina machines. However, these long reads are prone to much higher error rates, for example 15%, making downstream analysis and applications very difficult. Error correction is a process to improve the quality of sequencing data. Hybrid correction strategies have been recently proposed to combine Illumina reads of low error rates to fix sequencing errors in the noisy long reads with good performance. In this paper, we propose a new method named Bicolor, a bi-level framework of hybrid error correction for further improving the quality of PacBio long reads. At the first level, our method uses a de Bruijn graph-based error correction idea to search paths in pairs of solid -mers iteratively with an increasing length of -mer. At the second level, we combine the processed results under different parameters from the first level. In particular, a multiple sequence alignment algorithm is used to align those similar long reads, followed by a voting algorithm which determines the final base at each position of the reads. We compare the superior performance of Bicolor with three state-of-the-art methods on three real data sets. Results demonstrate that Bicolor always achieves the highest identity ratio. Bicolor also achieves a higher alignment ratio () and a higher number of aligned reads than the current methods on two data sets. On the third data set, our method is closely competitive to the current methods in terms of number of aligned reads and genome coverage. The C++ source codes of our algorithm are freely available at https://github.com/yuansliu/Bicolor.

8.
IEEE/ACM Trans Comput Biol Bioinform ; 14(5): 1134-1146, 2017.
Artigo em Inglês | MEDLINE | ID: mdl-28026781

RESUMO

Frequently recurring RNA structural motifs play important roles in RNA folding process and interaction with other molecules. Traditional index-based and shape-based schemas are useful in modeling RNA secondary structures but ignore the structural discrepancy of individual RNA family member. Further, the in-depth analysis of underlying substructure pattern is insufficient due to varied and unnormalized substructure data. This prevents us from understanding RNAs functions and their inherent synergistic regulation networks. This article thus proposes a novel labeled graph-based algorithm RnaGraph to uncover frequently RNA substructure patterns. Attribute data and graph data are combined to characterize diverse substructures and their correlations, respectively. Further, a top-k graph pattern mining algorithm is developed to extract interesting substructure motifs by integrating frequency and similarity. The experimental results show that our methods assist in not only modelling complex RNA secondary structures but also identifying hidden but interesting RNA substructure patterns.


Assuntos
Biologia Computacional/métodos , Conformação de Ácido Nucleico , Reconhecimento Automatizado de Padrão/métodos , RNA/química , Análise de Sequência de RNA/métodos , Algoritmos , Sequência Consenso , Mineração de Dados , Humanos , RNA/genética , RNA/metabolismo
9.
Oncotarget ; 8(45): 78901-78916, 2017 Oct 03.
Artigo em Inglês | MEDLINE | ID: mdl-29108274

RESUMO

Disease-related protein-coding genes have been widely studied, but disease-related non-coding genes remain largely unknown. This work introduces a new vector to represent diseases, and applies the newly vectorized data for a positive-unlabeled learning algorithm to predict and rank disease-related long non-coding RNA (lncRNA) genes. This novel vector representation for diseases consists of two sub-vectors, one is composed of 45 elements, characterizing the information entropies of the disease genes distribution over 45 chromosome substructures. This idea is supported by our observation that some substructures (e.g., the chromosome 6 p-arm) are highly preferred by disease-related protein coding genes, while some (e.g., the 21 p-arm) are not favored at all. The second sub-vector is 30-dimensional, characterizing the distribution of disease gene enriched KEGG pathways in comparison with our manually created pathway groups. The second sub-vector complements with the first one to differentiate between various diseases. Our prediction method outperforms the state-of-the-art methods on benchmark datasets for prioritizing disease related lncRNA genes. The method also works well when only the sequence information of an lncRNA gene is known, or even when a given disease has no currently recognized long non-coding genes.

10.
Brief Funct Genomics ; 16(6): 361-378, 2017 Nov 01.
Artigo em Inglês | MEDLINE | ID: mdl-28453648

RESUMO

The application of advanced sequencing technologies and the rapid growth of various sequence data have led to increasing interest in DNA sequence assembly. However, repeats and polymorphism occur frequently in genomes, and each of these has different impacts on assembly. Further, many new applications for sequencing, such as metagenomics regarding multiple species, have emerged in recent years. These not only give rise to higher complexity but also prevent short-read assembly in an efficient way. This article reviews the theoretical foundations that underlie current mapping-based assembly and de novo-based assembly, and highlights the key issues and feasible solutions that need to be considered. It focuses on how individual processes, such as optimal k-mer determination and error correction in assembly, rely on intelligent strategies or high-performance computation. We also survey primary algorithms/software and offer a discussion on the emerging challenges in assembly.


Assuntos
DNA/genética , Análise de Sequência de DNA/métodos , Algoritmos , Gráficos por Computador , Metagenômica , Polimorfismo de Nucleotídeo Único , Software
SELEÇÃO DE REFERÊNCIAS
DETALHE DA PESQUISA