Your browser doesn't support javascript.
loading
Mostrar: 20 | 50 | 100
Resultados 1 - 18 de 18
Filtrar
1.
Brief Bioinform ; 25(1)2023 11 22.
Artigo em Inglês | MEDLINE | ID: mdl-38145947

RESUMO

Determining the RNA binding preferences remains challenging because of the bottleneck of the binding interactions accompanied by subtle RNA flexibility. Typically, designing RNA inhibitors involves screening thousands of potential candidates for binding. Accurate binding site information can increase the number of successful hits even with few candidates. There are two main issues regarding RNA binding preference: binding site prediction and binding dynamical behavior prediction. Here, we propose one interpretable network-based approach, RNet, to acquire precise binding site and binding dynamical behavior information. RNetsite employs a machine learning-based network decomposition algorithm to predict RNA binding sites by analyzing the local and global network properties. Our research focuses on large RNAs with 3D structures without considering smaller regulatory RNAs, which are too small and dynamic. Our study shows that RNetsite outperforms existing methods, achieving precision values as high as 0.701 on TE18 and 0.788 on RB9 tests. In addition, RNetsite demonstrates remarkable robustness regarding perturbations in RNA structures. We also developed RNetdyn, a distance-based dynamical graph algorithm, to characterize the interface dynamical behavior consequences upon inhibitor binding. The simulation testing of competitive inhibitors indicates that RNetdyn outperforms the traditional method by 30%. The benchmark testing results demonstrate that RNet is highly accurate and robust. Our interpretable network algorithms can assist in predicting RNA binding preferences and accelerating RNA inhibitor design, providing valuable insights to the RNA research community.


Assuntos
Biologia Computacional , Proteínas de Ligação a RNA , Biologia Computacional/métodos , Proteínas de Ligação a RNA/metabolismo , Algoritmos , Sítios de Ligação , RNA/metabolismo
2.
Brief Bioinform ; 18(1): 43-56, 2017 01.
Artigo em Inglês | MEDLINE | ID: mdl-26822099

RESUMO

Untargeted metabolomics makes it possible to identify compounds that undergo significant changes in concentration in different experimental conditions. The resulting metabolomic profile characterizes the perturbation concerned, but does not explain the underlying biochemical mechanisms. Bioinformatics methods make it possible to interpret results in light of the whole metabolism. This knowledge is modelled into a network, which can be mined using algorithms that originate in graph theory. These algorithms can extract sub-networks related to the compounds identified. Several attempts have been made to adapt them to obtain more biologically meaningful results. However, there is still no consensus on this kind of analysis of metabolic networks. This review presents the main graph approaches used to interpret metabolomic data using metabolic networks. Their advantages and drawbacks are discussed, and the impacts of their parameters are emphasized. We also provide some guidelines for relevant sub-network extraction and also suggest a range of applications for most methods.


Assuntos
Metabolômica , Algoritmos , Biologia Computacional , Redes e Vias Metabólicas
3.
Proteins ; 84(4): 467-72, 2016 Apr.
Artigo em Inglês | MEDLINE | ID: mdl-26800480

RESUMO

Protein structure prediction, when construed as a fold recognition problem, is one of the most important applications of similarity search in bioinformatics. A new protein-fold recognition method is reported which combines a single-source K diverse shortest path (SSKDSP) algorithm with Enrichment of Network Topological Similarity (ENTS) algorithm to search a graphic feature space generated using sequence similarity and structural similarity metrics. A modified, more efficient SSKDSP algorithm is developed to improve the performance of graph searching. The new implementation of the SSKDSP algorithm empirically requires 82% less memory and 61% less time than the current implementation, allowing for the analysis of larger, denser graphs. Furthermore, the statistical significance of fold ranking generated from SSKDSP is assessed using ENTS. The reported ENTS-SSKDSP algorithm outperforms original ENTS that uses random walk with restart for the graph search as well as other state-of-the-art protein structure prediction algorithms HHSearch and Sparks-X, as evaluated by a benchmark of 600 query proteins. The reported methods may easily be extended to other similarity search problems in bioinformatics and chemoinformatics. The SSKDSP software is available at http://compsci.hunter.cuny.edu/~leixie/sskdsp.html.


Assuntos
Algoritmos , Biologia Computacional/estatística & dados numéricos , Proteínas/química , Software , Sequência de Aminoácidos , Benchmarking , Biologia Computacional/métodos , Bases de Dados de Proteínas , Conformação Proteica , Dobramento de Proteína , Alinhamento de Sequência , Análise de Sequência de Proteína , Homologia de Sequência de Aminoácidos
4.
MethodsX ; 12: 102517, 2024 Jun.
Artigo em Inglês | MEDLINE | ID: mdl-38192360

RESUMO

An ordered pair Σ=(Σu,σ) is called the signed graph, where Σu=(V,E) is an underlying graph and σ is a signed mapping, called signature, from E to the sign set {+,-}. A marking of Σ is a function µ:V(Σ)→{+,-}. The canonical marking of a signed graph Σ, denoted µσ, is given as µσ(v)=Πvu∈E(Σ)σ(vu). The canonical splitting signed graphξ(Σ) of a signed graph Σ is defined as a signed graph ξ(Σ)=(V(ξ),E(ξ)) , with V(ξ)=V(Σ)∪V', where V'  is copy of a vertex set in V(Σ) s.t. for each vertex u∈V(Σ), take a new vertex u' and E(ξ) is defined as, join u' to all the vertices of Σ adjacent to u by negative edge if µσ(u)=µσ(v)=-, where v∈N(u) and by positive edge otherwise. The objective of this paper is to propose an algorithm for the generation of a canonical splitting signed graph, a splitting root signed graph from a given signed graph, provided it exists and to give the characterization of balanced canonical splitting signed graph. Additionally, we conduct a spectral analysis of the resulting graph. Spectral analysis is performed on the adjacency and Laplacian matrices of the canonical splitting signed graph to study its eigenvalues and eigenvectors. A relationship between the energy of the original signed graph Σ and the energy of the canonical splitting signed graph ξ(Σ) is established. •Algorithm to generate canonical splitting signed graph ξ(Σ).•Spectral Analysis is performed for both adjacency and Laplacian matrices of canonical splitting signed graph ξ(Σ).

5.
PeerJ Comput Sci ; 9: e1333, 2023.
Artigo em Inglês | MEDLINE | ID: mdl-37346701

RESUMO

Background: COVID-19 is an infectious disease caused by SARS-CoV-2. The symptoms of COVID-19 vary from mild-to-moderate respiratory illnesses, and it sometimes requires urgent medication. Therefore, it is crucial to detect COVID-19 at an early stage through specific clinical tests, testing kits, and medical devices. However, these tests are not always available during the time of the pandemic. Therefore, this study developed an automatic, intelligent, rapid, and real-time diagnostic model for the early detection of COVID-19 based on its symptoms. Methods: The COVID-19 knowledge graph (KG) constructed based on literature from heterogeneous data is imported to understand the COVID-19 different relations. We added human disease ontology to the COVID-19 KG and applied a node-embedding graph algorithm called fast random projection to extract an extra feature from the COVID-19 dataset. Subsequently, experiments were conducted using two machine learning (ML) pipelines to predict COVID-19 infection from its symptoms. Additionally, automatic tuning of the model hyperparameters was adopted. Results: We compared two graph-based ML models, logistic regression (LR) and random forest (RF) models. The proposed graph-based RF model achieved a small error rate = 0.0064 and the best scores on all performance metrics, including specificity = 98.71%, accuracy = 99.36%, precision = 99.65%, recall = 99.53%, and F1-score = 99.59%. Furthermore, the Matthews correlation coefficient achieved by the RF model was higher than that of the LR model. Comparative analysis with other ML algorithms and with studies from the literature showed that the proposed RF model exhibited the best detection accuracy. Conclusion: The graph-based RF model registered high performance in classifying the symptoms of COVID-19 infection, thereby indicating that the graph data science, in conjunction with ML techniques, helps improve performance and accelerate innovations.

6.
Front Med (Lausanne) ; 10: 1081087, 2023.
Artigo em Inglês | MEDLINE | ID: mdl-37250641

RESUMO

Introduction: Early diagnosis of Parkinson's disease (PD) is important to identify treatments to slow neurodegeneration. People who develop PD often have symptoms before the disease manifests and may be coded as diagnoses in the electronic health record (EHR). Methods: To predict PD diagnosis, we embedded EHR data of patients onto a biomedical knowledge graph called Scalable Precision medicine Open Knowledge Engine (SPOKE) and created patient embedding vectors. We trained and validated a classifier using these vectors from 3,004 PD patients, restricting records to 1, 3, and 5 years before diagnosis, and 457,197 non-PD group. Results: The classifier predicted PD diagnosis with moderate accuracy (AUC = 0.77 ± 0.06, 0.74 ± 0.05, 0.72 ± 0.05 at 1, 3, and 5 years) and performed better than other benchmark methods. Nodes in the SPOKE graph, among cases, revealed novel associations, while SPOKE patient vectors revealed the basis for individual risk classification. Discussion: The proposed method was able to explain the clinical predictions using the knowledge graph, thereby making the predictions clinically interpretable. Through enriching EHR data with biomedical associations, SPOKE may be a cost-efficient and personalized way to predict PD diagnosis years before its occurrence.

7.
PeerJ Comput Sci ; 9: e1291, 2023.
Artigo em Inglês | MEDLINE | ID: mdl-37346513

RESUMO

The detection of communities in graph datasets provides insight about a graph's underlying structure and is an important tool for various domains such as social sciences, marketing, traffic forecast, and drug discovery. While most existing algorithms provide fast approaches for community detection, their results usually contain strictly separated communities. However, most datasets would semantically allow for or even require overlapping communities that can only be determined at much higher computational cost. We build on an efficient algorithm, Fox, that detects such overlapping communities. Fox measures the closeness of a node to a community by approximating the count of triangles which that node forms with that community. We propose LazyFox, a multi-threaded adaptation of the Fox algorithm, which provides even faster detection without an impact on community quality. This allows for the analyses of significantly larger and more complex datasets. LazyFox enables overlapping community detection on complex graph datasets with millions of nodes and billions of edges in days instead of weeks. As part of this work, LazyFox's implementation was published and is available as a tool under an MIT licence at https://github.com/TimGarrels/LazyFox.

8.
Genome Biol ; 24(1): 136, 2023 Jun 09.
Artigo em Inglês | MEDLINE | ID: mdl-37296461

RESUMO

We propose a polynomial algorithm computing a minimum plain-text representation of k-mer sets, as well as an efficient near-minimum greedy heuristic. When compressing read sets of large model organisms or bacterial pangenomes, with only a minor runtime increase, we shrink the representation by up to 59% over unitigs and 26% over previous work. Additionally, the number of strings is decreased by up to 97% over unitigs and 90% over previous work. Finally, a small representation has advantages in downstream applications, as it speeds up SSHash-Lite queries by up to 4.26× over unitigs and 2.10× over previous work.


Assuntos
Algoritmos , Software , Análise de Sequência de DNA , Bactérias
9.
Genes (Basel) ; 12(2)2021 01 27.
Artigo em Inglês | MEDLINE | ID: mdl-33514030

RESUMO

The scale of genetic methods are presently being expanded: forensic genetic assays previously were limited to tens of loci, but now technologies allow for a transition to forensic genomic approaches that assess thousands to millions of loci. However, there are subtle distinctions between genetic assays and their genomic counterparts (especially in the context of forensics). For instance, forensic genetic approaches tend to describe a locus as a haplotype, be it a microhaplotype or a short tandem repeat with its accompanying flanking information. In contrast, genomic assays tend to provide not haplotypes but sequence variants or differences, variants which in turn describe how the alleles apparently differ from the reference sequence. By the given construction, mitochondrial genetic assays can be thought of as genomic as they often describe genetic differences in a similar way. The mitochondrial genetics literature makes clear that sequence differences, unlike the haplotypes they encode, are not comparable to each other. Different alignment algorithms and different variant calling conventions may cause the same haplotype to be encoded in multiple ways. This ambiguity can affect evidence and reference profile comparisons as well as how "match" statistics are computed. In this study, a graph algorithm is described (and implemented in the MMDIT (Mitochondrial Mixture Database and Interpretation Tool) R package) that permits the assessment of forensic match statistics on mitochondrial DNA mixtures in a way that is invariant to both the variant calling conventions followed and the alignment parameters considered. The algorithm described, given a few modest constraints, can be used to compute the "random man not excluded" statistic or the likelihood ratio. The performance of the approach is assessed in in silico mitochondrial DNA mixtures.


Assuntos
Algoritmos , Biologia Computacional/métodos , DNA Mitocondrial , Sequenciamento de Nucleotídeos em Larga Escala , Análise de Sequência de DNA/métodos , Software , Alelos , Variação Genética , Genótipo , Haplótipos
10.
Forensic Sci Int Genet ; 55: 102568, 2021 11.
Artigo em Inglês | MEDLINE | ID: mdl-34416654

RESUMO

Short tandem repeats of the nuclear genome have been the preferred markers for analyzing forensic DNA mixtures. However, when nuclear DNA in a sample is degraded or limited, mitochondrial DNA (mtDNA) markers provide a powerful alternative. Though historically considered challenging, the interpretation and analysis of mtDNA mixtures have recently seen renewed interest with the advent of massively parallel sequencing. However, there are only a few software tools available for mtDNA mixture interpretation. To address this gap, the Mitochondrial Mixture Deconvolution and Interpretation Tool (MMDIT) was developed. MMDIT is an interactive application complete with a graphical user interface that allows users to deconvolve mtDNA (whole or partial genomes) mixtures into constituent donor haplotypes and estimate random match probabilities on these resultant haplotypes. In cases where deconvolution might not be feasible, the software allows mixture analysis directly within a binary framework (i.e. qualitatively, only using data on allele presence/absence). This paper explains the functionality of MMDIT, using an example of an in vitro two-person mtDNA mixture with a ratio of 1:4. The uniqueness of MMDIT lies in its ability to resolve mixtures into complete donor haplotypes using a statistical phasing framework before mixture analysis and evaluating statistical weights employing a novel graph algorithm approach. MMDIT is the first available open-source software that can automate mtDNA mixture deconvolution and analysis. The MMDIT web application can be accessed online at https://www.unthsc.edu/mmdit/. The source code is available at https://github.com/SammedMandape/MMDIT_UI and archived on zenodo (https://doi.org/10.5281/zenodo.4770184).


Assuntos
DNA Mitocondrial , Sequenciamento de Nucleotídeos em Larga Escala , DNA Mitocondrial/genética , Haplótipos , Humanos , Análise de Sequência de DNA , Software
11.
Artigo em Inglês | MEDLINE | ID: mdl-35300322

RESUMO

Estimating the time lag between a pair of time series is of significance in many practical applications. In this article, we introduce a method to quantify such lags by adapting the visibility graph algorithm, which converts time series into a mathematical graph. Currently widely used method to detect such lags is based on cross-correlations, which has certain limitations. We present simulated examples where the new method identifies the lag correctly and unambiguously while as the cross-correlation method does not. The article includes results from an extensive simulation study conducted to better understand the scenarios where the new method performed better or worse than the existing approach. We also present a likelihood based parametric modeling framework and consider frameworks for quantifying uncertainty and hypothesis testing for the new approach. We apply the current and new methods to two case studies, one from neuroscience and the other from environmental epidemiology, to illustrate the methods further.

12.
Front Pharmacol ; 11: 567088, 2020.
Artigo em Inglês | MEDLINE | ID: mdl-33424585

RESUMO

Traditional Chinese medicine (TCM) formulas treat complex diseases through combined botanical drugs which follow specific compatibility rules to reduce toxicity and increase efficiency. "Jun, Chen, Zuo and Shi" is one of most used compatibility rules in the combination of botanical drugs. However, due to the deficiency of traditional research methods, the quantified theoretical basis of herbal compatibility including principles of "Jun, Chen, Zuo and Shi" are still unclear. Network pharmacology is a new strategy based on system biology and multi-disciplines, which can systematically and comprehensively observe the intervention of drugs on disease networks, and is especially suitable for the research of TCM in the treatment of complex diseases. In this study, we systematically decoded the "Jun, Chen, Zuo and Shi" rules of Huanglian Jiedu Decoction (HJD) in the treatment of diseases for the first time. This interpretation method considered three levels of data. The data in the first level mainly depicts the characteristics of each component in single botanical drug of HJD, include the physical and chemical properties of component, ADME properties and functional enrichment analysis of component targets. The second level data is the characterization of component-target-protein (C-T-P) network in the whole protein-protein interaction (PPI) network, mainly include the characterization of degree and key communities in C-T-P network. The third level data is the characterization of intervention propagation properties of HJD in the treatment of different complex diseases, mainly include target coverage of pathogenic genes and propagation coefficient of intervention effect between target proteins and pathogenic genes. Finally, our method was validated by metabolic data, which could be used to detect the components absorbed into blood. This research shows the scientific basis of "Jun-Chen-Zuo-Shi" from a multi-dimensional perspective, and provides a good methodological reference for the subsequent interpretation of key components and speculation mechanism of the formula.

13.
Algorithms Mol Biol ; 13: 3, 2018.
Artigo em Inglês | MEDLINE | ID: mdl-29445416

RESUMO

BACKGROUND: Reconstructing the genome of a species from short fragments is one of the oldest bioinformatics problems. Metagenomic assembly is a variant of the problem asking to reconstruct the circular genomes of all bacterial species present in a sequencing sample. This problem can be naturally formulated as finding a collection of circular walks of a directed graph G that together cover all nodes, or edges, of G. APPROACH: We address this problem with the "safe and complete" framework of Tomescu and Medvedev (Research in computational Molecular biology-20th annual conference, RECOMB 9649:152-163, 2016). An algorithm is called safe if it returns only those walks (also called safe) that appear as subwalk in all metagenomic assembly solutions for G. A safe algorithm is called complete if it returns all safe walks of G. RESULTS: We give graph-theoretic characterizations of the safe walks of G, and a safe and complete algorithm finding all safe walks of G. In the node-covering case, our algorithm runs in time [Formula: see text], and in the edge-covering case it runs in time [Formula: see text]; n and m denote the number of nodes and edges, respectively, of G. This algorithm constitutes the first theoretical tight upper bound on what can be safely assembled from metagenomic reads using this problem formulation.

14.
Oncotarget ; 9(14): 11541-11558, 2018 Feb 20.
Artigo em Inglês | MEDLINE | ID: mdl-29545918

RESUMO

Breast cancer diagnosis in young women has emerged as an independent prognostic factor with higher recurrence risk and death than their older counterparts. We aim to find recurrent somatic copy number alteration (CNA) regions identified from breast cancer microarray data and associate the CNA status of the genes harbored in the regions to the survival of young women with breast cancer. By using the interval graph-based algorithm we developed, and the CNA data consisting of a Discovery set with 130 young women and a Validation set with 125 young women, we identified 38 validated recurrent CNAs containing 39 protein encoding genes. CNA gain regions encompassing genes CAPN2, CDC73 and ASB13 are the top 3 with the highest occurring frequencies in both the Discovery and Validation dataset, while gene SGCZ ranked top for the recurrent CNA loss regions. The mutation status of 9 of the 39 genes shows significant associations with breast cancer specific survival. Interestingly, the expression level of 2 of the 9 genes, ASB13 and SGCZ, shows significant association with survival outcome. Patients with CNA mutations in both of these genes had a worse survival outcome when compared to patients without the gene mutations. The mutated CNA status in gene ASB13 was associated with a higher gene expression, which predicted patient survival outcome. Together, identification of the CNA events with prognostic significance in young women with breast cancer may be used in genomic-guided treatment.

15.
J Comput Biol ; 24(6): 590-602, 2017 Jun.
Artigo em Inglês | MEDLINE | ID: mdl-27749096

RESUMO

Contig assembly is the first stage that most assemblers solve when reconstructing a genome from a set of reads. Its output consists of contigs-a set of strings that are promised to appear in any genome that could have generated the reads. From the introduction of contigs 20 years ago, assemblers have tried to obtain longer and longer contigs, but the following question remains: given a genome graph G (e.g., a de Bruijn, or a string graph), what are all the strings that can be safely reported from G as contigs? In this article, we answer this question using a model in which the genome is a circular covering walk. We also give a polynomial-time algorithm to find such strings, which we call omnitigs. Our experiments show that omnitigs are 66%-82% longer on average than the popular unitigs, and 29% of dbSNP locations have more neighbors in omnitigs than in unitigs.


Assuntos
Algoritmos , Mapeamento de Sequências Contíguas/métodos , Sequenciamento de Nucleotídeos em Larga Escala/métodos , Modelos Genéticos , Análise de Sequência de DNA/métodos , Genoma Humano , Humanos
16.
J Comput Biol ; 24(12): 1195-1211, 2017 Dec.
Artigo em Inglês | MEDLINE | ID: mdl-28891687

RESUMO

The problem of aligning multiple metabolic pathways is one of very challenging problems in computational biology. A metabolic pathway consists of three types of entities: reactions, compounds, and enzymes. Based on similarities between enzymes, Tohsato et al. gave an algorithm for aligning multiple metabolic pathways. However, the algorithm given by Tohsato et al. neglects the similarities among reactions, compounds, enzymes, and pathway topology. How to design algorithms for the alignment problem of multiple metabolic pathways based on the similarity of reactions, compounds, and enzymes? It is a difficult computational problem. In this article, we propose an algorithm for the problem of aligning multiple metabolic pathways based on the similarities among reactions, compounds, enzymes, and pathway topology. First, we compute a weight between each pair of like entities in different input pathways based on the entities' similarity score and topological structure using Ay et al.'s methods. We then construct a weighted k-partite graph for the reactions, compounds, and enzymes. We extract a mapping between these entities by solving the maximum-weighted k-partite matching problem by applying a novel heuristic algorithm. By analyzing the alignment results of multiple pathways in different organisms, we show that the alignments found by our algorithm correctly identify common subnetworks among multiple pathways.


Assuntos
Algoritmos , Biologia Computacional/métodos , Redes e Vias Metabólicas , Humanos , Mapeamento de Interação de Proteínas , Alinhamento de Sequência
17.
J Comput Sci ; 17(Pt 1): 62-72, 2016 Nov.
Artigo em Inglês | MEDLINE | ID: mdl-28503215

RESUMO

Structurally cohesive subgroups are a powerful and mathematically rigorous way to characterize network robustness. Their strength lies in the ability to detect strong connections among vertices that not only have no neighbors in common, but that may be distantly separated in the graph. Unfortunately, identifying cohesive subgroups is a computationally intensive problem, which has limited empirical assessments of cohesion to relatively small graphs of at most a few thousand vertices. We describe here an approach that exploits the properties of cliques, k-cores and vertex separators to iteratively reduce the complexity of the graph to the point where standard algorithms can be used to complete the analysis. As a proof of principle, we apply our method to the cohesion analysis of a 29,462-vertex biconnected component extracted from a 128,151-vertex co-authorship data set.

18.
J Mol Biol ; 425(21): 3964-9, 2013 Nov 01.
Artigo em Inglês | MEDLINE | ID: mdl-23886866

RESUMO

Advances in sequencing technologies are allowing genome-wide association studies at an ever-growing scale. The interpretation of these studies requires dealing with statistical and combinatorial challenges, owing to the multi-factorial nature of human diseases and the huge space of genomic markers that are being monitored. Recently, it was proposed that using protein-protein interaction network information could help in tackling these challenges by restricting attention to markers or combinations of markers that map to close proteins in the network. In this review, we survey techniques for integrating genomic variation data with network information to improve our understanding of complex diseases and reveal meaningful associations.


Assuntos
Variação Genética , Estudo de Associação Genômica Ampla/métodos , Domínios e Motivos de Interação entre Proteínas/genética , Proteínas/genética , Análise de Sequência de DNA/métodos , Predisposição Genética para Doença , Genoma Humano , Humanos , Proteínas/metabolismo
SELEÇÃO DE REFERÊNCIAS
DETALHE DA PESQUISA