Results 1 - 20 of 55
1.
J Biomed Inform ; 150: 104605, 2024 02.
Article in English | MEDLINE | ID: mdl-38331082

ABSTRACT

OBJECTIVE: Physicians and clinicians rely on data contained in electronic health records (EHRs), as recorded by health information technology (HIT), to make informed decisions about their patients. The reliability of HIT systems in this regard is critical to patient safety. Consequently, better tools are needed to monitor the performance of HIT systems for potential hazards that could compromise the collected EHRs and, in turn, patient safety. In this paper, we propose a new framework for detecting anomalies in EHRs using sequences of clinical events. This new framework, EHR-Bidirectional Encoder Representations from Transformers (BERT), is motivated by gaps in existing deep-learning methods, including high false negatives, sub-optimal accuracy, high computational cost, and the risk of information loss. EHR-BERT is rooted in the BERT architecture and tailored to address the shortcomings of the standard BERT approach, thereby enhancing anomaly detection in EHRs for healthcare applications. METHODS: The EHR-BERT framework was designed using the Sequential Masked Token Prediction (SMTP) method. This approach treats EHRs as natural-language sentences and iteratively masks input tokens during both training and prediction. It learns EHR sequence patterns in both directions around each event and identifies anomalies as deviations from normal execution models trained on EHR sequences. RESULTS: Extensive experiments on large EHR datasets across various medical domains demonstrate that EHR-BERT markedly improves upon existing models. It significantly reduces the number of false positives and enhances the detection rate, bolstering the reliability of anomaly detection in electronic health records. This improvement is attributed to the model's ability to minimize information loss and maximize data utilization. CONCLUSION: EHR-BERT shows strong potential for decreasing medical errors related to anomalous clinical events, positioning it as a valuable asset for enhancing patient safety and the overall standard of healthcare services. The framework overcomes the drawbacks of earlier models, making it a promising solution for healthcare professionals seeking to ensure the reliability and quality of health data.
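
A minimal sketch of the sequential masking-and-scoring idea described above, not the authors' implementation: EHR-BERT uses a trained bidirectional transformer to predict each masked event, whereas the stand-in below scores events with simple unigram frequencies learned from normal sequences. The event names and the toy reference model are illustrative only.

```python
import math
from collections import Counter

def train_reference_model(normal_sequences):
    # Toy stand-in for a trained predictor: unigram event frequencies.
    counts = Counter(e for seq in normal_sequences for e in seq)
    total = sum(counts.values())
    return {event: n / total for event, n in counts.items()}

def anomaly_score(sequence, model, floor=1e-6):
    # Mask one position at a time and accumulate the "surprise" of the hidden
    # token; a BERT-style model would consume the masked sequence, while this
    # unigram stand-in ignores the surrounding context entirely.
    surprises = []
    for i in range(len(sequence)):
        masked = sequence[:i] + ["[MASK]"] + sequence[i + 1:]  # what a BERT model would see
        p = model.get(sequence[i], floor)
        surprises.append(-math.log(p))
    return sum(surprises) / len(surprises)

normal = [["admit", "triage", "labs", "discharge"],
          ["admit", "labs", "imaging", "discharge"]]
model = train_reference_model(normal)
print(anomaly_score(["admit", "labs", "discharge"], model))        # familiar events
print(anomaly_score(["admit", "transfusion", "discharge"], model)) # unseen event scores higher
```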


Subjects
Electronic Health Records, Health Information Systems, Humans, Reproducibility of Results, Records, Health Personnel
2.
J Biomed Inform ; 135: 104219, 2022 11.
Article in English | MEDLINE | ID: mdl-36243337

ABSTRACT

Detecting anomalous sequences is an integral part of building and protecting modern large-scale health information technology (HIT) systems. These HIT systems generate a large volume of records of patients' states and significant events, which provide a valuable resource to help improve clinical decisions, patient care processes, and other issues. However, detecting anomalous sequences in electronic health records (EHR) remains a challenge in healthcare applications for several reasons, including imbalances in the data, the complexity of relationships between events in a sequence, and the curse of dimensionality. Conventional anomaly detection methods use the finite sequence of events to discriminate sequences. They fail to incorporate salient event details under variable higher-order dependencies (e.g., the duration between events) that can provide better discrimination of sequences in their models. To address this problem, we propose event sequence and subsequence anomaly detection algorithms that (1) use network-based representations of interactions in the data, (2) account for variable higher-order dependencies in the data, and (3) incorporate event durations for adequate discrimination of the data. The proposed approach identifies anomalies by monitoring the change in the graph after the test sequence is removed from the network. The change is quantified using graph distance metrics so that dramatic changes in the network can be attributed to the removed sequence. Furthermore, the proposed subsequence algorithm recommends plausible paths and salient information for the detected anomalous subsequences. Our results show that the proposed event sequence anomaly detection algorithm outperforms the baseline methods for both synthetic data and real-world EHR data.
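
A rough, first-order sketch of the graph-perturbation idea (the published algorithms also model variable higher-order dependencies and event durations, which this toy omits): build a weighted transition graph from all sequences, remove one sequence, and measure the relative change in edge weights. Event names are placeholders.

```python
from collections import Counter

def edge_counts(sequences):
    # Weighted first-order transition "graph": (event_i, event_j) -> count.
    g = Counter()
    for seq in sequences:
        g.update(zip(seq, seq[1:]))
    return g

def removal_distance(sequences, index):
    # Relative change in edge weights after removing the sequence at `index`.
    full = edge_counts(sequences)
    without = edge_counts(sequences[:index] + sequences[index + 1:])
    return sum(abs(full[e] - without[e]) / full[e] for e in full)

seqs = [["admit", "labs", "discharge"] for _ in range(20)]
seqs.append(["admit", "surgery", "transfer"])                 # rare pathway
scores = [removal_distance(seqs, i) for i in range(len(seqs))]
print(max(range(len(seqs)), key=scores.__getitem__))          # index of the most anomalous sequence
```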


Subjects
Algorithms, Electronic Health Records, Humans
3.
BMC Bioinformatics ; 20(Suppl 15): 503, 2019 Dec 24.
Article in English | MEDLINE | ID: mdl-31874625

ABSTRACT

BACKGROUND: Cluster analysis is a core task in modern data-centric computation. Algorithmic choice is driven by factors such as data size and heterogeneity, the similarity measures employed, and the type of clusters sought. Familiarity and mere preference often play a significant role as well. Comparisons between clustering algorithms tend to focus on cluster quality. Such comparisons are complicated by the fact that algorithms often have multiple settings that can affect the clusters produced. Such a setting may represent, for example, a preset variable, a parameter of interest, or various sorts of initial assignments. A question of interest then is this: to what degree do the clusters produced vary as setting values change? RESULTS: This work introduces a new metric, termed simply "robustness", designed to answer that question. Robustness is an easily-interpretable measure of the propensity of a clustering algorithm to maintain output coherence over a range of settings. The robustness of eleven popular clustering algorithms is evaluated over some two dozen publicly available mRNA expression microarray datasets. Given their straightforwardness and predictability, hierarchical methods generally exhibited the highest robustness on most datasets. Of the more complex strategies, the paraclique algorithm yielded consistently higher robustness than other algorithms tested, approaching and even surpassing hierarchical methods on several datasets. Other techniques exhibited mixed robustness, with no clear distinction between them. CONCLUSIONS: Robustness provides a simple and intuitive measure of the stability and predictability of a clustering algorithm. It can be a useful tool to aid both in algorithm selection and in deciding how much effort to devote to parameter tuning.
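
The abstract does not spell out the robustness formula, so the sketch below uses one plausible stand-in: cluster the same data under several settings and average the pairwise adjusted Rand index between the resulting labelings (higher means more setting-insensitive). The data and parameter choices are synthetic.

```python
from itertools import combinations

import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 50))            # stand-in for an expression matrix

# The "settings" being varied: here, just the requested number of clusters.
labelings = [KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
             for k in (3, 4, 5, 6)]

robustness = np.mean([adjusted_rand_score(a, b)
                      for a, b in combinations(labelings, 2)])
print(f"mean pairwise ARI across settings: {robustness:.3f}")
```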


Subjects
Algorithms, Biometry, Cluster Analysis, Gene Expression Profiling
4.
Nucleic Acids Res ; 44(D1): D555-9, 2016 Jan 04.
Article in English | MEDLINE | ID: mdl-26656951

ABSTRACT

The GeneWeaver data and analytics website (www.geneweaver.org) is a publicly available resource for storing, curating and analyzing sets of genes from heterogeneous data sources. The system enables discovery of relationships among genes, variants, traits, drugs, environments, anatomical structures and diseases implicitly found through gene set intersections. Since the previous review in the 2012 Nucleic Acids Research Database issue, GeneWeaver's underlying analytics platform has been enhanced, its number and variety of publicly available gene set data sources have been increased, and its advanced search mechanisms have been expanded. In addition, its interface has been redesigned to take advantage of flexible web services, programmatic data access, and a refined data model for handling gene network data in addition to its original emphasis on gene set data. By enumerating the common and distinct biological molecules associated with all subsets of curated or user submitted groups of gene sets and gene networks, GeneWeaver empowers users with the ability to construct data driven descriptions of shared and unique biological processes, diseases and traits within and across species.


Subjects
Genetic Databases, Disease/genetics, Genes, Genomics, Animals, Dogs, Humans, Mice, Phenotype, Rats, Software
5.
PLoS Genet ; 10(1): e1004059, 2014 Jan.
Article in English | MEDLINE | ID: mdl-24391521

ABSTRACT

Altered DNA methylation patterns in CD4(+) T-cells indicate the importance of epigenetic mechanisms in inflammatory diseases. However, the identification of these alterations is complicated by the heterogeneity of most inflammatory diseases. Seasonal allergic rhinitis (SAR) is an optimal disease model for the study of DNA methylation because of its well-defined phenotype and etiology. We generated genome-wide DNA methylation (N(patients) = 8, N(controls) = 8) and gene expression (N(patients) = 9, N(controls) = 10) profiles of CD4(+) T-cells from SAR patients and healthy controls using Illumina's HumanMethylation450 and HT-12 microarrays, respectively. DNA methylation profiles clearly and robustly distinguished SAR patients from controls, during and outside the pollen season. In agreement with previously published studies, gene expression profiles of the same samples failed to separate patients and controls. Separation by methylation (N(patients) = 12, N(controls) = 12), but not by gene expression (N(patients) = 21, N(controls) = 21), was also observed in an in vitro model system in which purified PBMCs from patients and healthy controls were challenged with allergen. We observed changes in the proportions of memory T-cell populations between patients (N(patients) = 35) and controls (N(controls) = 12), which could explain the observed difference in DNA methylation. Our data highlight the potential of epigenomics in the stratification of immune disease and represent the first successful molecular classification of SAR using CD4(+) T cells.


Subjects
CD4-Positive T-Lymphocytes/metabolism, DNA Methylation/genetics, Genetic Epigenesis, Seasonal Allergic Rhinitis/genetics, Adult, Allergens/genetics, Allergens/immunology, CD4-Positive T-Lymphocytes/immunology, Gene Expression, Human Genome, Humans, Molecular Pathology, Pollen/immunology, Seasonal Allergic Rhinitis/immunology, Seasonal Allergic Rhinitis/pathology
6.
J Natl Med Assoc ; 109(4): 246-251, 2017.
Article in English | MEDLINE | ID: mdl-29173931

ABSTRACT

OBJECTIVE: Describe trends in non-Hispanic black infant mortality (IM) in the New York City (NYC) counties of Bronx, Kings, Queens, and Manhattan and correlations with gun-related assault mortality. METHODS: Linked Birth/Infant Death data (1999-2013) and Compressed Mortality data at ages 1 to ≥85 years (1999-2013). NYC and United States (US) Census data for income inequality and poverty. Pearson coefficients were used to describe correlations of IM with gun-related assault mortality and other causes of death. RESULTS: In NYC, the risk of non-Hispanic black IM in 2013 was 49% lower than in 1995 (rate ratio: 0.51; 95% CI: 0.43, 0.61). Yearly declines between 1999 and 2013 were significantly correlated with declines in gun-related assault mortality (correlation coefficient (r) = 0.70, p = 0.004), drug-related mortality (r = 0.59, p = 0.020), major heart disease and stroke (r = 0.85, p < 0.001), malignant neoplasms (r = 0.57, p = 0.026), diabetes mellitus (r = 0.63, p = 0.011), and pneumonia and influenza (r = 0.78, p < 0.001). There were no significant correlations of IM with chronic lower respiratory or liver disease, non-drug-related accidental deaths, and non-gun-related assault. Yearly IM (1995-2012) was inversely correlated with income share of the top 1% of the population (r = -0.66, p = 0.007). CONCLUSIONS: In NYC, non-Hispanic black IM declined significantly despite increasing income inequality and was strongly correlated with gun-related assault mortality and other major causes of death. These data are compatible with the hypothesis that activities related to overall population health, including those pertaining to gun-related homicide, may provide clues to reducing IM. Analytic epidemiological studies are needed to test these and other hypotheses formulated from these descriptive data.
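
For readers unfamiliar with the rate-ratio statistic quoted above, the conventional calculation places the confidence interval on the log scale. The counts below are hypothetical and are not the study's data.

```python
import math

def rate_ratio_ci(d1, n1, d2, n2, z=1.96):
    # Rate ratio (d1/n1) / (d2/n2) with a log-scale 95% confidence interval.
    rr = (d1 / n1) / (d2 / n2)
    se = math.sqrt(1 / d1 + 1 / d2)          # SE of ln(RR) for Poisson counts
    return rr, math.exp(math.log(rr) - z * se), math.exp(math.log(rr) + z * se)

print(rate_ratio_ci(d1=300, n1=40000, d2=600, n2=41000))  # illustrative counts only
```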


Subjects
Black or African American, Cause of Death/trends, Gun Violence/trends, Infant Death/etiology, Infant Mortality/trends, Urban Health/trends, Adolescent, Adult, Aged, Aged 80 and over, Child, Preschool Child, Female, Gun Violence/ethnology, Humans, Infant, Infant Mortality/ethnology, Male, Middle Aged, New York City/epidemiology, Socioeconomic Factors, Urban Health/ethnology, Young Adult
7.
Environ Res ; 146: 173-84, 2016 Apr.
Article in English | MEDLINE | ID: mdl-26765097

ABSTRACT

The exposome provides a framework for elucidating an uncharacterized molecular mechanism conferring enhanced susceptibility of macrophage membranes to bacterial infection after exposure to the environmental contaminant benzo(a)pyrene [B(a)P]. The fundamental requirement in activation of macrophage effector functions is the binding of immunoglobulins to Fc receptors. FcγRIIa (CD32a), a member of the Fc family of immunoreceptors with low affinity for immunoglobulin G, has been reported to bind preferentially to IgG within lipid rafts. Previous research suggested that exposure to B(a)P suppressed macrophage effector functions, but the molecular mechanisms remain elusive. The goal of this study was to elucidate the mechanism(s) by which B(a)P exposure suppresses macrophage function by examining the effects of exposure-induced insult on CD32-lipid raft interactions in the regulation of IgG binding to CD32. The results demonstrate that exposure of macrophages to B(a)P alters lipid raft integrity by decreasing membrane cholesterol by 25% while shifting CD32 into non-lipid raft fractions. This diminution in membrane cholesterol, together with the 30% exclusion of CD32 from lipid rafts, causes a significant reduction in CD32-mediated IgG binding and thereby suppresses essential macrophage effector functions. Such exposures across the lifespan would have the potential to induce immunosuppressive endophenotypes in vulnerable populations.


Subjects
Air Pollutants/toxicity, Benzo(a)pyrene/toxicity, Macrophages/drug effects, Membrane Microdomains/drug effects, Nystatin/pharmacology, beta-Cyclodextrins/pharmacology, Cultured Cells, Humans, Immunoglobulin G/metabolism, Macrophages/immunology, IgG Receptors/genetics, IgG Receptors/metabolism, Signal Transduction
8.
Discrete Appl Math ; 204: 208-212, 2016 May 11.
Article in English | MEDLINE | ID: mdl-27057077

ABSTRACT

The scientific literature teems with clique-centric clustering strategies. In this paper we analyze one such method, the paraclique algorithm. Paraclique has found practical utility in a variety of application domains, and has been successfully employed to reduce the effects of noise. Nevertheless, its formal analysis and worst-case guarantees have remained elusive. We address this issue by deriving a series of lower bounds on paraclique densities.
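
A simplified sketch of the paraclique construction analyzed in the paper: seed with a maximum clique, then absorb any vertex that misses at most `glom` members of the growing set. The exact admission rule and parameterization in the published algorithm may differ; this version is for illustration only.

```python
import networkx as nx

def paraclique(G, glom=1):
    # Seed with a maximum clique (find_cliques enumerates maximal cliques,
    # exponential in the worst case but fine for small graphs).
    members = set(max(nx.find_cliques(G), key=len))
    changed = True
    while changed:
        changed = False
        for v in set(G) - members:
            missed = sum(1 for u in members if not G.has_edge(u, v))
            if missed <= glom:
                members.add(v)
                changed = True
    return members

G = nx.karate_club_graph()
print(sorted(paraclique(G, glom=1)))
```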

9.
Mamm Genome ; 26(9-10): 556-66, 2015 Oct.
Article in English | MEDLINE | ID: mdl-26092690

ABSTRACT

A persistent challenge lies in the interpretation of consensus and discord from functional genomics experimentation. Harmonizing and analyzing this data will enable investigators to discover relations of many genes to many diseases, and from many phenotypes and experimental paradigms to many diseases through their genomic substrates. The GeneWeaver.org system provides a platform for cross-species integration and interrogation of heterogeneous curated and experimentally derived functional genomics data. GeneWeaver enables researchers to store, share, analyze, and compare results of their own genome-wide functional genomics experiments in an environment containing rich companion data obtained from major curated repositories, including the Mouse Genome Database and other model organism databases, along with derived data from highly specialized resources, publications, and user submissions. The data, largely consisting of gene sets and putative biological networks, are mapped onto one another through gene identifiers and homology across species. A versatile suite of interactive tools enables investigators to perform a variety of set analysis operations to find consilience among these often noisy experimental results. Fast algorithms enable real-time analysis of large queries. Specific applications include prioritizing candidate genes for quantitative trait loci, identifying biologically valid mouse models and phenotypic assays for human disease, finding the common biological substrates of related diseases, classifying experiments and the biological concepts they represent from empirical data, and applying patterns of genomic evidence to implicate novel genes in disease. These results illustrate an alternative to strict emphasis on replicability, whereby researchers classify experimental results to identify the conditions that lead to their similarity.
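
The core set operation described above can be illustrated in a few lines: for every combination of gene sets, list the genes found in exactly those sets (an UpSet-style decomposition). The set names and gene symbols below are placeholders, not GeneWeaver data.

```python
from itertools import combinations

gene_sets = {
    "addiction_qtl": {"Drd2", "Chrna5", "Gabra2"},
    "anxiety_gwas":  {"Gabra2", "Crhr1", "Htr1a"},
    "mouse_ko":      {"Drd2", "Crhr1", "Gabra2"},
}

names = list(gene_sets)
for r in range(1, len(names) + 1):
    for combo in combinations(names, r):
        inside = set.intersection(*(gene_sets[n] for n in combo))
        others = [gene_sets[n] for n in names if n not in combo]
        exclusive = inside - set().union(*others)   # genes in exactly these sets
        if exclusive:
            print(combo, sorted(exclusive))
```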


Subjects
Genetic Databases, Genomics, Quantitative Trait Loci/genetics, Algorithms, Animals, Humans, Internet, Mice, Phenotype, Software, Transcriptome/genetics
10.
BMC Bioinformatics ; 15: 110, 2014 Apr 15.
Article in English | MEDLINE | ID: mdl-24731198

ABSTRACT

BACKGROUND: Integrating and analyzing heterogeneous genome-scale data is a huge algorithmic challenge for modern systems biology. Bipartite graphs can be useful for representing relationships across pairs of disparate data types, with the interpretation of these relationships accomplished through an enumeration of maximal bicliques. Most previously known techniques are ill-suited to this foundational task because they are relatively inefficient and do not scale effectively. In this paper, a powerful new algorithm is described that produces all maximal bicliques in a bipartite graph. Unlike most previous approaches, the new method neither places undue restrictions on its input nor inflates the problem size. Efficiency is achieved through an innovative exploitation of bipartite graph structure, and through computational reductions that rapidly eliminate non-maximal candidates from the search space. An iterative selection of vertices for consideration based on non-decreasing common neighborhood sizes boosts efficiency and leads to more balanced recursion trees. RESULTS: The new technique is implemented and compared to previously published approaches from graph theory and data mining. Formal time and space bounds are derived. Experiments are performed on both random graphs and graphs constructed from functional genomics data. It is shown that the new method substantially outperforms the best previous alternatives. CONCLUSIONS: The new method is streamlined, efficient, and particularly well-suited to the study of huge and diverse biological data. A robust implementation has been incorporated into GeneWeaver, an online tool for integrating and analyzing functional genomics experiments, available at http://geneweaver.org. The enormous increase in scalability it provides empowers users to study complex and previously unassailable gene-set associations between genes and their biological functions in a hierarchical fashion and on a genome-wide scale. This practical computational resource is adaptable to almost any applications environment in which bipartite graphs can be used to model relationships between pairs of heterogeneous entities.
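
To make the task concrete, here is a brute-force enumeration of maximal bicliques in a tiny bipartite graph. This exponential scan is exactly what the paper's algorithm is designed to avoid; it is included only to show what is being computed. The vertex names are invented.

```python
from itertools import combinations

def maximal_bicliques(left, edges):
    adj = {u: {v for (a, v) in edges if a == u} for u in left}
    found = set()
    for r in range(1, len(left) + 1):
        for subset in combinations(sorted(left), r):
            common = set.intersection(*(adj[u] for u in subset))
            if not common:
                continue
            # Close the left side: every left vertex adjacent to all of `common`.
            closed = frozenset(u for u in left if common <= adj[u])
            found.add((closed, frozenset(common)))
    # Keep only bicliques not strictly contained in another.
    return [(set(l), set(r)) for (l, r) in found
            if not any((l <= l2 and r <= r2) and (l, r) != (l2, r2)
                       for (l2, r2) in found)]

left = {"geneA", "geneB", "geneC"}
edges = {("geneA", "trait1"), ("geneA", "trait2"),
         ("geneB", "trait1"), ("geneC", "trait2")}
for l, r in maximal_bicliques(left, edges):
    print(sorted(l), "<->", sorted(r))
```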


Subjects
Algorithms, Genomics/methods, Animals, Computer Graphics, Humans, Mice, Rats, Software
11.
BMC Bioinformatics ; 15: 383, 2014 Dec 10.
Article in English | MEDLINE | ID: mdl-25492630

ABSTRACT

BACKGROUND: Our knowledge of global protein-protein interaction (PPI) networks in complex organisms such as humans is hindered by technical limitations of current methods. RESULTS: On the basis of short co-occurring polypeptide regions, we developed a tool called MP-PIPE capable of predicting a global human PPI network within 3 months. With a recall of 23% at a precision of 82.1%, we predicted 172,132 putative PPIs. We demonstrate the usefulness of these predictions through a range of experiments. CONCLUSIONS: The speed and accuracy associated with MP-PIPE can make this a potential tool to study individual human PPI networks (from genomic sequences alone) for personalized medicine.


Subjects
Computational Biology/methods, Human Genome, Protein Interaction Mapping/methods, Proteins/metabolism, Proteome/analysis, Software, Humans
12.
Nucleic Acids Res ; 40(Database issue): D1067-76, 2012 Jan.
Article in English | MEDLINE | ID: mdl-22080549

ABSTRACT

High-throughput genome technologies have produced a wealth of data on the association of genes and gene products to biological functions. Investigators have discovered value in combining their experimental results with published genome-wide association studies, quantitative trait locus, microarray, RNA-sequencing and mutant phenotyping studies to identify gene-function associations across diverse experiments, species, conditions, behaviors or biological processes. These experimental results are typically derived from disparate data repositories, publication supplements or reconstructions from primary data stores. This leaves bench biologists with the complex and unscalable task of integrating data by identifying and gathering relevant studies, reanalyzing primary data, unifying gene identifiers and applying ad hoc computational analysis to the integrated set. The freely available GeneWeaver (http://www.GeneWeaver.org) powered by the Ontological Discovery Environment is a curated repository of genomic experimental results with an accompanying tool set for dynamic integration of these data sets, enabling users to interactively address questions about sets of biological functions and their relations to sets of genes. Thus, large numbers of independently published genomic results can be organized into new conceptual frameworks driven by the underlying, inferred biological relationships rather than a pre-existing semantic framework. An empirical 'ontology' is discovered from the aggregate of experimental knowledge around user-defined areas of biological inquiry.


Subjects
Genetic Databases, Genomics/methods, Computer Graphics, Genes, Internet, Software, Systems Integration
13.
Nat Genet ; 37(3): 233-42, 2005 Mar.
Article in English | MEDLINE | ID: mdl-15711545

ABSTRACT

Patterns of gene expression in the central nervous system are highly variable and heritable. This genetic variation among normal individuals leads to considerable structural, functional and behavioral differences. We devised a general approach to dissect genetic networks systematically across biological scale, from base pairs to behavior, using a reference population of recombinant inbred strains. We profiled gene expression using Affymetrix oligonucleotide arrays in the BXD recombinant inbred strains, for which we have extensive SNP and haplotype data. We integrated a complementary database comprising 25 years of legacy phenotypic data on these strains. Covariance among gene expression and pharmacological and behavioral traits is often highly significant, corroborates known functional relations and is often generated by common quantitative trait loci. We found that a small number of major-effect quantitative trait loci jointly modulated large sets of transcripts and classical neural phenotypes in patterns specific to each tissue. We developed new analytic and graph theoretical approaches to study shared genetic modulation of networks of traits using gene sets involved in neural synapse function as an example. We built these tools into an open web resource called WebQTL that can be used to test a broad array of hypotheses.


Subjects
Gene Expression Regulation, Nervous System Physiological Phenomena, Quantitative Trait Loci, Animals, Genetic Epistasis, Haplotypes, Mice, Oligonucleotide Array Sequence Analysis, Phenotype, Single Nucleotide Polymorphism, Messenger RNA/genetics
14.
J Comput Biol ; 31(6): 539-548, 2024 Jun.
Article in English | MEDLINE | ID: mdl-38781420

ABSTRACT

The thresholding problem is studied in the context of graph theoretical analysis of gene co-expression data. A number of thresholding methodologies are described, implemented, and tested over a large collection of graphs derived from real high-throughput biological data. Comparative results are presented and discussed.
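
The basic construction that every thresholding methodology operates on can be shown in a few lines: correlate all gene pairs and keep edges whose correlation magnitude clears a candidate threshold. The matrix below is synthetic; the paper's contribution concerns how the threshold itself is chosen.

```python
import numpy as np

rng = np.random.default_rng(1)
expr = rng.normal(size=(50, 30))        # 50 genes x 30 samples (synthetic)

r = np.corrcoef(expr)                   # 50 x 50 Pearson correlation matrix
t = 0.5                                 # one candidate threshold under evaluation
adj = (np.abs(r) >= t) & ~np.eye(len(r), dtype=bool)
edges = np.argwhere(np.triu(adj))       # gene pairs retained at this threshold
print(len(edges), "edges retained at |r| >=", t)
```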


Subjects
Algorithms, Gene Expression Profiling, Gene Expression Profiling/methods, Humans, Computational Biology/methods, Gene Regulatory Networks
15.
Exposome ; 4(1): osae001, 2024.
Article in English | MEDLINE | ID: mdl-38344436

ABSTRACT

This paper explores the exposome concept and its role in elucidating the interplay between environmental exposures and human health. We introduce two key concepts critical for exposomics research. First, we discuss the joint impact of genetics and environment on phenotypes, emphasizing the variance attributable to shared and nonshared environmental factors, underscoring the complexity of quantifying the exposome's influence on health outcomes. Second, we discuss the importance of advanced data-driven methods in large cohort studies for exposomic measurements. Here, we introduce the exposome-wide association study (ExWAS), an approach designed for systematic discovery of relationships between phenotypes and various exposures, identifying significant associations while controlling for multiple comparisons. We advocate for the standardized use of the term "exposome-wide association study, ExWAS," to facilitate clear communication and literature retrieval in this field. The paper aims to guide future health researchers in understanding and evaluating exposomic studies. Our discussion extends to emerging topics, such as FAIR Data Principles, biobanked healthcare datasets, and the functional exposome, outlining the future directions in exposomic research. This abstract provides a succinct overview of our comprehensive approach to understanding the complex dynamics of the exposome and its significant implications for human health.
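
A skeletal ExWAS loop, under the simplifying assumption that each exposure is tested with a plain correlation; real analyses would typically fit regression models with covariate adjustment before applying multiple-comparison control. The data are synthetic.

```python
import numpy as np
from scipy.stats import pearsonr
from statsmodels.stats.multitest import multipletests

rng = np.random.default_rng(2)
n, p = 500, 200
exposures = rng.normal(size=(n, p))                    # synthetic exposome matrix
outcome = 0.3 * exposures[:, 0] + rng.normal(size=n)   # one true association

# One association test per exposure, then Benjamini-Hochberg FDR control.
pvals = np.array([pearsonr(exposures[:, j], outcome)[1] for j in range(p)])
reject, qvals, _, _ = multipletests(pvals, alpha=0.05, method="fdr_bh")
print("exposures passing FDR control:", np.flatnonzero(reject))
```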

16.
Environ Health Perspect ; 131(12): 124201, 2023 Dec.
Article in English | MEDLINE | ID: mdl-38109119

ABSTRACT

BACKGROUND: The exposome serves as a popular framework in which to study exposures from chemical and nonchemical stressors across the life course and the differing roles that these exposures can play in human health. As a result, data relevant to the exposome have been used as a resource in the quest to untangle complicated health trajectories and help connect the dots from exposures to adverse outcome pathways. OBJECTIVES: The primary aim of this methods seminar is to clarify and review preprocessing techniques critical for accurate and effective external exposomic data analysis. Scalability is emphasized through an application of highly innovative combinatorial techniques coupled with more traditional statistical strategies. The Public Health Exposome is used as an archetypical model. The novelty and innovation of this seminar's focus stem from its methodical, comprehensive treatment of preprocessing and its demonstration of the positive effects preprocessing can have on downstream analytics. DISCUSSION: State-of-the-art technologies are described for harmonizing data, mitigating noise that can stymie downstream interpretation, and selecting key exposomic features, without which analytics may lose focus. A main task is the reduction of multicollinearity, a particularly formidable problem that frequently arises from repeated measurements of similar events taken at various times and from multiple sources. Empirical results highlight the effectiveness of a carefully planned preprocessing workflow as demonstrated in the context of more highly concentrated variable lists, improved correlational distributions, and enhanced downstream analytics for latent relationship discovery. The nascent field of exposome science can be characterized by the need to analyze and interpret a complex confluence of highly inhomogeneous spatial and temporal data, which may present formidable challenges to even the most powerful analytical tools. A systematic approach to preprocessing can therefore provide an essential first step in the application of modern computer and data science methods. https://doi.org/10.1289/EHP12901.
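
One preprocessing step named above, multicollinearity reduction, can be sketched as a greedy prune of features whose pairwise correlation exceeds a cutoff. This is only a fragment of the workflow the seminar describes, and the column names are invented.

```python
import numpy as np
import pandas as pd

def prune_correlated(df, cutoff=0.95):
    # Drop one column from each highly correlated pair (greedy, order-dependent).
    corr = df.corr().abs()
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    to_drop = [col for col in upper.columns if (upper[col] > cutoff).any()]
    return df.drop(columns=to_drop), to_drop

rng = np.random.default_rng(3)
base = rng.normal(size=100)
df = pd.DataFrame({
    "pm25_site_a": base,
    "pm25_site_b": base + rng.normal(scale=0.01, size=100),  # near-duplicate measurement
    "median_income": rng.normal(size=100),
})
kept, dropped = prune_correlated(df)
print("dropped:", dropped)
```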


Subjects
Adverse Outcome Pathways, Data Analysis, Exposome, Humans, Public Health
17.
BMC Bioinformatics ; 13 Suppl 10: S5, 2012 Jun 25.
Article in English | MEDLINE | ID: mdl-22759429

ABSTRACT

BACKGROUND: The maximum clique enumeration (MCE) problem asks that we identify all maximum cliques in a finite, simple graph. MCE is closely related to two other well-known and widely-studied problems: the maximum clique optimization problem, which asks us to determine the size of a largest clique, and the maximal clique enumeration problem, which asks that we compile a listing of all maximal cliques. Naturally, these three problems are NP-hard, given that they subsume the classic version of the NP-complete clique decision problem. MCE can be solved in principle with standard enumeration methods due to Bron, Kerbosch, Kose and others. Unfortunately, these techniques are ill-suited to graphs encountered in our applications. We must solve MCE on instances arising in data mining and computational biology, where high-throughput data capture often creates graphs of extreme size and density. MCE can also be solved in principle using more modern algorithms based in part on vertex cover and the theory of fixed-parameter tractability (FPT). While FPT is an improvement, these algorithms too can fail to scale sufficiently well as the sizes and densities of our datasets grow. RESULTS: An extensive testbed of benchmark graphs is created using publicly available transcriptomic datasets from the Gene Expression Omnibus (GEO). Empirical testing reveals crucial but latent features of such high-throughput biological data. In turn, it is shown that these features distinguish real data from random data intended to reproduce salient topological features. In particular, with real data there tends to be an unusually high degree of maximum clique overlap. Armed with this knowledge, novel decomposition strategies are tuned to the data and coupled with the best FPT MCE implementations. CONCLUSIONS: Several algorithmic improvements to MCE are made which progressively decrease the run time on graphs in the testbed. Frequently the final runtime improvement is several orders of magnitude. As a result, instances which were once prohibitively time-consuming to solve are brought into the domain of realistic feasibility.
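
As a baseline illustration of what MCE computes, the snippet below enumerates maximal cliques with networkx and keeps those of maximum size. It is exact but will not scale to the large, dense co-expression graphs the paper targets, which is precisely why the specialized decompositions are needed.

```python
import networkx as nx

def maximum_cliques(G):
    maximal = list(nx.find_cliques(G))       # Bron-Kerbosch-style enumeration
    omega = max(len(c) for c in maximal)     # clique number of G
    return [sorted(c) for c in maximal if len(c) == omega]

G = nx.gnp_random_graph(60, 0.25, seed=4)
for clique in maximum_cliques(G):
    print(clique)
```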


Subjects
Algorithms, Computational Biology/methods, Gene Expression Profiling/methods, Software
18.
BMC Bioinformatics ; 13 Suppl 10: S7, 2012 Jun 25.
Article in English | MEDLINE | ID: mdl-22759431

ABSTRACT

BACKGROUND: A wealth of clustering algorithms has been applied to gene co-expression experiments. These algorithms cover a broad range of approaches, from conventional techniques such as k-means and hierarchical clustering, to graphical approaches such as k-clique communities, weighted gene co-expression networks (WGCNA) and paraclique. Comparison of these methods to evaluate their relative effectiveness provides guidance to algorithm selection, development and implementation. Most prior work on comparative clustering evaluation has focused on parametric methods. Graph theoretical methods are recent additions to the tool set for the global analysis and decomposition of microarray co-expression matrices that have not generally been included in earlier methodological comparisons. In the present study, a variety of parametric and graph theoretical clustering algorithms are compared using well-characterized transcriptomic data at a genome scale from Saccharomyces cerevisiae. METHODS: For each clustering method under study, a variety of parameters were tested. Jaccard similarity was used to measure each cluster's agreement with every GO and KEGG annotation set, and the highest Jaccard score was assigned to the cluster. Clusters were grouped into small, medium, and large bins, and the Jaccard scores of the top five scoring clusters in each bin were averaged and reported as the best average top 5 (BAT5) score for the particular method. RESULTS: Clusters produced by each method were evaluated based upon the positive match to known pathways. This produces a readily interpretable ranking of the relative effectiveness of clustering on the genes. Methods were also tested to determine whether they were able to identify clusters consistent with those identified by other clustering methods. CONCLUSIONS: Validation of clusters against known gene classifications demonstrates that for this data, graph-based techniques outperform conventional clustering approaches, suggesting that further development and application of combinatorial strategies is warranted.
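
A compact sketch of the scoring scheme described in METHODS: each cluster receives its best Jaccard match over the annotation sets, clusters are grouped into size bins, and the top five scores per bin are averaged (BAT5). The bin boundaries and gene identifiers below are arbitrary stand-ins.

```python
def jaccard(a, b):
    return len(a & b) / len(a | b)

def bat5(clusters, annotation_sets, bins=((0, 20), (20, 100), (100, 10**9))):
    # Best annotation match per cluster, then mean of the top five per size bin.
    best = [(len(c), max(jaccard(c, ann) for ann in annotation_sets))
            for c in clusters]
    scores = {}
    for lo, hi in bins:
        in_bin = sorted((s for size, s in best if lo <= size < hi), reverse=True)
        top = in_bin[:5]
        scores[(lo, hi)] = sum(top) / len(top) if top else None
    return scores

clusters = [{"YAL001C", "YAL002W", "YBR001C"}, {f"Y{i:03d}" for i in range(40)}]
annotations = [{"YAL001C", "YAL002W"}, {"YBR001C", "YBR002C"}]
print(bat5(clusters, annotations))
```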


Subjects
Algorithms, Computational Biology/methods, Gene Expression Profiling/methods, Cluster Analysis, Fungal Genome, Oligonucleotide Array Sequence Analysis/methods, Saccharomyces cerevisiae/genetics
19.
Sci Rep ; 12(1): 11897, 2022 07 13.
Article in English | MEDLINE | ID: mdl-35831440

ABSTRACT

Deciding the size of a minimum dominating set is a classic NP-complete problem. It has found increasing utility as the basis for classifying vertices in networks derived from protein-protein, noncoding RNA, metabolic, and other biological interaction data. In this context it can be helpful, for example, to identify those vertices that must be present in any minimum solution. Current classification methods, however, can require solving as many instances as there are vertices, rendering them computationally prohibitive in many applications. In an effort to address this shortcoming, new classification algorithms are derived and tested for efficiency and effectiveness. Results of performance comparisons on real-world biological networks are reported.
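
A brute-force illustration of the classification problem on a tiny graph: find every minimum dominating set, then label vertices as required (in every optimum), optional (in some), or excluded (in none). The algorithms in the paper reach the same classification without this exponential enumeration.

```python
from itertools import combinations
import networkx as nx

def is_dominating(G, s):
    # Every vertex is in s or adjacent to a member of s.
    return all(v in s or any(u in s for u in G[v]) for v in G)

def minimum_dominating_sets(G):
    nodes = list(G)
    for k in range(1, len(nodes) + 1):
        optima = [set(c) for c in combinations(nodes, k) if is_dominating(G, set(c))]
        if optima:
            return optima
    return []

G = nx.path_graph(6)                     # toy graph; real inputs are biological networks
optima = minimum_dominating_sets(G)
required = set.intersection(*optima)
optional = set.union(*optima) - required
excluded = set(G) - set.union(*optima)
print("required:", required, "optional:", optional, "excluded:", excluded)
```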


Subjects
Algorithms, Proteins
20.
Algorithms ; 14(2)2021 Feb.
Article in English | MEDLINE | ID: mdl-36092474

ABSTRACT

Recent discoveries of distinct molecular subtypes have led to remarkable advances in treatment for a variety of diseases. While subtyping via unsupervised clustering has received a great deal of interest, most methods rely on basic statistical or machine learning methods. At the same time, techniques based on graph clustering, particularly clique-based strategies, have been successfully used to identify disease biomarkers and gene networks. A graph theoretical approach based on the paraclique algorithm is described that can easily be employed to identify putative disease subtypes and serve as an aid in outlier detection as well. The feasibility and potential effectiveness of this method are demonstrated on publicly available gene co-expression data derived from patient samples covering twelve different disease families.
