Search | BVS Regional Portal

1.

Ten simple rules for organizations to support research data sharing.

Champieux, Robin; Solomonides, Anthony; Conte, Marisa; Rojevsky, Svetlana; Phuong, Jimmy; Dorr, David A; Zampino, Elizabeth; Wilcox, Adam; Carson, Matthew B; Holmes, Kristi.

PLoS Comput Biol ; 19(6): e1011136, 2023 Jun.

Article in English | MEDLINE | ID: mdl-37319166

Subjects

Information Dissemination

2.

Bridging the gap: A library-based collaboration to enhance data skills for clinical researchers.

Carson, Matthew B; Gonzales, Sara; Shaw, Pamela; Schneider, Daniel; Holmes, Kristi.

Learn Health Syst ; 7(2): e10339, 2023 Apr.

Article in English | MEDLINE | ID: mdl-37066097

ABSTRACT

Introduction: Enterprise data warehouses (EDWs) serve as foundational infrastructure in a modern learning health system, housing clinical and other system-wide data and making it available for research, strategic, and quality improvement purposes. Building on a longstanding partnership between Northwestern University's Galter Health Sciences Library and the Northwestern Medicine Enterprise Data Warehouse (NMEDW), an end-to-end clinical research data management (cRDM) program was created to enhance clinical data workforce capacity and further expand related library-based services for the campus. Methods: The training program covers topics such as clinical database architecture, clinical coding standards, and translation of research questions into queries for proper data extraction. Here we describe this program, including partners and motivations, technical and social components, integration of FAIR principles into clinical data research workflows, and the long-term implications for this work to serve as a blueprint of best practice workflows for clinical research to support library and EDW partnerships at other institutions. Results: This training program has enhanced the partnership between our institution's health sciences library and clinical data warehouse to provide support services for researchers, resulting in more efficient training workflows. Through instruction on best practices for preserving and sharing outputs, researchers are given the tools to improve the reproducibility and reusability of their work, which has positive effects for the researchers as well as for the university. All training resources have been made publicly available so that those who support this critical need at other institutions can build on our efforts. Conclusions: Library-based partnerships to support training and consultation offer an important vehicle for clinical data science capacity building in learning health systems. 
The cRDM program launched by Galter Library and the NMEDW is an example of this type of partnership and builds on a strong foundation of past collaboration, expanding the scope of clinical data support services and training on campus.

3.

Ten simple rules for maximizing the recommendations of the NIH data management and sharing plan.

Gonzales, Sara; Carson, Matthew B; Holmes, Kristi.

PLoS Comput Biol ; 18(8): e1010397, 2022 Aug.

Article in English | MEDLINE | ID: mdl-35921268

ABSTRACT

The National Institutes of Health (NIH) Policy for Data Management and Sharing (DMS Policy) recognizes the NIH's role as a key steward of United States biomedical research and information and seeks to enhance that stewardship through systematic recommendations for the preservation and sharing of research data generated by funded projects. The policy is effective as of January 2023. The recommendations include a requirement for the submission of a Data Management and Sharing Plan (DMSP) with funding applications, and while no strict template was provided, the NIH has released supplemental draft guidance on elements to consider when developing a plan. This article provides 10 key recommendations for creating a DMSP that is both maximally compliant and effective.

Subjects

Biomedical Research , Data Management , National Institutes of Health (U.S.) , United States

4.

Risk Adjusting Health Care Provider Collaboration Networks.

Chandler, Ariel E; Mutharasan, R Kannan; Amelia, Lia; Carson, Matthew B; Scholtens, Denise M; Soulakis, Nicholas D.

Methods Inf Med ; 58(2-03): 71-78, 2019 Sep.

Article in English | MEDLINE | ID: mdl-31514208

ABSTRACT

OBJECTIVES: The quality of hospital discharge care and patient factors (health and sociodemographic) impact the rates of unplanned readmissions. This study aims to measure the effects of controlling for the patient factors when using readmission rates to quantify the weighted edges between health care providers in a collaboration network. This improved understanding may inform strategies to reduce hospital readmissions, and facilitate quality-improvement initiatives. METHODS: We extracted 4 years of patient, provider, and activity data related to cardiology discharge workflow. A Weibull model was developed to predict the risk of unplanned 30-day readmission. A provider-patient bipartite network was used to connect providers by shared patient encounters. We built collaboration networks and calculated the Shared Positive Outcome Ratio (SPOR) to quantify the relationship between providers by the relative rate of patient outcomes, using both risk-adjusted readmission rates and unadjusted readmission rates. The effect of risk adjustment on the calculation of the SPOR metric was quantified using a permutation test and descriptive statistics. RESULTS: Comparing the collaboration networks consisting of 2,359 provider pairs, we found that SPOR values with risk-adjusted outcomes are significantly different than unadjusted readmission as an outcome measure (p-value = 0.025). The two networks classified the same provider pairs as high-scoring 51.5% of the time, and the same low scoring provider pairs 85.6% of the time. The observed differences in patient demographics and disease characteristics between high-scoring and low-scoring provider pairs were reduced by applying the risk-adjusted model. The risk-adjusted model reduced the average variation across each individual's SPOR scored provider connections. CONCLUSIONS: Risk adjusting unplanned readmission in a collaboration network has an effect on SPOR-weighted edges, especially on classifying high-scoring SPOR provider pairs. 
The risk-adjusted model reduces the variance of providers' connections and balances shared patient characteristics between low- and high-scoring provider pairs. This indicates that the risk-adjusted SPOR edges better measure the impact of collaboration on readmissions by accounting for patients' risk of readmission.
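The step of connecting providers through shared patient encounters can be sketched as a projection of a provider-patient bipartite network. This is only a minimal illustration under stated assumptions: the function name `project_providers`, the encounter IDs, and the provider IDs are hypothetical, and the sketch counts shared encounters without attempting to reproduce the SPOR calculation or the risk adjustment, neither of which the abstract specifies in detail.

```python
from collections import defaultdict
from itertools import combinations


def project_providers(encounters):
    """Connect providers who share at least one patient encounter.

    `encounters` maps an encounter ID to the set of provider IDs involved.
    Returns a dict mapping each provider pair (a sorted tuple) to the number
    of encounters the two providers shared.
    """
    shared = defaultdict(int)
    for providers in encounters.values():
        # Every pair of providers on the same encounter gains one shared count.
        for a, b in combinations(sorted(providers), 2):
            shared[(a, b)] += 1
    return dict(shared)


# Hypothetical toy data: three encounters, four providers.
encounters = {
    "enc1": {"dr_a", "rn_b", "pharm_c"},
    "enc2": {"dr_a", "rn_b"},
    "enc3": {"rn_b", "dr_d"},
}
edges = project_providers(encounters)
```

In a fuller analysis, each edge would carry an outcome-based weight rather than a raw count; the count is used here only to keep the projection step visible.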

Subjects

Cooperative Behavior , Health Personnel , Humans , Outcome Assessment, Health Care , Patient Readmission , Risk Factors

5.

Natural Language Processing for EHR-Based Pharmacovigilance: A Structured Review.

Luo, Yuan; Thompson, William K; Herr, Timothy M; Zeng, Zexian; Berendsen, Mark A; Jonnalagadda, Siddhartha R; Carson, Matthew B; Starren, Justin.

Drug Saf ; 40(11): 1075-1089, 2017 Nov.

Article in English | MEDLINE | ID: mdl-28643174

ABSTRACT

The goal of pharmacovigilance is to detect, monitor, characterize and prevent adverse drug events (ADEs) with pharmaceutical products. This article is a comprehensive structured review of recent advances in applying natural language processing (NLP) to electronic health record (EHR) narratives for pharmacovigilance. We review methods of varying complexity and problem focus, summarize the current state-of-the-art in methodology advancement, discuss limitations and point out several promising future directions. The ability to accurately capture both semantic and syntactic structures in clinical narratives becomes increasingly critical to enable efficient and accurate ADE detection. Significant progress has been made in algorithm development and resource construction since 2000. Since 2012, statistical analysis and machine learning methods have gained traction in automation of ADE mining from EHR narratives. Current state-of-the-art methods for NLP-based ADE detection from EHRs show promise regarding their integration into production pharmacovigilance systems. In addition, integrating multifaceted, heterogeneous data sources has shown promise in improving ADE detection and has become increasingly adopted. On the other hand, challenges and opportunities remain across the frontier of NLP application to EHR-based pharmacovigilance, including proper characterization of ADE context, differentiation between off- and on-label drug-use ADEs, recognition of the importance of polypharmacy-induced ADEs, better integration of heterogeneous data sources, creation of shared corpora, and organization of shared-task challenges to advance the state-of-the-art.

Subjects

Adverse Drug Reaction Reporting Systems/standards , Drug-Related Side Effects and Adverse Reactions/diagnosis , Electronic Health Records/standards , Natural Language Processing , Pharmacovigilance , Humans

6.

Node Attribute-enhanced Community Detection in Complex Networks.

Jia, Caiyan; Li, Yafang; Carson, Matthew B; Wang, Xiaoyang; Yu, Jian.

Sci Rep ; 7(1): 2626, 2017 May 25.

Article in English | MEDLINE | ID: mdl-28572625

ABSTRACT

Community detection involves grouping the nodes of a network such that nodes in the same community are more densely connected to each other than to the rest of the network. Previous studies have focused mainly on identifying communities in networks using node connectivity. However, each node in a network may be associated with many attributes. Identifying communities in networks combining node attributes has become increasingly popular in recent years. Most existing methods operate on networks with attributes of binary, categorical, or numerical type only. In this study, we introduce kNN-enhance, a simple and flexible community detection approach that uses node attribute enhancement. This approach adds the k Nearest Neighbor (kNN) graph of node attributes to alleviate the sparsity and the noise effect of an original network, thereby strengthening the community structure in the network. We use two testing algorithms, kNN-nearest and kNN-Kmeans, to partition the newly generated, attribute-enhanced graph. Our analyses of synthetic and real world networks have shown that the proposed algorithms achieve better performance compared to existing state-of-the-art algorithms. Further, the algorithms are able to deal with networks containing different combinations of binary, categorical, or numerical attributes and could be easily extended to the analysis of massive networks.
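The attribute-enhancement idea, adding a kNN graph built from node attributes to the original topology, might be sketched as follows. This is not the authors' implementation: the function names `knn_edges` and `enhance`, the Euclidean distance, the unweighted edge union, and the toy data are all assumptions made for illustration, and the sketch handles numerical attributes only.

```python
import math


def knn_edges(attributes, k):
    """Return undirected edges linking each node to its k nearest neighbours
    in attribute space (Euclidean distance; ties broken by node ID)."""
    edges = set()
    for node, vec in attributes.items():
        others = sorted(
            (math.dist(vec, other_vec), other)
            for other, other_vec in attributes.items()
            if other != node
        )
        for _, neighbour in others[:k]:
            edges.add(frozenset((node, neighbour)))
    return edges


def enhance(original_edges, attributes, k):
    """Union the original topology with the kNN graph of node attributes."""
    return {frozenset(e) for e in original_edges} | knn_edges(attributes, k)


# Hypothetical toy network: a sparse topology plus 2-D numerical attributes.
original = [("a", "b"), ("c", "d")]
attrs = {
    "a": (0.0, 0.0),
    "b": (0.1, 0.0),
    "c": (5.0, 5.0),
    "d": (5.0, 5.2),
    "e": (0.0, 0.2),
}
enhanced = enhance(original, attrs, k=1)
```

Here node "e" has no edge in the original topology, but its attributes place it next to "a", so the enhanced graph links the two. Any partitioning algorithm could then be run on `enhanced`.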

7.

A disease similarity matrix based on the uniqueness of shared genes.

Carson, Matthew B; Liu, Cong; Lu, Yao; Jia, Caiyan; Lu, Hui.

BMC Med Genomics ; 10(Suppl 1): 26, 2017 May 24.

Article in English | MEDLINE | ID: mdl-28589854

ABSTRACT

BACKGROUND: Complex diseases involve many genes, and these genes are often associated with several different illnesses. Disease similarity measurement can be based on shared genotype or phenotype. Quantifying relationships between genes can reveal previously unknown connections and form a reference base for therapy development and drug repurposing. METHODS: Here we introduce a method to measure disease similarity that incorporates the uniqueness of shared genes. For each disease pair, we calculated the uniqueness score and constructed disease similarity matrices using OMIM and Disease Ontology annotation. RESULTS: Using the Disease Ontology-based matrix, we identified several interesting connections between cancer and other diseases and conditions such as malaria, along with studies to support our findings. We also found several high-scoring pairwise relationships for which there was little or no literature support, highlighting potentially interesting connections warranting additional study. CONCLUSIONS: We developed a co-occurrence matrix based on gene uniqueness to examine the relationships between diseases from OMIM and DORIF data. Our similarity matrix can be used to identify potential disease relationships and to motivate further studies investigating the causal mechanisms in diseases.

Subjects

Computational Biology/methods , Disease/genetics , Gene Ontology , Databases, Genetic , Molecular Sequence Annotation

8.

Leveraging electronic health record documentation for Failure Mode and Effects Analysis team identification.

Kricke, Gayle Shier; Carson, Matthew B; Lee, Young Ji; Benacka, Corrine; Mutharasan, R Kannan; Ahmad, Faraz S; Kansal, Preeti; Yancy, Clyde W; Anderson, Allen S; Soulakis, Nicholas D.

J Am Med Inform Assoc ; 24(2): 288-294, 2017 Mar 01.

Article in English | MEDLINE | ID: mdl-27589944

ABSTRACT

OBJECTIVE: Using Failure Mode and Effects Analysis (FMEA) as an example quality improvement approach, our objective was to evaluate whether secondary use of orders, forms, and notes recorded by the electronic health record (EHR) during daily practice can enhance the accuracy of process maps used to guide improvement. We examined discrepancies between expected and observed activities and individuals involved in a high-risk process and devised diagnostic measures for understanding discrepancies that may be used to inform quality improvement planning. METHODS: Inpatient cardiology unit staff developed a process map of discharge from the unit. We matched activities and providers identified on the process map to EHR data. Using four diagnostic measures, we analyzed discrepancies between expectation and observation. RESULTS: EHR data showed that 35% of activities were completed by unexpected providers, including providers from 12 categories not identified as part of the discharge workflow. The EHR also revealed sub-components of process activities not identified on the process map. Additional information from the EHR was used to revise the process map and show differences between expectation and observation. CONCLUSION: Findings suggest EHR data may reveal gaps in process maps used for quality improvement and identify characteristics about workflow activities that can identify perspectives for inclusion in an FMEA. Organizations with access to EHR data may be able to leverage clinical documentation to enhance process maps used for quality improvement. While focused on FMEA protocols, findings from this study may be applicable to other quality activities that require process maps.

Subjects

Cardiology Service, Hospital/organization & administration , Electronic Health Records , Healthcare Failure Mode and Effect Analysis , Quality Improvement , Documentation/methods , Humans , Patient Discharge

9.

An Outcome-Weighted Network Model for Characterizing Collaboration.

Carson, Matthew B; Scholtens, Denise M; Frailey, Conor N; Gravenor, Stephanie J; Kricke, Gayle E; Soulakis, Nicholas D.

PLoS One ; 11(10): e0163861, 2016.

Article in English | MEDLINE | ID: mdl-27706199

ABSTRACT

Shared patient encounters form the basis of collaborative relationships, which are crucial to the success of complex and interdisciplinary teamwork in healthcare. Quantifying the strength of these relationships using shared risk-adjusted patient outcomes provides insight into interactions that occur between healthcare providers. We developed the Shared Positive Outcome Ratio (SPOR), a novel parameter that quantifies the concentration of positive outcomes between a pair of healthcare providers over a set of shared patient encounters. We constructed a collaboration network using hospital emergency department patient data from electronic health records (EHRs) over a three-year period. Based on an outcome indicating patient satisfaction, we used this network to assess pairwise collaboration and evaluate the SPOR. By comparing this network of 574 providers and 5,615 relationships to a set of networks based on randomized outcomes, we identified 295 (5.2%) pairwise collaborations having significantly higher patient satisfaction rates. Our results show extreme high- and low-scoring relationships over a set of shared patient encounters and quantify high variability in collaboration between providers. We identified 29 top performers in terms of patient satisfaction. Providers in the high-scoring group had both a greater average number of associated encounters and a higher percentage of total encounters with positive outcomes than those in the low-scoring group, implying that more experienced individuals may be able to collaborate more successfully. Our study shows that a healthcare collaboration network can be structurally evaluated to characterize the collaborative interactions that occur between healthcare providers in a hospital setting.

Subjects

Patient Care Team/organization & administration , Patient Satisfaction/statistics & numerical data , Clinical Decision-Making , Cooperative Behavior , Electronic Health Records , Emergency Service, Hospital , Health Personnel , Humans , Models, Theoretical , User-Computer Interface

10.

Characterizing Teamwork in Cardiovascular Care Outcomes: A Network Analytics Approach.

Carson, Matthew B; Scholtens, Denise M; Frailey, Conor N; Gravenor, Stephanie J; Powell, Emilie S; Wang, Amy Y; Kricke, Gayle Shier; Ahmad, Faraz S; Mutharasan, R Kannan; Soulakis, Nicholas D.

Circ Cardiovasc Qual Outcomes ; 9(6): 670-678, 2016 Nov.

Article in English | MEDLINE | ID: mdl-28051772

ABSTRACT

BACKGROUND: The nature of teamwork in healthcare is complex and interdisciplinary, and provider collaboration based on shared patient encounters is crucial to its success. Characterizing the intensity of working relationships with risk-adjusted patient outcomes supplies insight into provider interactions in a hospital environment. METHODS AND RESULTS: We extracted 4 years of patient, provider, and activity data for encounters in an inpatient cardiology unit from Northwestern Medicine's Enterprise Data Warehouse. We then created a provider-patient network to identify healthcare providers who jointly participated in patient encounters and calculated satisfaction rates for provider-provider pairs. We demonstrated the application of a novel parameter, the shared positive outcome ratio, a measure that assesses the strength of a patient-sharing relationship between 2 providers based on risk-adjusted encounter outcomes. We compared an observed collaboration network of 334 providers and 3453 relationships to 1000 networks with shared positive outcome ratio scores based on randomized outcomes and found 188 collaborative relationships between pairs of providers that showed significantly higher than expected patient satisfaction ratings. A group of 22 providers performed exceptionally in terms of patient satisfaction. Our results indicate high variability in collaboration scores across the network and highlight our ability to identify relationships with both higher and lower than expected scores across a set of shared patient encounters. CONCLUSIONS: Satisfaction rates seem to vary across different teams of providers. Team collaboration can be quantified using a composite measure of collaboration across provider pairs. 
Tracking provider pair outcomes over a sufficient set of shared encounters may inform quality improvement strategies such as optimizing team staffing, identifying characteristics and practices of high-performing teams, developing evidence-based team guidelines, and redesigning inpatient care processes.

Subjects

Cardiology Service, Hospital/organization & administration , Cardiovascular Diseases/therapy , Delivery of Health Care, Integrated/organization & administration , Medical Staff, Hospital/organization & administration , Nursing Staff, Hospital/organization & administration , Patient Care Team/organization & administration , Process Assessment, Health Care/organization & administration , Cardiovascular Diseases/diagnosis , Cooperative Behavior , Data Mining/methods , Databases, Factual , Humans , Inpatients , Interdisciplinary Communication , Logistic Models , Patient Satisfaction , Quality Improvement/standards , Quality Indicators, Health Care/organization & administration , Retrospective Studies , Risk Factors , Treatment Outcome

11.

Identification of cancer-related genes and motifs in the human gene regulatory network.

Carson, Matthew B; Gu, Jianlei; Yu, Guangjun; Lu, Hui.

IET Syst Biol ; 9(4): 128-34, 2015 Aug.

Article in English | MEDLINE | ID: mdl-26243828

ABSTRACT

The authors investigated the regulatory network motifs and corresponding motif positions of cancer-related genes. First, they mapped disease-related genes to a transcription factor regulatory network. Next, they calculated statistically significant motifs and subsequently identified positions within these motifs that were enriched in cancer-related genes. Potential mechanisms of these motifs and positions are discussed. These results could be used to identify other disease- and cancer-related genes and could also suggest mechanisms for how these genes relate to co-occurring diseases.

Subjects

Chromosome Mapping/methods , Genes, Neoplasm/genetics , Genetic Predisposition to Disease/genetics , Multigene Family/genetics , Neoplasms/genetics , Transcription Factors/genetics , Base Sequence , Gene Expression Regulation, Neoplastic/genetics , Humans , Molecular Sequence Data , Nucleotide Motifs

12.

Network-based prediction and knowledge mining of disease genes.

Carson, Matthew B; Lu, Hui.

BMC Med Genomics ; 8 Suppl 2: S9, 2015.

Article in English | MEDLINE | ID: mdl-26043920

ABSTRACT

BACKGROUND: In recent years, high-throughput protein interaction identification methods have generated a large amount of data. When combined with the results from other in vivo and in vitro experiments, a complex set of relationships between biological molecules emerges. The growing popularity of network analysis and data mining has allowed researchers to recognize indirect connections between these molecules. Due to the interdependent nature of network entities, evaluating proteins in this context can reveal relationships that may not otherwise be evident. METHODS: We examined the human protein interaction network as it relates to human illness using the Disease Ontology. After calculating several topological metrics, we trained an alternating decision tree (ADTree) classifier to identify disease-associated proteins. Using a bootstrapping method, we created a tree to highlight conserved characteristics shared by many of these proteins. Subsequently, we reviewed a set of non-disease-associated proteins that were misclassified by the algorithm with high confidence and searched for evidence of a disease relationship. RESULTS: Our classifier was able to predict disease-related genes with 79% area under the receiver operating characteristic (ROC) curve (AUC), which indicates the tradeoff between sensitivity and specificity and is a good predictor of how a classifier will perform on future data sets. We found that a combination of several network characteristics including degree centrality, disease neighbor ratio, eccentricity, and neighborhood connectivity help to distinguish between disease- and non-disease-related proteins. Furthermore, the ADTree allowed us to understand which combinations of strongly predictive attributes contributed most to protein-disease classification. In our post-processing evaluation, we found several examples of potential novel disease-related proteins and corresponding literature evidence. 
In addition, we showed that first- and second-order neighbors in the PPI network could be used to identify likely disease associations. CONCLUSIONS: We analyzed the human protein interaction network and its relationship to disease and found that both the number of interactions with other proteins and the disease relationship of neighboring proteins helped to determine whether a protein had a relationship to disease. Our classifier predicted many proteins with no annotated disease association to be disease-related, which indicated that these proteins have network characteristics that are similar to disease-related proteins and may therefore have disease associations not previously identified. By performing a post-processing step after the prediction, we were able to identify evidence in literature supporting this possibility. This method could provide a useful filter for experimentalists searching for new candidate protein targets for drug repositioning and could also be extended to include other network and data types in order to refine these predictions.
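Two of the network characteristics named in the abstract can be sketched on a toy graph. The abstract names the metrics without defining them, so the definitions below are assumptions: degree centrality is taken as degree divided by the number of other nodes, and disease neighbor ratio as the share of a protein's direct neighbours that carry a disease annotation. The protein IDs and the adjacency-dict representation are hypothetical.

```python
def degree_centrality(graph, node):
    """Fraction of the other nodes that `node` interacts with directly."""
    return len(graph[node]) / (len(graph) - 1)


def disease_neighbor_ratio(graph, node, disease_proteins):
    """Share of a node's direct neighbours annotated as disease-associated.

    Assumed definition; the abstract names the metric without defining it.
    """
    neighbours = graph[node]
    if not neighbours:
        return 0.0
    return len(neighbours & disease_proteins) / len(neighbours)


# Hypothetical toy interaction network as an adjacency dict of sets.
ppi = {
    "P1": {"P2", "P3", "P4"},
    "P2": {"P1"},
    "P3": {"P1", "P4"},
    "P4": {"P1", "P3"},
}
disease = {"P2", "P3"}

# One feature row per protein, as might be fed to a classifier.
features = {
    p: (degree_centrality(ppi, p), disease_neighbor_ratio(ppi, p, disease))
    for p in ppi
}
```

Feature rows of this kind could then be passed to any classifier; the study itself used an alternating decision tree, which this sketch does not attempt to reproduce.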

Subjects

Computational Biology/methods , Data Mining , Disease/genetics , Genetic Predisposition to Disease , Protein Interaction Maps , Algorithms , Genetic Association Studies , Humans , ROC Curve

13.

Visualizing collaborative electronic health record usage for hospitalized patients with heart failure.

Soulakis, Nicholas D; Carson, Matthew B; Lee, Young Ji; Schneider, Daniel H; Skeehan, Connor T; Scholtens, Denise M.

J Am Med Inform Assoc ; 22(2): 299-311, 2015 Mar.

Article in English | MEDLINE | ID: mdl-25710558

ABSTRACT

OBJECTIVE: To visualize and describe collaborative electronic health record (EHR) usage for hospitalized patients with heart failure. MATERIALS AND METHODS: We identified records of patients with heart failure and all associated healthcare provider record usage through queries of the Northwestern Medicine Enterprise Data Warehouse. We constructed a network by equating access and updates of a patient's EHR to a provider-patient interaction. We then considered shared patient record access as the basis for a second network that we termed the provider collaboration network. We calculated network statistics, the modularity of provider interactions, and provider cliques. RESULTS: We identified 548 patient records accessed by 5113 healthcare providers in 2012. The provider collaboration network had 1504 nodes and 83 998 edges. We identified 7 major provider collaboration modules. Average clique size was 87.9 providers. We used a graph database to demonstrate an ad hoc query of our provider-patient network. DISCUSSION: Our analysis suggests a large number of healthcare providers across a wide variety of professions access records of patients with heart failure during their hospital stay. This shared record access tends to take place not only in a pairwise manner but also among large groups of providers. CONCLUSION: EHRs encode valuable interactions, implicitly or explicitly, between patients and providers. Network analysis provided strong evidence of multidisciplinary record access of patients with heart failure across teams of 100+ providers. Further investigation may lead to clearer understanding of how record access information can be used to strategically guide care coordination for patients hospitalized for heart failure.

Subjects

Data Display , Electronic Health Records/statistics & numerical data , Heart Failure , Pattern Recognition, Automated , Data Mining , Health Personnel , Hospitalization , Humans , User-Computer Interface

14.

A new exhaustive method and strategy for finding motifs in ChIP-enriched regions.

Jia, Caiyan; Carson, Matthew B; Wang, Yang; Lin, Youfang; Lu, Hui.

PLoS One ; 9(1): e86044, 2014.

Article in English | MEDLINE | ID: mdl-24475069

ABSTRACT

ChIP-seq, which combines chromatin immunoprecipitation (ChIP) with next-generation parallel sequencing, allows for the genome-wide identification of protein-DNA interactions. This technology poses new challenges for the development of novel motif-finding algorithms and methods for determining exact protein-DNA binding sites from ChIP-enriched sequencing data. State-of-the-art heuristic, exhaustive search algorithms have limited application for the identification of short (l, d) motifs (l ≤ 10, d ≤ 2) contained in ChIP-enriched regions. In this work we have developed a more powerful exhaustive method (FMotif) for finding long (l, d) motifs in DNA sequences. In conjunction with our method, we have adopted a simple ChIP-enriched sampling strategy for finding these motifs in large-scale ChIP-enriched regions. Empirical studies on synthetic samples and applications using several ChIP data sets including 16 TF (transcription factor) ChIP-seq data sets and five TF ChIP-exo data sets have demonstrated that our proposed method is capable of finding these motifs with high efficiency and accuracy. The source code for FMotif is available at http://211.71.76.45/FMotif/.

Subjects

Binding Sites , Chromatin Immunoprecipitation , Computational Biology/methods , High-Throughput Nucleotide Sequencing , Nucleotide Motifs , Algorithms , Animals , DNA-Binding Proteins/metabolism , Embryonic Stem Cells , Mice , Position-Specific Scoring Matrices , Sensitivity and Specificity , Transcription Factors/metabolism

15.

A fast weak motif-finding algorithm based on community detection in graphs.

Jia, Caiyan; Carson, Matthew B; Yu, Jian.

BMC Bioinformatics ; 14: 227, 2013 Jul 17.

Article in English | MEDLINE | ID: mdl-23865838

ABSTRACT

BACKGROUND: Identification of transcription factor binding sites (also called 'motif discovery') in DNA sequences is a basic step in understanding genetic regulation. Although many successful programs have been developed, the problem is far from being solved on account of diversity in gene expression/regulation and the low specificity of binding sites. State-of-the-art algorithms have their own constraints (e.g., high time or space complexity for finding long motifs, low precision in identification of weak motifs, or the OOPS constraint: one occurrence of the motif instance per sequence) which limit their scope of application. RESULTS: In this paper, we present a novel and fast algorithm we call TFBSGroup. It is based on community detection from a graph and is used to discover long and weak (l,d) motifs under the ZOMOPS constraint (zero, one or multiple occurrence(s) of the motif instance(s) per sequence), where l is the length of a motif and d is the maximum number of mutations between a motif instance and the motif itself. Firstly, TFBSGroup transforms the (l, d) motif search in sequences to focus on the discovery of dense subgraphs within a graph. It identifies these subgraphs using a fast community detection method for obtaining coarse-grained candidate motifs. Next, it greedily refines these candidate motifs towards the true motif within their own communities. Empirical studies on synthetic (l, d) samples have shown that TFBSGroup is very efficient (e.g., it can find true (18, 6), (24, 8) motifs within 30 seconds). More importantly, the algorithm has succeeded in rapidly identifying motifs in a large data set of prokaryotic promoters generated from the Escherichia coli database RegulonDB. The algorithm has also accurately identified motifs in ChIP-seq data sets for 12 mouse transcription factors involved in ES cell pluripotency and self-renewal. 
CONCLUSIONS: Our novel heuristic algorithm, TFBSGroup, is able to quickly identify nearly exact matches for long and weak (l, d) motifs in DNA sequences under the ZOMOPS constraint. It is also capable of finding motifs in real applications. The source code for TFBSGroup can be obtained from http://bioinformatics.bioengr.uic.edu/TFBSGroup/.
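The (l, d) motif definition given above, where an instance may differ from the motif at up to d of its l positions, can be illustrated with a brute-force Hamming-distance scan. This is a sketch of the definition only, not of TFBSGroup's community-detection approach; the function names, the motif, and the sequences are hypothetical.

```python
def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))


def motif_instances(sequence, motif, d):
    """Return (offset, window) for every window of len(motif) in `sequence`
    that differs from `motif` at no more than d positions."""
    l = len(motif)
    return [
        (i, sequence[i:i + l])
        for i in range(len(sequence) - l + 1)
        if hamming(sequence[i:i + l], motif) <= d
    ]


# Hypothetical example: a (6, 1) motif searched in three short sequences.
motif = "ACGTAC"
sequences = ["TTACGTACGG", "ACGAACTTTT", "GGGGGGGGGG"]
hits = [motif_instances(s, motif, d=1) for s in sequences]
```

The first sequence contains an exact instance, the second an instance with one mutation, and the third none, which is the situation the ZOMOPS constraint allows: zero, one, or multiple occurrences per sequence.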

Subjects

Algorithms , Promoter Regions, Genetic , Sequence Analysis, DNA/methods , Transcription Factors/metabolism , Animals , Binding Sites , DNA/chemistry , Escherichia coli K12/genetics , Escherichia coli K12/metabolism , Mice , Nucleotide Motifs

16.

Analysis of combinatorial regulation: scaling of partnerships between regulators with the number of governed targets.

Bhardwaj, Nitin; Carson, Matthew B; Abyzov, Alexej; Yan, Koon-Kiu; Lu, Hui; Gerstein, Mark B.

PLoS Comput Biol ; 6(5): e1000755, 2010 May 27.

Article in English | MEDLINE | ID: mdl-20523742

ABSTRACT

Through combinatorial regulation, regulators partner with each other to control common targets and this allows a small number of regulators to govern many targets. One interesting question is that given this combinatorial regulation, how does the number of regulators scale with the number of targets? Here, we address this question by building and analyzing co-regulation (co-transcription and co-phosphorylation) networks that describe partnerships between regulators controlling common genes. We carry out analyses across five diverse species: Escherichia coli to human. These reveal many properties of partnership networks, such as the absence of a classical power-law degree distribution despite the existence of nodes with many partners. We also find that the number of co-regulatory partnerships follows an exponential saturation curve in relation to the number of targets. (For E. coli and Bacillus subtilis, only the beginning linear part of this curve is evident due to arrangement of genes into operons.) To gain intuition into the saturation process, we relate the biological regulation to more commonplace social contexts where a small number of individuals can form an intricate web of connections on the internet. Indeed, we find that the size of partnership networks saturates even as the complexity of their output increases. We also present a variety of models to account for the saturation phenomenon. In particular, we develop a simple analytical model to show how new partnerships are acquired with an increasing number of target genes; with certain assumptions, it reproduces the observed saturation. Then, we build a more general simulation of network growth and find agreement with a wide range of real networks. Finally, we perform various down-sampling calculations on the observed data to illustrate the robustness of our conclusions.
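The abstract above does not state the functional form of the saturation curve. As a minimal sketch, assuming the common form P(t) = P_max (1 - exp(-t / tau)), the following Python function shows why the curve looks linear when the number of targets is small and levels off when it is large. The names `partnerships`, `p_max`, and `tau` are hypothetical and the parameter values are not taken from the paper.

```python
import math


def partnerships(targets: float, p_max: float, tau: float) -> float:
    """Assumed exponential saturation: rises from 0 towards p_max as targets grow."""
    return p_max * (1.0 - math.exp(-targets / tau))


# Illustrative values only: for targets much smaller than tau the curve is
# approximately linear (p_max * targets / tau); for targets much larger
# than tau it approaches p_max.
small = partnerships(1, p_max=100.0, tau=1000.0)
large = partnerships(1_000_000, p_max=100.0, tau=50.0)
```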

Subjects

Computational Biology/methods , Gene Regulatory Networks , Models, Genetic , Models, Statistical , Transcription, Genetic , Animals , Computer Simulation , Escherichia coli/genetics , Gene Expression Regulation , Humans , Mice , Operon , Phosphorylation , Rats , Yeasts/genetics

17.

NAPS: a residue-level nucleic acid-binding prediction server.

Carson, Matthew B; Langlois, Robert; Lu, Hui.

Nucleic Acids Res ; 38(Web Server issue): W431-5, 2010 Jul.

Article in English | MEDLINE | ID: mdl-20478832

ABSTRACT

Nucleic acid-binding proteins are involved in a great number of cellular processes. Understanding the mechanisms underlying these proteins first requires the identification of specific residues involved in nucleic acid binding. Prediction of NA-binding residues can provide practical assistance in the functional annotation of NA-binding proteins. Predictions can also be used to expedite mutagenesis experiments, guiding researchers to the correct binding residues in these proteins. Here, we present a method for the identification of amino acid residues involved in DNA- and RNA-binding using sequence-based attributes. The method used in this work combines the C4.5 algorithm with bootstrap aggregation and cost-sensitive learning. Our DNA-binding model achieved 79.1% accuracy, while the RNA-binding model reached an accuracy of 73.2%. The NAPS web server is freely available at http://proteomics.bioengr.uic.edu/NAPS.

Subjects

DNA-Binding Proteins/chemistry , RNA-Binding Proteins/chemistry , Software , Algorithms , Binding Sites , Internet , Reproducibility of Results , Sequence Analysis, Protein

18.

Mining knowledge for the methylation status of CpG islands using alternating decision trees.

Carson, Matthew B; Langlois, Robert; Lu, Hui.

Annu Int Conf IEEE Eng Med Biol Soc ; 2008: 3787-90, 2008.

Article in English | MEDLINE | ID: mdl-19163536

ABSTRACT

CpG island (CpGI) methylation is an epigenetic modification that occurs in eukaryotes and is based on the addition of a methyl group to the number 5 carbon of the pyrimidine ring of cytosine. When methylation of a CpGI occurs, the associated gene (if any) is not expressed [1]. Aberrant methylation is thought to be a causative agent in disease [2] and drug sensitivity [3], [4]. In this work, we have predicted the methylation status of CpGIs in human chromosome 21 using sequence patterns. These patterns showed a significantly different distribution between methylated and unmethylated islands in a previous work [5]. Using C4.5 with bagging and cost-sensitive learning, we achieved 85.6% accuracy, 82.8% sensitivity, and 86.4% specificity. We then constructed 1000 alternating decision trees using a bootstrapping method and analyzed the nodes that were conserved between the trees. This allowed us to find specific combinations of sequence patterns that distinguished between methylated and unmethylated CpGIs. Analysis of these characteristics offers certain insight into the conditions that permit or prevent methylation.
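The three performance figures reported in the abstract above follow from a binary confusion matrix. The sketch below shows the standard definitions, treating "methylated" as the positive class (an assumption); the function name and the counts in the usage line are hypothetical and are not the paper's data.

```python
def classification_metrics(tp: int, fp: int, tn: int, fn: int) -> dict[str, float]:
    """Accuracy, sensitivity, and specificity from confusion-matrix counts."""
    if tp + fn == 0 or tn + fp == 0:
        raise ValueError("need at least one positive and one negative example")
    return {
        "accuracy": (tp + tn) / (tp + fp + tn + fn),
        "sensitivity": tp / (tp + fn),   # fraction of true positives recovered
        "specificity": tn / (tn + fp),   # fraction of true negatives recovered
    }


# Hypothetical counts for illustration only.
metrics = classification_metrics(tp=8, fp=5, tn=45, fn=2)
```

When the classes are imbalanced, accuracy is dominated by the larger class, which is one reason sensitivity and specificity are reported alongside it.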

Subjects

Chromosomes, Human, Pair 21 , CpG Islands/genetics , Algorithms , DNA Methylation , Decision Support Techniques , Gene Expression , Gene Silencing , Genome, Human , Humans , Models, Statistical , Models, Theoretical , Promoter Regions, Genetic , Reproducibility of Results

19.

Prediction of specific protein-DNA recognition by knowledge-based two-body and three-body interaction potentials.

Zhao, Guijun; Carson, Matthew B; Lu, Hui.

Annu Int Conf IEEE Eng Med Biol Soc ; 2007: 5017-20, 2007.

Article in English | MEDLINE | ID: mdl-18003133

ABSTRACT

Gene regulation requires specific protein-DNA interactions. Detecting the short and variable DNA sequences in gene promoter regions to which transcription factors (TF) bind is a difficult challenge in bioinformatics. Here we have developed two-body and three-body interaction potentials that are able to assess protein-DNA interaction and achieve a higher level of specificity in the recognition of TF-binding sites. The potentials were calculated using experimentally characterized 3-D structures of protein-DNA complexes. We implemented two approaches in order to evaluate the potentials. Using the first method, we calculated the Z-score of the potential energy of a true TF-binding sequence when compared to 50,000 randomly generated DNA sequences. The second method allowed us to take advantage of the ability of statistical potentials to recognize novel TF-binding sites within the promoter region of genes. We found that the three-body potential, which takes into account the interaction between a DNA base and a protein residue with regard to the effect of a neighboring DNA base, had a better average Z-score than that of the two-body potential. This neighbor effect suggests that the local conformation of DNA does play a critical role in specific residue-base recognition. In all cases, the potentials developed here outperformed published results. The two sets of potentials were tested further by applying them in genome-scale TF-binding site prediction for the CRP protein in E. coli. Out of the 142 cases, 28% of the true binding sites ranked first (i.e.

Subjects

DNA/chemistry , Proteins/chemistry , Binding Sites , DNA/genetics , DNA Replication , Protein Binding , Proteins/genetics , Reproducibility of Results
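The first evaluation method in the abstract for this entry scores a true TF-binding sequence against randomly generated DNA sequences using a Z-score. A minimal sketch of that calculation follows. The function `toy_energy` is a hypothetical stand-in for the knowledge-based potentials, which are not reproduced here; the consensus string, the background size of 1000 (the paper uses 50,000), and the use of the population standard deviation are all assumptions.

```python
import random
import statistics


def z_score(true_energy: float, background: list[float]) -> float:
    """Standardize the true site's energy against a random background."""
    mean = statistics.fmean(background)
    sd = statistics.pstdev(background)  # population SD; an assumption
    if sd == 0:
        raise ValueError("background energies have zero variance")
    return (true_energy - mean) / sd


def toy_energy(seq: str, consensus: str = "TGTGA") -> float:
    """Hypothetical placeholder potential: minus the number of consensus matches."""
    return -float(sum(a == b for a, b in zip(seq, consensus)))


rng = random.Random(0)  # seeded so the background is reproducible
background = [
    toy_energy("".join(rng.choice("ACGT") for _ in range(5)))
    for _ in range(1000)
]
print(z_score(toy_energy("TGTGA"), background))
```

Under the convention that lower energy is more favorable, a more negative Z-score means the true site stands out more clearly from random sequences.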

20.

Learning to translate sequence and structure to function: identifying DNA binding and membrane binding proteins.

Langlois, Robert E; Carson, Matthew B; Bhardwaj, Nitin; Lu, Hui.

Ann Biomed Eng ; 35(6): 1043-52, 2007 Jun.

Article in English | MEDLINE | ID: mdl-17436108

ABSTRACT

A protein's function depends in a large part on interactions with other molecules. With an increasing number of protein structures becoming available every year, a corresponding structural annotation approach identifying such interactions grows more expedient. At the same time, machine learning has gained popularity in bioinformatics providing robust annotation of genes and proteins without sequence homology. Here we have developed a general machine learning protocol to identify proteins that bind DNA and membrane. In general, there is no theory or even rule of thumb to pick the best machine learning algorithm. Thus, a systematic comparison of several classification algorithms known to perform well is investigated. Indeed, the boosted tree classifier is found to give the best performance, achieving 93% and 88% accuracy to discriminate non-homologous proteins that bind membrane and DNA, respectively, significantly outperforming all previously published works. We also attempted to address the importance of the attributes in function prediction and the relationships between relevant attributes. A graphical model based on boosted trees is applied to study the important features in discriminating DNA-binding proteins. In summary, the current protocol identified physical features important in DNA and membrane binding, rather than annotating function through sequence similarity.

Subjects

Algorithms , Artificial Intelligence , DNA-Binding Proteins/chemistry , Membrane Proteins/chemistry , Models, Chemical , Pattern Recognition, Automated/methods , Sequence Analysis, Protein/methods , Amino Acid Sequence , Computer Simulation , DNA-Binding Proteins/classification , DNA-Binding Proteins/ultrastructure , Membrane Proteins/classification , Membrane Proteins/ultrastructure , Models, Molecular , Molecular Sequence Data , Sequence Alignment/methods , Sequence Homology, Amino Acid , Structure-Activity Relationship
