1.
BMC Med Inform Decis Mak ; 24(1): 205, 2024 Jul 24.
Article in English | MEDLINE | ID: mdl-39049015

ABSTRACT

BACKGROUND: Biomedical Relation Extraction (RE) is essential for uncovering complex relationships between biomedical entities within text. However, training RE classifiers is challenging in low-resource biomedical applications with few labeled examples. METHODS: We explore the potential of Shortest Dependency Paths (SDPs) to aid biomedical RE, especially in situations with limited labeled examples. In this study, we suggest various approaches to employ SDPs when creating word and sentence representations under supervised, semi-supervised, and in-context-learning settings. RESULTS: Through experiments on three benchmark biomedical text datasets, we find that incorporating SDP-based representations enhances the performance of RE classifiers. The improvement is especially notable when working with small amounts of labeled data. CONCLUSION: SDPs offer valuable insights into the complex sentence structure found in many biomedical text passages. Our study introduces several straightforward techniques that, as demonstrated experimentally, effectively enhance the accuracy of RE classifiers.
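As a rough illustration of what a Shortest Dependency Path is, the sketch below finds the path between two entity tokens with a breadth-first search over an undirected dependency graph. The function name, the example sentence, and its hand-written dependency edges are assumptions made for illustration; they are not taken from the paper, which does not specify its parser or path-extraction code here.

```python
from collections import deque

def shortest_dependency_path(edges, source, target):
    """Return the token path between two words in a dependency tree, or None."""
    # Treat every (head, dependent) edge as undirected.
    graph = {}
    for head, dep in edges:
        graph.setdefault(head, set()).add(dep)
        graph.setdefault(dep, set()).add(head)

    previous = {source: None}  # doubles as the visited set
    queue = deque([source])
    while queue:
        node = queue.popleft()
        if node == target:
            path = []
            while node is not None:
                path.append(node)
                node = previous[node]
            return path[::-1]
        for neighbor in graph.get(node, ()):
            if neighbor not in previous:
                previous[neighbor] = node
                queue.append(neighbor)
    return None

# Hypothetical parse of "Aspirin strongly inhibits the COX1 enzyme".
edges = [
    ("inhibits", "Aspirin"),
    ("inhibits", "strongly"),
    ("inhibits", "enzyme"),
    ("enzyme", "the"),
    ("enzyme", "COX1"),
]
path = shortest_dependency_path(edges, "Aspirin", "COX1")
```

The path keeps the verb that links the two entities and drops modifiers such as "strongly" and "the", which is the intuition behind using SDPs as a compact input for a relation classifier.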


Subject(s)
Data Mining , Natural Language Processing , Humans , Data Mining/methods , Machine Learning
2.
BMC Med Inform Decis Mak ; 22(1): 114, 2022 04 29.
Article in English | MEDLINE | ID: mdl-35488252

ABSTRACT

BACKGROUND: Health providers create Electronic Health Records (EHRs) to describe the conditions and procedures used to treat their patients. Medical notes entered by medical staff in the form of free text are a particularly insightful component of EHRs. There is a great interest in applying machine learning tools on medical notes in numerous medical informatics applications. Learning vector representations, or embeddings, of terms in the notes, is an important pre-processing step in such applications. However, learning good embeddings is challenging because medical notes are rich in specialized terminology, and the number of available EHRs in practical applications is often very small. METHODS: In this paper, we propose a novel algorithm to learn embeddings of medical terms from a limited set of medical notes. The algorithm, called definition2vec, exploits external information in the form of medical term definitions. It is an extension of a skip-gram algorithm that incorporates textual definitions of medical terms provided by the Unified Medical Language System (UMLS) Metathesaurus. RESULTS: To evaluate the proposed approach, we used a publicly available Medical Information Mart for Intensive Care (MIMIC-III) EHR data set. We performed quantitative and qualitative experiments to measure the usefulness of the learned embeddings. The experimental results show that definition2vec keeps the semantically similar medical terms together in the embedding vector space even when they are rare or unobserved in the corpus. We also demonstrate that learned vector embeddings are helpful in downstream medical informatics applications. CONCLUSION: This paper shows that medical term definitions can be helpful when learning embeddings of rare or previously unseen medical terms from a small corpus of specialized documents such as medical notes.


Subject(s)
Electronic Health Records , Unified Medical Language System , Algorithms , Humans , Machine Learning
3.
Cancer Causes Control ; 32(9): 989-999, 2021 Sep.
Article in English | MEDLINE | ID: mdl-34117957

ABSTRACT

PURPOSE: Cutaneous T-cell lymphoma (CTCL) is a rare type of non-Hodgkin lymphoma. Previous studies have reported geographic clustering of CTCL based on the residence at the time of diagnosis. We explore geographic clustering of CTCL using both the residence at the time of diagnosis and past residences using data from the New Jersey State Cancer Registry. METHODS: CTCL cases (n = 1,163) diagnosed between 2006 and 2014 were matched to colon cancer controls (n = 17,049) on sex, age, race/ethnicity, and birth year. Jacquez's Q-Statistic was used to identify temporal clustering of cases compared to controls. Geographic clustering was assessed using the Bernoulli-based scan-statistic to compare cases to controls, and the Poisson-based scan-statistic to compare the observed number of cases to the number expected based on the general population. Significant clusters (p < 0.05) were mapped, and standard incidence ratios (SIR) reported. We adjusted for diagnosis year, sex, and age. RESULTS: The Q-statistic identified significant temporal clustering of cases based on past residences in the study area from 1992 to 2002. A cluster was detected in 1992 in Bergen County in northern New Jersey based on the Bernoulli (1992 SIR 1.84) and Poisson (1992 SIR 1.86) scan-statistics. Using the Poisson scan-statistic with the diagnosis location, we found evidence of an elevated risk in this same area, but the results were not statistically significant. CONCLUSION: There is evidence of geographic clustering of CTCL cases in New Jersey based on past residences. Additional studies are necessary to understand the possible reasons for the excess of CTCL cases living in this specific area some 8-14 years prior to diagnosis.


Subject(s)
Lymphoma, T-Cell, Cutaneous , Skin Neoplasms , Cluster Analysis , Humans , Incidence , Lymphoma, T-Cell, Cutaneous/diagnosis , Lymphoma, T-Cell, Cutaneous/epidemiology , New Jersey/epidemiology , Skin Neoplasms/epidemiology
4.
Biometrics ; 77(3): 1089-1100, 2021 09.
Article in English | MEDLINE | ID: mdl-32700317

ABSTRACT

The pointwise mutual information statistic (PMI), which measures how often two words occur together in a document corpus, is a cornerstone of recently proposed popular natural language processing algorithms such as word2vec. PMI and word2vec reveal semantic relationships between words and can be helpful in a range of applications such as document indexing, topic analysis, or document categorization. We use probability theory to demonstrate the relationship between PMI and word2vec. We use the theoretical results to demonstrate how the PMI can be modeled and estimated in a simple and straightforward manner. We further describe how one can obtain standard error estimates that account for within-patient clustering that arises from patterns of repeated words within a patient's health record due to a unique health history. We then demonstrate the usefulness of PMI on the problem of predictive identification of disease from free text notes of electronic health records. Specifically, we use our methods to distinguish those with and without type 2 diabetes mellitus in electronic health record free text data using over 400 000 clinical notes from an academic medical center.
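For readers unfamiliar with PMI, the following is a minimal sketch of the standard definition, PMI(a, b) = log[p(a, b) / (p(a) p(b))], estimated from document-level co-occurrence counts. Counting co-occurrence per document (rather than per sliding window), the function name `pmi_table`, and the toy notes are all assumptions for illustration; the sketch does not reproduce the paper's estimation method or its clustered standard errors.

```python
import math
from collections import Counter
from itertools import combinations

def pmi_table(documents):
    """Document-level PMI for every word pair that co-occurs at least once."""
    n_docs = len(documents)
    word_counts = Counter()
    pair_counts = Counter()
    for doc in documents:
        words = sorted(set(doc))  # count each word once per document
        word_counts.update(words)
        pair_counts.update(combinations(words, 2))  # keys are alphabetically ordered pairs

    pmi = {}
    for (a, b), n_ab in pair_counts.items():
        p_ab = n_ab / n_docs
        p_a = word_counts[a] / n_docs
        p_b = word_counts[b] / n_docs
        pmi[(a, b)] = math.log(p_ab / (p_a * p_b))
    return pmi

# Toy "clinical notes", already tokenized (made-up data).
notes = [
    ["insulin", "glucose", "diabetes"],
    ["insulin", "diabetes"],
    ["fracture", "cast"],
    ["glucose", "fracture"],
]
scores = pmi_table(notes)
```

A positive score means the pair appears together more often than independence would predict; a score of zero means the pair co-occurs exactly as often as chance.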


Subject(s)
Diabetes Mellitus, Type 2 , Natural Language Processing , Algorithms , Electronic Health Records , Humans
5.
Epidemiology ; 31(5): 728-735, 2020 09.
Article in English | MEDLINE | ID: mdl-32459665

ABSTRACT

BACKGROUND: Residential histories linked to cancer registry data provide new opportunities to examine cancer outcomes by neighborhood socioeconomic status (SES). We examined differences in regional stage colon cancer survival estimates comparing models using a single neighborhood SES at diagnosis to models using neighborhood SES from residential histories. METHODS: We linked regional stage colon cancers from the New Jersey State Cancer Registry diagnosed from 2006 to 2011 to LexisNexis administrative data to obtain residential histories. We defined neighborhood SES as census tract poverty based on location at diagnosis and across the follow-up period through 31 December 2016 based on residential histories (average, time-weighted average, time-varying). Using Cox proportional hazards regression, we estimated associations between colon cancer and census tract poverty measurements (continuous and categorical), adjusted for age, sex, race/ethnicity, regional substage, and mover status. RESULTS: Sixty-five percent of the sample was nonmovers (one census tract); 35% (movers) changed tract at least once. Cases from tracts with >20% poverty changed residential tracts more often (42%) than cases from tracts with <5% poverty (32%). Hazard ratios (HRs) were generally similar in strength and direction across census tract poverty measurements. In time-varying models, cases in the highest poverty category (>20%) had a 30% higher risk of regional stage colon cancer death than cases in the lowest category (<5%) (95% confidence interval [CI] = 1.04, 1.63). CONCLUSION: Residential changes after regional stage colon cancer diagnosis may be associated with a higher risk of colon cancer death among cases in high-poverty areas. This has important implications for postdiagnostic access to care for treatment and follow-up surveillance. See video abstract: http://links.lww.com/EDE/B705.


Subject(s)
Colonic Neoplasms , Health Status Disparities , Poverty Areas , Residence Characteristics , Colonic Neoplasms/epidemiology , Humans , New Jersey/epidemiology , Residence Characteristics/statistics & numerical data , Socioeconomic Factors , Survival Analysis
6.
PLoS Comput Biol ; 15(3): e1006844, 2019 03.
Article in English | MEDLINE | ID: mdl-30845191

ABSTRACT

Protein loops connect regular secondary structures and contain 4-residue beta turns which represent 63% of the residues in loops. The commonly used classification of beta turns (Type I, I', II, II', VIa1, VIa2, VIb, and VIII) was developed in the 1970s and 1980s from analysis of a small number of proteins of average resolution, and represents only two thirds of beta turns observed in proteins (with a generic class Type IV representing the rest). We present a new clustering of beta-turn conformations from a set of 13,030 turns from 1074 ultra-high resolution protein structures (≤1.2 Å). Our clustering is derived from applying the DBSCAN and k-medoids algorithms to this data set with a metric commonly used in directional statistics applied to the set of dihedral angles from the second and third residues of each turn. We define 18 turn types compared to the 8 classical turn types in common use. We propose a new 2-letter nomenclature for all 18 beta-turn types using Ramachandran region names for the two central residues (e.g., 'A' and 'D' for alpha regions on the left side of the Ramachandran map and 'a' and 'd' for equivalent regions on the right-hand side; classical Type I turns are 'AD' turns and Type I' turns are 'ad'). We identify 11 new types of beta turn, 5 of which are sub-types of classical beta-turn types. Up-to-date statistics, probability densities of conformations, and sequence profiles of beta turns in loops were collected and analyzed. A library of turn types, BetaTurnLib18, and cross-platform software, BetaTurnTool18, which identifies turns in an input protein structure, are freely available and redistributable from dunbrack.fccc.edu/betaturn and github.com/sh-maxim/BetaTurn18. Given the ubiquitous nature of beta turns, this comprehensive study updates understanding of beta turns and should also provide useful tools for protein structure determination, refinement, and prediction programs.
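To make the idea of a directional-statistics metric concrete, the sketch below computes one common choice of distance between two sets of dihedral angles: the mean of 2(1 - cos Δθ) over paired angles. Treating this as the metric, and using four angles (phi and psi of the second and third residues), are assumptions for illustration; the abstract does not state the exact formula, and the function name `angular_distance` is hypothetical.

```python
import math

def angular_distance(turn_a, turn_b):
    """Mean of 2 * (1 - cos(delta)) over paired dihedral angles given in degrees."""
    terms = [
        2.0 * (1.0 - math.cos(math.radians(a - b)))
        for a, b in zip(turn_a, turn_b)
    ]
    return sum(terms) / len(terms)

# Made-up (phi2, psi2, phi3, psi3) values in degrees.
turn_1 = (-60.0, -30.0, -90.0, 179.0)
turn_2 = (-60.0, -30.0, -90.0, -179.0)

near = angular_distance(turn_1, turn_2)  # 179 and -179 differ by only 2 degrees
far = angular_distance((0.0, 0.0, 0.0, 0.0), (180.0, 180.0, 180.0, 180.0))
```

Because the cosine is periodic, angles of 179° and -179° are treated as neighbors, which a plain Euclidean distance on the raw angle values would not do. A distance of this kind can then be supplied as a precomputed matrix to a density-based or medoid-based clustering routine.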


Subject(s)
Proteins/chemistry , Terminology as Topic , Algorithms , Amino Acid Sequence , Amino Acids/chemistry , Cluster Analysis , Protein Conformation , Reproducibility of Results
7.
BMC Med Inform Decis Mak ; 18(Suppl 4): 123, 2018 12 12.
Article in English | MEDLINE | ID: mdl-30537974

ABSTRACT

BACKGROUND: There has been an increasing interest in learning low-dimensional vector representations of medical concepts from Electronic Health Records (EHRs). Vector representations of medical concepts facilitate exploratory analysis and predictive modeling of EHR data to gain insights about the patterns of care and health outcomes. EHRs contain structured data such as diagnostic codes and laboratory tests, as well as unstructured free text data in the form of clinical notes, which provide more detail about the condition and treatment of patients. METHODS: In this work, we propose a method that jointly learns vector representations of medical concepts and words. This is achieved by a novel learning scheme based on the word2vec model. Our model learns those relationships by integrating clinical notes and sets of accompanying medical codes and by defining joint contexts for each observed word and medical code. RESULTS: In our experiments, we learned joint representations using MIMIC-III data. Using the learned representations of words and medical codes, we evaluated phenotypes for 6 diseases discovered by our method and a baseline method. The experimental results show that for each of the 6 diseases our method finds highly relevant words. We also show that our representations can be very useful when predicting the reason for the next visit. CONCLUSIONS: The jointly learned representations of medical concepts and words capture not only similarity between codes or words themselves, but also similarity between codes and words. They can be used to extract phenotypes of different diseases. The representations learned by the joint model are also useful for construction of patient features.


Subject(s)
Electronic Health Records , Machine Learning , Natural Language Processing , Clinical Coding , Humans , Phenotype , Terminology as Topic , Vocabulary
8.
Am J Speech Lang Pathol ; 33(3): 1174-1192, 2024 May.
Article in English | MEDLINE | ID: mdl-38290536

ABSTRACT

PURPOSE: Augmentative and alternative communication (AAC) technology innovation is urgently needed to improve outcomes for children on the autism spectrum who are minimally verbal. One potential technology innovation is applying artificial intelligence (AI) to automate strategies such as augmented input to increase language learning opportunities while mitigating communication partner time and learning barriers. Innovation in AAC research and design methodology is also needed to empirically explore this and other applications of AI to AAC. The purpose of this report was to describe (a) the development of an AAC prototype using a design methodology new to AAC research and (b) a preliminary investigation of the efficacy of this potential new AAC capability. METHOD: The prototype was developed using a Wizard-of-Oz prototyping approach that allows for initial exploration of a new technology capability without the time and effort required for full-scale development. The preliminary investigation with three children on the autism spectrum who were minimally verbal used an adapted alternating treatment design to compare the effects of a Wizard-of-Oz prototype that provided automated augmented input (i.e., pairing color photos with speech) to a standard topic display (i.e., a grid display with line drawings) on visual attention, linguistic participation, and (for one participant) word learning during a circle activity. RESULTS: Preliminary investigation results were variable, but overall participants increased visual attention and linguistic participation when using the prototype. CONCLUSIONS: Wizard-of-Oz prototyping could be a valuable approach to spur much needed innovation in AAC. Further research into efficacy, reliability, validity, and attitudes is required to more comprehensively evaluate the use of AI to automate augmented input in AAC.


Subject(s)
Autism Spectrum Disorder , Communication Aids for Disabled , Humans , Autism Spectrum Disorder/therapy , Male , Child , Female , Artificial Intelligence , Child, Preschool , Child Language , Preliminary Data
9.
BMC Bioinformatics ; 14 Suppl 3: S8, 2013.
Article in English | MEDLINE | ID: mdl-23514608

ABSTRACT

BACKGROUND: Protein function determination is a key challenge in the post-genomic era. Experimental determination of protein functions is accurate, but time-consuming and resource-intensive. A cost-effective alternative is to use the known information about sequence, structure, and functional properties of genes and proteins to predict functions using statistical methods. In this paper, we describe the Multi-Source k-Nearest Neighbor (MS-kNN) algorithm for function prediction, which finds k-nearest neighbors of a query protein based on different types of similarity measures and predicts its function by weighted averaging of its neighbors' functions. Specifically, we used 3 data sources to calculate the similarity scores: sequence similarity, protein-protein interactions, and gene expressions. RESULTS: We report the results in the context of the 2011 Critical Assessment of Function Annotation (CAFA). Prior to the CAFA submission deadline, we evaluated our algorithm on 1,302 human test proteins that were represented in all 3 data sources. Using only the sequence similarity information, MS-kNN had term-based Area Under the Curve (AUC) accuracy of Gene Ontology (GO) molecular function predictions of 0.728 when 7,412 human training proteins were used, and 0.819 when 35,622 training proteins from multiple eukaryotic and prokaryotic organisms were used. By aggregating predictions from all three sources, the AUC was further improved to 0.848. A similar result was observed on prediction of GO biological processes. Testing on 595 proteins that were annotated after the CAFA submission deadline showed that overall MS-kNN accuracy was higher than that of baseline algorithms Gotcha and BLAST, which were based solely on sequence similarity information. Since only 10 of the 595 proteins were represented by all 3 data sources, and 66 by two data sources, the difference between 3-source and one-source MS-kNN was rather small.
CONCLUSIONS: Based on our results, we have several useful insights: (1) the k-nearest neighbor algorithm is an efficient and effective model for protein function prediction; (2) it is beneficial to transfer functions across a wide range of organisms; (3) it is helpful to integrate multiple sources of protein information.
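The prediction step described above (find the k most similar proteins in each data source, then average their functions weighted by similarity) can be sketched as follows. This is a simplified reading of the abstract, not the published MS-kNN implementation: normalizing by the sum of neighbor similarities, weighting the sources equally, and all names and toy values below are assumptions.

```python
def knn_scores(similarities, annotations, k):
    """Score function terms for one query protein using a single data source.

    similarities: {protein_id: similarity to the query}
    annotations:  {protein_id: set of function terms}
    """
    neighbors = sorted(similarities.items(), key=lambda kv: kv[1], reverse=True)[:k]
    total = sum(sim for _, sim in neighbors)
    scores = {}
    if total == 0:
        return scores
    for protein_id, sim in neighbors:
        for term in annotations.get(protein_id, ()):
            # Each neighbor votes for its terms in proportion to its similarity.
            scores[term] = scores.get(term, 0.0) + sim / total
    return scores

def multi_source_scores(sources, annotations, k):
    """Average per-source scores; equal source weights are an assumption."""
    combined = {}
    for similarities in sources:
        for term, score in knn_scores(similarities, annotations, k).items():
            combined[term] = combined.get(term, 0.0) + score / len(sources)
    return combined

# Made-up training annotations and similarity scores for one query protein.
annotations = {"P1": {"binding"}, "P2": {"binding", "catalysis"}, "P3": {"catalysis"}}
sequence_sim = {"P1": 0.75, "P2": 0.25, "P3": 0.5}
interaction_sim = {"P2": 1.0, "P3": 1.0}

result = multi_source_scores([sequence_sim, interaction_sim], annotations, k=2)
```

With these toy inputs, the sequence source favors "binding" while the interaction source favors "catalysis", and the combined score reflects both, which illustrates why integrating sources can change the ranking of predicted functions.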


Subject(s)
Algorithms , Proteins/physiology , Genomics , Humans , Protein Interaction Mapping , Proteins/chemistry , Proteins/genetics , Sequence Analysis, Protein , Transcriptome , Vocabulary, Controlled
10.
Methodology (Gott) ; 19(1): 43-59, 2023.
Article in English | MEDLINE | ID: mdl-37090814

ABSTRACT

Identification of procedures using International Classification of Diseases or Healthcare Common Procedure Coding System codes is challenging when conducting medical claims research. We demonstrate how Pointwise Mutual Information can be used to find associated codes. We apply the method to an investigation of racial differences in breast cancer outcomes. We used Surveillance Epidemiology and End Results (SEER) data linked to Medicare claims. We identified treatment using two methods. First, we used previously published definitions. Second, we augmented definitions using codes empirically identified by the Pointwise Mutual Information statistic. Similar to previous findings, we found that presentation differences between Black and White women closed much of the estimated survival curve gap. However, we found that survival disparities were completely eliminated with the augmented treatment definitions. We were able to control for a wider range of treatment patterns that might affect survival differences between Black and White women with breast cancer.

11.
Cancer Rep (Hoboken) ; 6(5): e1805, 2023 05.
Article in English | MEDLINE | ID: mdl-36943210

ABSTRACT

BACKGROUND: Additional evaluations, including second opinions, before breast cancer surgery may improve care, but may cause detrimental treatment delays that could allow disease progression. AIMS: We investigate the timing of surgical delays that are associated with survival benefits conferred by preoperative encounters versus the timing that are associated with potential harm. METHODS AND RESULTS: We investigated survival outcomes of SEER Medicare patients with stage 1-3 breast cancer using propensity score-based weighting. We examined interactions between the number of preoperative evaluation components and time from biopsy to definitive surgery. Components include new patient visits, unique surgeons, medical oncologists, or radiation oncologists consulted, established patient encounters, biopsies, and imaging studies. We identified 116 050 cases of whom 99% were female and had an average age of 75.0 (SD = 6.2). We found that new patient visits have a protective association with respect to breast cancer mortality if they occur quickly after diagnosis with breast cancer mortality subdistribution Hazard Ratios [sHRs] = 0.87 (95% Confidence Interval [CI] 0.76-1.00) for 2, 0.71 (CI 0.55-0.92) for 3, and 0.63 (CI 0.37-1.07) for 4+ visits at minimal delay. New patient visits predict worsened mortality compared with no visits if the surgical delay is greater than 33 days (CI 14-53) for 2, 33 days (CI 17-49) for 3, and 44 days (CI 12-75) for 4+. Medical oncologist visits predict worse outcomes if the surgical delay is greater than 29 days (CI 20-39) for 1 and 38 days (CI 12-65) for 2+ visits. Similarly, surgeon encounters switch from a positive to a negative association if the surgical delay exceeds 29 days (CI 17-41) for 1 visit, but the positive estimate persists over time for 3+ surgeon visits. CONCLUSION: Preoperative visits that cause substantial delays may be associated with increased mortality in older patients with breast cancer.


Subject(s)
Breast Neoplasms , Humans , Female , Aged , United States , Male , Breast Neoplasms/diagnosis , Breast Neoplasms/surgery , Breast Neoplasms/pathology , Medicare , Referral and Consultation , Mastectomy/adverse effects , Proportional Hazards Models
12.
AMIA Annu Symp Proc ; 2022: 425-431, 2022.
Article in English | MEDLINE | ID: mdl-37128402

ABSTRACT

Relation Extraction (RE) is an important task in extracting structured data from free biomedical text. Obtaining labeled data needed to train RE models in specialized domains such as biomedicine can be very expensive because it requires expert knowledge. Thus, it is often the case that RE models need to be trained from relatively small labeled data sets. Despite the recent advances in Natural Language Processing (NLP) approaches for RE, training accurate RE models from small labeled data is still an open challenge. In this paper, we propose MERIT, a simple and effective approach for label augmentation that automatically increases the size of labeled data while introducing moderate labeling noise. We performed extensive experiments on three benchmark biomedical RE data sets. The results demonstrate the effectiveness of MERIT compared to the baseline.


Subject(s)
Natural Language Processing , Humans
13.
Proc ACM Int Conf Inf Knowl Manag ; 2022: 4828-4832, 2022 Oct.
Article in English | MEDLINE | ID: mdl-36636516

ABSTRACT

Healthcare providers generate a medical claim after every patient visit. A medical claim consists of a list of medical codes describing the diagnosis and any treatment provided during the visit. Medical claims have been popular in medical research as a data source for retrospective cohort studies. This paper introduces a medical claim visualization system (MedCV) that supports cohort selection from medical claim data. MedCV was developed as part of a design study in collaboration with clinical researchers and statisticians. It helps a researcher to define inclusion rules for cohort selection by revealing relationships between medical codes and visualizing medical claims and patient timelines. Evaluation of our system through a user study indicates that MedCV enables domain experts to define high-quality inclusion rules in a time-efficient manner.

14.
SSM Popul Health ; 17: 101023, 2022 Mar.
Article in English | MEDLINE | ID: mdl-35097183

ABSTRACT

Given the growing number of cancer survivors, it is important to better understand socio-spatial mobility patterns of cancer patients after diagnosis that could have public health implications regarding post-diagnostic access to care for treatment and follow-up surveillance. In this exploratory study, residential histories from LexisNexis were linked to New Jersey colon cancer cases diagnosed from 2006 to 2011 to examine differences in socio-spatial mobility patterns after diagnosis by stage at cancer diagnosis, sex, and race/ethnicity. For the colon cancer cases, we summarized and compared the number of residences and changes in the residential census tract and neighborhood poverty after the diagnosis. We found only minor changes in neighborhood poverty among the cases during the follow-up period after diagnosis. During the follow-up period of up to 10 years after diagnosis, 67% of the patients did not move to a different residential census tract, and 10.8% moved from New Jersey to another state. Cases that moved to a different census tract after diagnosis were generally less wealthy than non-movers, but the destination of relocation varied by race/ethnicity and socioeconomic status. We also found a significant association between residential mobility and stage at diagnosis, whereby patients diagnosed with colon cancer at an early stage were more likely to be movers. This study contributes to understanding of the socio-spatial mobility patterns in colon cancer patients and may help to inform cancer research by summarizing the extent to which colon cancer patients move after diagnosis.

15.
Nat Commun ; 12(1): 6302, 2021 11 02.
Article in English | MEDLINE | ID: mdl-34728624

ABSTRACT

Potts models and variational autoencoders (VAEs) have recently gained popularity as generative protein sequence models (GPSMs) to explore fitness landscapes and predict mutation effects. Despite encouraging results, current model evaluation metrics leave unclear whether GPSMs faithfully reproduce the complex multi-residue mutational patterns observed in natural sequences due to epistasis. Here, we develop a set of sequence statistics to assess the "generative capacity" of three current GPSMs: the pairwise Potts Hamiltonian, the VAE, and the site-independent model. We show that the Potts model's generative capacity is largest, as the higher-order mutational statistics generated by the model agree with those observed for natural sequences, while the VAE's lies between the Potts and site-independent models. Importantly, our work provides a new framework for evaluating and interpreting GPSM accuracy which emphasizes the role of higher-order covariation and epistasis, with broader implications for probabilistic sequence models in general.


Subject(s)
Mutation , Proteins/chemistry , Sequence Alignment/methods , Algorithms , Amino Acid Sequence , Computer Simulation , Databases, Protein , Humans , Models, Statistical , Protein Structural Elements , Proteins/genetics , Structure-Activity Relationship
16.
Article in English | MEDLINE | ID: mdl-33946680

ABSTRACT

Landscape characteristics have been shown to influence health outcomes, but few studies have examined their relationship with cancer survival. We used data from the National Land Cover Database to examine associations between regional-stage colon cancer survival and 27 different landscape metrics. The study population included all adult New Jersey residents diagnosed between 2006 and 2011. Cases were followed until 31 December 2016 (N = 3949). Patient data were derived from the New Jersey State Cancer Registry and were linked to LexisNexis to obtain residential histories. Cox proportional hazard regression was used to estimate hazard ratios (HR) and 95% confidence intervals (CI95) for the different landscape metrics. An increasing proportion of high-intensity developed lands with 80-100% impervious surfaces per cell/pixel was significantly associated with the risk of colon cancer death (HR = 1.006; CI95 = 1.002-1.01) after controlling for neighborhood poverty and other individual-level factors. In contrast, an increase in the aggregation and connectivity of vegetation-dominated low-intensity developed lands with 20-<40% impervious surfaces per cell/pixel was significantly associated with the decrease in risk of death from colon cancer (HR = 0.996; CI95 = 0.992-0.999). Reducing impervious surfaces in residential areas may increase the aesthetic value and provide conditions more advantageous to a healthy lifestyle, such as walking. Further research is needed to understand how these landscape characteristics impact survival.


Subject(s)
Colonic Neoplasms , Residence Characteristics , Adult , Colonic Neoplasms/epidemiology , Humans , New Jersey/epidemiology , Poverty , Proportional Hazards Models
17.
Sci Adv ; 7(17)2021 Apr.
Article in English | MEDLINE | ID: mdl-33883136

ABSTRACT

Incorporation of physical principles in a machine learning (ML) architecture is a fundamental step toward the continued development of artificial intelligence for inorganic materials. Inspired by Pauling's rule, we propose that structure motifs in inorganic crystals can serve as a central input to a machine learning framework. We demonstrated that the presence of structure motifs and their connections in a large set of crystalline compounds can be converted into unique vector representations using an unsupervised learning algorithm. To demonstrate the use of structure motif information, a motif-centric learning framework is created by combining motif information with atom-based graph neural networks to form an atom-motif dual graph network (AMDNet), which is more accurate in predicting the electronic structures of metal oxides such as bandgaps. The work illustrates the route toward fundamental design of graph neural network learning architecture for complex materials by incorporating beyond-atom physical principles.

18.
PLoS One ; 15(5): e0232528, 2020.
Article in English | MEDLINE | ID: mdl-32374785

ABSTRACT

Protein secondary structure prediction remains a vital topic with broad applications. Due to lack of a widely accepted standard in secondary structure predictor evaluation, a fair comparison of predictors is challenging. A detailed examination of factors that contribute to higher accuracy is also lacking. In this paper, we present: (1) new test sets, Test2018, Test2019, and Test2018-2019, consisting of proteins from structures released in 2018 and 2019 with less than 25% identity to any protein published before 2018; (2) a 4-layer convolutional neural network, SecNet, with an input window of ±14 amino acids which was trained on proteins ≤25% identical to proteins in Test2018 and the commonly used CB513 test set; (3) an additional test set that shares no homologous domains with the training set proteins, according to the Evolutionary Classification of Proteins (ECOD) database; (4) a detailed ablation study where we reverse one algorithmic choice at a time in SecNet and evaluate the effect on the prediction accuracy; (5) new 4- and 5-label prediction alphabets that may be more practical for tertiary structure prediction methods. The 3-label accuracy (helix, sheet, coil) of the leading predictors on both Test2018 and CB513 is 81-82%, while SecNet's accuracy is 84% for both sets. Accuracy on the non-homologous ECOD set is only 0.6 points (83.9%) lower than the results on the Test2018-2019 set (84.5%). The ablation study of features, neural network architecture, and training hyper-parameters suggests the best accuracy results are achieved with good choices for each of them while the neural network architecture is not as critical as long as it is not too simple. Protocols for generating and using unbiased test, validation, and training sets are provided. Our data sets, including input features and assigned labels, and SecNet software including third-party dependencies and databases, are downloadable from dunbrack.fccc.edu/ss and github.com/sh-maxim/ss.


Subject(s)
Neural Networks, Computer , Protein Structure, Secondary , Algorithms , Amino Acid Sequence , Amino Acids/chemistry , Databases, Protein/statistics & numerical data , Deep Learning , Proteins/chemistry , Software
19.
Cancer Epidemiol Biomarkers Prev ; 29(11): 2119-2125, 2020 11.
Article in English | MEDLINE | ID: mdl-32759382

ABSTRACT

BACKGROUND: Identifying geospatial cancer survival disparities is critical to focus interventions and prioritize efforts with limited resources. Incorporating residential mobility into spatial models may result in different geographic patterns of survival compared with the standard approach using a single location based on the patient's residence at the time of diagnosis. METHODS: Data on 3,949 regional-stage colon cancer cases diagnosed from 2006 to 2011 and followed until December 31, 2016, were obtained from the New Jersey State Cancer Registry. Geographic disparity based on the spatial variance and effect sizes from a Bayesian spatial model using residence at diagnosis was compared with a time-varying spatial model using residential histories [adjusted for sex, gender, substage, race/ethnicity, and census tract (CT) poverty]. Geographic estimates of risk of colon cancer death were mapped. RESULTS: Most patients (65%) remained at the same residence, 22% changed CT, and 12% moved out of state. The time-varying model produced a wider range of adjusted risk of colon cancer death (0.85-1.20 vs. 0.94-1.11) and resulted in greater geographic disparity statewide after adjustment (25.5% vs. 14.2%) compared with the model with only the residence at diagnosis. CONCLUSIONS: Including residential mobility may allow for more precise estimates of spatial risk of death. Results based on the traditional approach using only residence at diagnosis were not substantially different for regional stage colon cancer in New Jersey. IMPACT: Including residential histories opens up new avenues of inquiry to better understand the complex relationships between people and places, and the effect of residential mobility on cancer outcomes. See related commentary by Williams, p. 2107.


Subject(s)
Colonic Neoplasms , Residence Characteristics , Bayes Theorem , Colonic Neoplasms/epidemiology , Humans , New Jersey/epidemiology , Population Dynamics
20.
BMC Genomics ; 10 Suppl 1: S7, 2009 Jul 07.
Article in English | MEDLINE | ID: mdl-19594884

ABSTRACT

BACKGROUND: Intrinsically disordered proteins (IDPs) and intrinsically disordered regions (IDRs) lack stable tertiary and/or secondary structure yet fulfill key biological functions. The recent recognition of IDPs and IDRs is leading to an entire field aimed at their systematic structural characterization and at determination of their mechanisms of action. Bioinformatics studies showed that IDPs and IDRs are highly abundant in different proteomes and carry out mostly regulatory functions related to molecular recognition and signal transduction. These activities complement the functions of structured proteins. IDPs and IDRs were shown to participate in both one-to-many and many-to-one signaling. Alternative splicing and posttranslational modifications are frequently used to tune the IDP functionality. Several individual IDPs were shown to be associated with human diseases, such as cancer, cardiovascular disease, amyloidoses, diabetes, neurodegenerative diseases, and others. This raises questions regarding the involvement of IDPs and IDRs in various diseases. RESULTS: IDPs and IDRs were shown to be highly abundant in proteins associated with various human maladies. As the number of IDPs related to various diseases was found to be very large, the concepts of the disease-related unfoldome and unfoldomics were introduced. Novel bioinformatics tools were proposed to populate and characterize the disease-associated unfoldome. Structural characterization of the members of the disease-related unfoldome requires specialized experimental approaches. IDPs possess a number of unique structural and functional features that determine their broad involvement in the pathogenesis of various diseases. CONCLUSION: Proteins associated with various human diseases are enriched in intrinsic disorder. These disease-associated IDPs and IDRs are real, abundant, diversified, vital, and dynamic. These proteins and regions comprise the disease-related unfoldome, which covers a significant part of the human proteome. Profound association between intrinsic disorder and various human diseases is determined by a set of unique structural and functional characteristics of IDPs and IDRs. Unfoldomics of human diseases utilizes unrivaled bioinformatics and experimental techniques, paves the road for better understanding of human diseases, their pathogenesis and molecular mechanisms, and helps develop new strategies for the analysis of disease-related proteins.


Subject(s)
Computational Biology/methods , Protein Folding , Proteins/chemistry , Proteins/metabolism , Alternative Splicing , Humans , Protein Processing, Post-Translational , Protein Structure, Secondary , Protein Structure, Tertiary , Structure-Activity Relationship