ABSTRACT
BACKGROUND: Here we introduce the Protein Sequence Annotation Tool (PSAT), a web-based sequence annotation meta-server for performing integrated, high-throughput, genome-wide sequence analyses. Our goals in building PSAT were to (1) create an extensible platform for integration of multiple sequence-based bioinformatics tools, (2) enable functional annotations and enzyme predictions over large input protein FASTA data sets, and (3) provide a web interface for convenient execution of the tools. RESULTS: In this paper, we demonstrate the utility of PSAT by annotating the predicted peptide gene products of Herbaspirillum sp. strain RV1423, importing the results of PSAT into EC2KEGG, and using the resulting functional comparisons to identify a putative catabolic pathway, thereby distinguishing RV1423 from a well-annotated Herbaspirillum species. This analysis demonstrates that high-throughput enzyme predictions, provided by PSAT processing, can be used to identify metabolic potential in an otherwise poorly annotated genome. CONCLUSIONS: PSAT is a meta-server that combines the results from several sequence-based annotation and function prediction codes, and is available at http://psat.llnl.gov/psat/. PSAT stands apart from other sequence-based genome annotation systems in providing a high-throughput platform for rapid de novo enzyme predictions and sequence annotations over large input protein sequence data sets in FASTA format. PSAT is most appropriately applied to the annotation of large protein FASTA sets that may or may not be associated with a single genome.
Subject(s)
Bacterial Genome, Herbaspirillum/genetics, High-Throughput Nucleotide Sequencing/methods, Internet, Molecular Sequence Annotation/methods, Software, Computational Biology/methods, Computers, Water Microbiology
ABSTRACT
Previous cross-sectional analyses demonstrated that CD8(+) and CD4(+) T-cell reactivity to islet-specific antigens was more prevalent in T1D subjects than in healthy donors (HD). Here, we examined T1D-associated epitope-specific CD4(+) T-cell cytokine production and autoreactive CD8(+) T-cell frequency on a monthly basis for one year in 10 HD, 33 subjects with T1D, and 15 subjects with T2D. Autoreactive CD4(+) T-cells from both T1D and T2D subjects produced more IFN-γ when stimulated than cells from HD. In contrast, higher frequencies of islet antigen-specific CD8(+) T-cells were detected only in T1D. These observations support the hypothesis that general beta-cell stress drives autoreactive CD4(+) T-cell activity while islet over-expression of MHC class I commonly seen in T1D mediates amplification of CD8(+) T-cells and more rapid beta-cell loss. In conclusion, CD4(+) T-cell autoreactivity appears to be present in both T1D and T2D while autoreactive CD8(+) T-cells are unique to T1D. Thus, autoreactive CD8(+) cells may serve as a more T1D-specific biomarker.
Subject(s)
Autoantigens/immunology, CD4-Positive T-Lymphocytes/immunology, CD8-Positive T-Lymphocytes/immunology, Type 1 Diabetes Mellitus/immunology, Type 2 Diabetes Mellitus/immunology, Islets of Langerhans/immunology, Adult, Aged, CD4-Positive T-Lymphocytes/pathology, CD8-Positive T-Lymphocytes/pathology, Case-Control Studies, Immunologic Cytotoxicity, Type 1 Diabetes Mellitus/pathology, Type 2 Diabetes Mellitus/pathology, Enzyme-Linked Immunospot Assay, Female, Humans, Interferon-gamma/biosynthesis, Islets of Langerhans/pathology, Longitudinal Studies, Male, Middle Aged
ABSTRACT
BACKGROUND: Methods of weakening and attenuating pathogens' abilities to infect and propagate in a host, thus allowing the natural immune system to more easily decimate invaders, have gained attention as alternatives to broad-spectrum targeting approaches. The following work describes a technique for identifying proteins involved in virulence by relying on latent information computationally gathered across biological repositories, applicable to both generic and specific virulence categories. RESULTS: A lightweight method for data integration is used, which links information regarding a protein via a path-based query graph. A method of weighting is then applied to query graphs that can serve as input to various statistical classification methods for discrimination, and the combined usage of data integration and learning methods is tested against the problem of both generalized and specific virulence function prediction. CONCLUSIONS: This approach improves coverage of functional data over a protein. Moreover, although it depends largely on noisy and potentially non-curated data from public sources, we find that it outperforms other techniques for identification of general virulence factors and baseline remote homology detection methods for specific virulence categories.
Subject(s)
Proteins/classification, Protein Sequence Analysis/methods, Protein Sequence Analysis/statistics & numerical data, Virulence Factors/classification, Statistical Data Interpretation, Protein Databases, Proteins/chemistry, Virulence, Virulence Factors/chemistry
ABSTRACT
Though there have been many advances in providing access to linked and integrated biomedical data across repositories, developing methods which allow users to specify ambiguous and exploratory queries over disparate sources remains a challenge to extracting well-curated or diversely-supported biological information. In the following work, we discuss the concepts of data coverage and evidence in the context of integrated sources. We address diverse information retrieval via a simple framework for representing coverage and evidence that operates in parallel with an arbitrary schema, and a language upon which queries on the schema and framework may be executed. We show that this approach is capable of answering questions that require ranged levels of evidence or triangulation, and demonstrate that appropriately-formed queries can significantly improve the level of precision when retrieving well-supported biomedical data.
Subject(s)
Factual Databases, Information Storage and Retrieval/methods, Biomedical Research, Internet, Semantics
ABSTRACT
Genome wide association studies (GWAS) are an important approach to understanding the genetic mechanisms behind human diseases. Single nucleotide polymorphisms (SNPs) are the predominant markers used in genome wide association studies, and the ability to predict which SNPs are likely to be functional is important for both a priori and a posteriori analyses of GWA studies. This article describes the design, implementation and evaluation of a family of systems for the purpose of identifying SNPs that may cause a change in phenotypic outcomes. The methods described in this article characterize the feasibility of combinations of logical and probabilistic inference with federated data integration for both point and regional SNP annotation and analysis. Evaluations of the methods demonstrate the overall strong predictive value of logical, and logical with probabilistic, inference applied to the domain of SNP annotation.
Subject(s)
Statistical Models, Single Nucleotide Polymorphism, Genetic Databases, Genome-Wide Association Study/methods, Logic
ABSTRACT
BACKGROUND: Genes conferring antibiotic resistance to groups of bacterial pathogens are cause for considerable concern, as many once-reliable antibiotics continue to see a reduction in efficacy. The recent discovery of the metallo-β-lactamase (MBL) blaNDM-1 gene, which appears to grant antibiotic resistance to a variety of Enterobacteriaceae via a mobile plasmid, is one example of this distressing trend. The following work describes a computational analysis of pathogen-borne MBLs that focuses on the structural aspects of characterized proteins. RESULTS: Using both sequence and structural analyses, we examine residues and structural features specific to various pathogen-borne MBL types. This analysis identifies a linker region within MBL-like folds that may act as a discriminating structural feature between these proteins, and specifically resistance-associated acquirable MBLs. Recently released crystal structures of the newly emerged NDM-1 protein were aligned against related MBL structures using a variety of global and local structural alignment methods, and the overall fold conformation was examined for structural conservation. Conservation appears to be present in most areas of the protein, yet is strikingly absent within a linker region, making NDM-1 unique with respect to a linker-based classification scheme. Variability analysis of the NDM-1 crystal structure highlights unique residues in key regions and identifies several characteristics shared with other transferable MBLs. CONCLUSIONS: A discriminating linker region identified in MBL proteins is highlighted and examined in the context of NDM-1 and primarily three other MBL types: IMP-1, VIM-2 and ccrA. The presence of an unusual linker region variant and uncommon amino acid composition at specific structurally important sites may help to explain the unusually broad kinetic profile of NDM-1 and may aid in directing research attention to areas of this protein, and possibly other MBLs, that may be targeted for inactivation or attenuation of enzymatic activity.
ABSTRACT
BACKGROUND: Extracting medication information from clinical records has many potential applications, and recently published research, systems, and competitions reflect an interest therein. Much of the early extraction work involved rules and lexicons, but more recently machine learning has been applied to the task. METHODS: We present a hybrid system consisting of two parts. The first part, field detection, uses a cascade of statistical classifiers to identify medication-related named entities. The second part uses simple heuristics to link those entities into medication events. RESULTS: The system achieved performance that is comparable to other approaches to the same task. This performance is further improved by adding features that reference external medication name lists. CONCLUSIONS: This study demonstrates that our hybrid approach outperforms purely statistical or rule-based systems. The study also shows that a cascade of classifiers works better than a single classifier in extracting medication information. The system is available as is upon request from the first author.
ABSTRACT
The Third i2b2 Workshop on Natural Language Processing Challenges for Clinical Records focused on the identification of medications, their dosages, modes (routes) of administration, frequencies, durations, and reasons for administration in discharge summaries. This challenge is referred to as the medication challenge. For the medication challenge, i2b2 released detailed annotation guidelines along with a set of annotated discharge summaries. Twenty teams representing 23 organizations and nine countries participated in the medication challenge. The teams produced rule-based, machine learning, and hybrid systems targeted to the task. Although rule-based systems dominated the top 10, the best performing system was a hybrid. Of all medication-related fields, durations and reasons were the most difficult for all systems to detect. While medications themselves were identified with better than 0.75 F-measure by all of the top 10 systems, the best F-measures for durations and reasons were 0.525 and 0.459, respectively. State-of-the-art natural language processing systems go a long way toward extracting medication names, dosages, modes, and frequencies. However, they are limited in recognizing duration and reason fields and would benefit from future research.
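For readers unfamiliar with the metric reported above, the following is a minimal sketch of how an F-measure can be computed from match counts. It assumes the balanced form (the harmonic mean of precision and recall); the abstract does not state which variant or matching criterion the challenge used, and the function name and example counts are hypothetical.

```python
def f_measure(true_positives: int, false_positives: int, false_negatives: int) -> float:
    """Balanced F-measure: harmonic mean of precision and recall.

    Assumption: the balanced (F1) variant; the source does not specify one.
    """
    predicted = true_positives + false_positives  # items the system returned
    actual = true_positives + false_negatives     # items in the reference annotation
    if predicted == 0 or actual == 0:
        return 0.0
    precision = true_positives / predicted
    recall = true_positives / actual
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Hypothetical counts for a single field; not taken from the challenge results.
score = f_measure(true_positives=30, false_positives=20, false_negatives=30)
print(round(score, 3))
```

Because the harmonic mean is pulled toward the lower of the two values, a system cannot reach a high F-measure on a field by returning many spurious entries or by returning very few.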
Subject(s)
Electronic Health Records, Information Storage and Retrieval/methods, Natural Language Processing, Pharmaceutical Preparations, Hybrid Computers, Humans, Patient Dropouts
ABSTRACT
OBJECTIVE: Within the context of the Third i2b2 Workshop on Natural Language Processing Challenges for Clinical Records, the authors (also referred to as 'the i2b2 medication challenge team' or 'the i2b2 team' for short) organized a community annotation experiment. DESIGN: For this experiment, the authors released annotation guidelines and a small set of annotated discharge summaries. They asked the participants of the Third i2b2 Workshop to annotate 10 discharge summaries per person; each discharge summary was annotated by two annotators from two different teams, and a third annotator from a third team resolved disagreements. MEASUREMENTS: In order to evaluate the reliability of the annotations thus produced, the authors measured community inter-annotator agreement and compared it with the inter-annotator agreement of expert annotators when both the community and the expert annotators generated ground truth based on pooled system outputs. For this purpose, the pool consisted of the three most densely populated automatic annotations of each record. The authors also compared the community inter-annotator agreement with expert inter-annotator agreement when the experts annotated raw records without using the pool. Finally, they measured the quality of the community ground truth by comparing it with the expert ground truth. RESULTS AND CONCLUSIONS: The authors found that the community annotators achieved comparable inter-annotator agreement to expert annotators, regardless of whether the experts annotated from the pool. Furthermore, the ground truth generated by the community obtained F-measures above 0.90 against the ground truth of the experts, indicating the value of the community as a source of high-quality ground truth even on intricate and domain-specific annotation tasks.
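The double-annotation design described above (two annotators per record, with a third resolving disagreements) can be sketched as follows. This is only an illustration under stated assumptions: annotations are represented as hashable tuples, agreement means exact equality, and the `resolve` callback standing in for the third annotator is hypothetical. The abstract does not describe the actual data format or adjudication rules.

```python
from typing import Callable, Hashable, Set

# Assumption: an annotation is any hashable value, e.g. a (field, start, end) tuple.
Annotation = Hashable


def adjudicate(first: Set[Annotation],
               second: Set[Annotation],
               resolve: Callable[[Annotation], bool]) -> Set[Annotation]:
    """Merge two annotators' output into one ground-truth set.

    Items both annotators produced are accepted as-is; items produced by only
    one of them are passed to `resolve` (the third annotator), which keeps or
    rejects each one.
    """
    agreed = first & second
    disputed = first ^ second  # symmetric difference: marked by exactly one annotator
    accepted = {item for item in disputed if resolve(item)}
    return agreed | accepted


# Hypothetical example data.
annotator_a = {("medication", 0, 7), ("dosage", 8, 13)}
annotator_b = {("medication", 0, 7), ("frequency", 14, 19)}
ground_truth = adjudicate(annotator_a, annotator_b,
                          resolve=lambda item: item[0] == "dosage")
print(sorted(ground_truth))
```

Only the disputed items reach the third annotator, which keeps the adjudication effort proportional to the amount of disagreement rather than to the size of the record.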
Subject(s)
Electronic Health Records, Information Storage and Retrieval/methods, Natural Language Processing, Pharmaceutical Preparations, Humans, Patient Discharge
ABSTRACT
In the following work, we test a generalized approach to integrating, transforming and learning data from disparate data sources for the classification of bacterial proteins involved in pathogenesis. We rely on the implicit inter-linkages between biological databases to draw relevant records, and leverage statistical learning methods to infer classification based on abundant, albeit noisy, data. Results suggest that types of public biological information have varying degrees of effectiveness in predictive data mining.
Subject(s)
Artificial Intelligence, Bacterial Proteins/classification, Bacterial Toxins/classification, Protein Databases, Automated Pattern Recognition/methods, Terminology as Topic, Virulence Factors/classification, Algorithms, Information Storage and Retrieval/methods, Natural Language Processing
ABSTRACT
Scientists working on genomics projects are often faced with the difficult task of sifting through large amounts of biological information dispersed across various online data sources that are relevant to their area or organism of research. Gene annotation, the process of identifying the functional role of a possible gene, has become increasingly time-consuming and laborious as more genomes are sequenced and the number of candidate genes continues to increase at a near-exponential pace; genes are left un-annotated or, worse, incorrectly annotated. Many groups have attempted to address the annotation backlog through automated annotation systems that are geared toward specific organisms, and which may thus not possess the necessary flexibility and scalability to annotate other genomes. In this paper, we present a method and framework that attempts to address problems inherent in manual and automatic annotation by coupling a data integration system, BioMediator, to an inference engine with the aim of elucidating functional annotations. The framework and heuristics developed are not specific to any particular genome. We validated the method with a set of randomly selected annotated sequences from a variety of organisms. Preliminary results show that the hybrid data integration and inference approach generates functional annotations that are as good as or better than "gold standard" annotations approximately 80% of the time.
Subject(s)
Computational Biology, Genetic Databases, Genomics/statistics & numerical data, Computer Systems, Statistical Data Interpretation, Expert Systems, Software
ABSTRACT
A comparison of gene content and genome architecture of Trypanosoma brucei, Trypanosoma cruzi, and Leishmania major, three related pathogens with different life cycles and disease pathology, revealed a conserved core proteome of about 6200 genes in large syntenic polycistronic gene clusters. Many species-specific genes, especially large surface antigen families, occur at nonsyntenic chromosome-internal and subtelomeric regions. Retroelements, structural RNAs, and gene family expansion are often associated with syntenic discontinuities that, along with gene divergence, acquisition and loss, and rearrangement within the syntenic regions, have shaped the genomes of each parasite. Contrary to recent reports, our analyses reveal no evidence that these species are descended from an ancestor that contained a photosynthetic endosymbiont.
Subject(s)
Protozoan Genome, Leishmania major/genetics, Proteome, Protozoan Proteins/genetics, Trypanosoma brucei brucei/genetics, Trypanosoma cruzi/genetics, Animals, Biological Evolution, Chromosomes/genetics, Molecular Evolution, Horizontal Gene Transfer, Protozoan Genes, Genomics, Leishmania major/chemistry, Leishmania major/metabolism, Molecular Sequence Data, Multigene Family, Mutation, Phylogeny, Plastids/genetics, Protozoan Proteins/chemistry, Protozoan Proteins/physiology, Genetic Recombination, Retroelements, Species Specificity, Symbiosis, Synteny, Telomere/genetics, Trypanosoma brucei brucei/chemistry, Trypanosoma brucei brucei/metabolism, Trypanosoma cruzi/chemistry, Trypanosoma cruzi/metabolism
ABSTRACT
Whole-genome sequencing of the protozoan pathogen Trypanosoma cruzi revealed that the diploid genome contains a predicted 22,570 proteins encoded by genes, of which 12,570 represent allelic pairs. Over 50% of the genome consists of repeated sequences, such as retrotransposons and genes for large families of surface molecules, which include trans-sialidases, mucins, gp63s, and a large novel family (>1300 copies) of mucin-associated surface protein (MASP) genes. Analyses of the T. cruzi, T. brucei, and Leishmania major (Tritryp) genomes imply differences from other eukaryotes in DNA repair and initiation of replication and reflect their unusual mitochondrial DNA. Although the Tritryp lack several classes of signaling molecules, their kinomes contain a large and diverse set of protein kinases and phosphatases; their size and diversity imply previously unknown interactions and regulatory processes, which may be targets for intervention.
Subject(s)
Protozoan Genome, Protozoan Proteins/genetics, DNA Sequence Analysis, Trypanosoma cruzi/genetics, Animals, Chagas Disease/drug therapy, Chagas Disease/parasitology, DNA Repair, DNA Replication, Mitochondrial DNA/genetics, Protozoan DNA/genetics, Protozoan Genes, Humans, Meiosis, Membrane Proteins/chemistry, Membrane Proteins/genetics, Membrane Proteins/physiology, Multigene Family, Protozoan Proteins/chemistry, Protozoan Proteins/physiology, Genetic Recombination, Repetitive Nucleic Acid Sequences, Retroelements, Signal Transduction, Telomere/genetics, Trypanocidal Agents/pharmacology, Trypanocidal Agents/therapeutic use, Trypanosoma cruzi/chemistry, Trypanosoma cruzi/physiology
ABSTRACT
Leishmania species cause a spectrum of human diseases in tropical and subtropical regions of the world. We have sequenced the 36 chromosomes of the 32.8-megabase haploid genome of Leishmania major (Friedlin strain) and predict 911 RNA genes, 39 pseudogenes, and 8272 protein-coding genes, of which 36% can be ascribed a putative function. These include genes involved in host-pathogen interactions, such as proteolytic enzymes, and extensive machinery for synthesis of complex surface glycoconjugates. The organization of protein-coding genes into long, strand-specific, polycistronic clusters and lack of general transcription factors in the L. major, Trypanosoma brucei, and Trypanosoma cruzi (Tritryp) genomes suggest that the mechanisms regulating RNA polymerase II-directed transcription are distinct from those operating in other eukaryotes, although the trypanosomatids appear capable of chromatin remodeling. Abundant RNA-binding proteins are encoded in the Tritryp genomes, consistent with active posttranscriptional regulation of gene expression.