ABSTRACT
The accurate identification and description of the genes in the human and mouse genomes is a fundamental requirement for high quality analysis of data informing both genome biology and clinical genomics. Over the last 15 years, the GENCODE consortium has been producing reference quality gene annotations to provide this foundational resource. The GENCODE consortium includes both experimental and computational biology groups who work together to improve and extend the GENCODE gene annotation. Specifically, we generate primary data, create bioinformatics tools and provide analysis to support the work of expert manual gene annotators and automated gene annotation pipelines. In addition, manual and computational annotation workflows use any and all publicly available data and analysis, along with the research literature to identify and characterise gene loci to the highest standard. GENCODE gene annotations are accessible via the Ensembl and UCSC Genome Browsers, the Ensembl FTP site, Ensembl Biomart, Ensembl Perl and REST APIs as well as https://www.gencodegenes.org.
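The Ensembl REST API listed among the access routes can be queried over plain HTTP. The following is a minimal sketch, assuming the public server at https://rest.ensembl.org and its `lookup/symbol` endpoint; the helper names `lookup_url` and `fetch_gene` and the example gene symbol are illustrative choices, not part of any GENCODE or Ensembl tooling.

```python
import json
from urllib.parse import quote
from urllib.request import urlopen

# Assumed public Ensembl REST server.
ENSEMBL_REST = "https://rest.ensembl.org"


def lookup_url(symbol: str, species: str = "homo_sapiens") -> str:
    """Build the URL for a gene lookup by symbol (hypothetical helper)."""
    return (
        f"{ENSEMBL_REST}/lookup/symbol/{quote(species)}/{quote(symbol)}"
        "?content-type=application/json"
    )


def fetch_gene(symbol: str, species: str = "homo_sapiens") -> dict:
    """Fetch the gene record as a dict. Requires network access."""
    with urlopen(lookup_url(symbol, species)) as response:
        return json.load(response)


if __name__ == "__main__":
    # Only the URL is printed here; call fetch_gene("BRCA2") to query the server.
    print(lookup_url("BRCA2"))
```

The URL builder is kept separate from the network call so that the request can be inspected or reused with another HTTP client.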
Subjects
Databases, Genetic , Genome, Human/genetics , Genomics , Pseudogenes/genetics , Animals , Computational Biology , Humans , Internet , Mice , Molecular Sequence Annotation , Software
ABSTRACT
INTRODUCTION: Caregivers for people with dementia face a number of challenges such as changing family relationships, social isolation, or financial difficulties. Internet usage and social media are increasingly being recognised as resources to increase support and general public health. OBJECTIVE: Using automated analysis, the aim of this study was to explore (i) the age and sex of people who post to the social media forum Reddit about dementia diagnoses, (ii) the affected person and their diagnosis, (iii) which subreddits authors are posting to, (iv) the types of messages posted, and (v) the content of these posts. METHODS: We analysed Reddit posts concerning dementia diagnoses and used a previously developed text analysis pipeline to determine attributes of the posts and their authors. The posts were further examined through manual annotation of the diagnosis provided and the person affected. Lastly, we investigated the communities posters engage with and assessed the contents of the posts with an automated topic gathering/clustering technique. RESULTS: Five hundred and thirty-five Reddit posts were identified as relevant and further processed. The majority of posters in our dataset are females and predominantly close relatives, such as parents and grandparents, are mentioned. The communities frequented and topics gathered reflect not only the person's diagnosis but also potential outcomes, for example hardships experienced by the caregiver or the requirement for legal support. CONCLUSIONS: This work demonstrates the value of social media data as a resource for in-depth examination of caregivers' experience after a dementia diagnosis. It is important to study groups actively posting online, both in topic-specific and general communities, as they are most likely to benefit from novel internet-based support systems or interventions.
Subjects
Caregivers/psychology , Dementia , Internet-Based Intervention/statistics & numerical data , Social Media/statistics & numerical data , Social Support , Dementia/diagnosis , Dementia/economics , Dementia/psychology , Family Relations , Financial Stress , Humans , Social Isolation
ABSTRACT
The transcriptome is the readout of the genome. Identifying common features in it across distant species can reveal fundamental principles. To this end, the ENCODE and modENCODE consortia have generated large amounts of matched RNA-sequencing data for human, worm and fly. Uniform processing and comprehensive annotation of these data allow comparison across metazoan phyla, extending beyond earlier within-phylum transcriptome comparisons and revealing ancient, conserved features. Specifically, we discover co-expression modules shared across animals, many of which are enriched in developmental genes. Moreover, we use expression patterns to align the stages in worm and fly development and find a novel pairing between worm embryo and fly pupae, in addition to the embryo-to-embryo and larvae-to-larvae pairings. Furthermore, we find that the extent of non-canonical, non-coding transcription is similar in each organism, per base pair. Finally, we find in all three organisms that the gene-expression levels, both coding and non-coding, can be quantitatively predicted from chromatin features at the promoter using a 'universal model' based on a single set of organism-independent parameters.
Subjects
Caenorhabditis elegans/genetics , Drosophila melanogaster/genetics , Gene Expression Profiling , Transcriptome/genetics , Animals , Caenorhabditis elegans/embryology , Caenorhabditis elegans/growth & development , Chromatin/genetics , Cluster Analysis , Drosophila melanogaster/growth & development , Gene Expression Regulation, Developmental/genetics , Histones/metabolism , Humans , Larva/genetics , Larva/growth & development , Models, Genetic , Molecular Sequence Annotation , Promoter Regions, Genetic/genetics , Pupa/genetics , Pupa/growth & development , RNA, Untranslated/genetics , Sequence Analysis, RNA
ABSTRACT
Ensembl (http://www.ensembl.org) creates tools and data resources to facilitate genomic analysis in chordate species with an emphasis on human, major vertebrate model organisms and farm animals. Over the past year we have increased the number of species that we support to 77 and expanded our genome browser with a new scrollable overview and improved variation and phenotype views. We also report updates to our core datasets and improvements to our gene homology relationships from the addition of new species. Our REST service has been extended with additional support for comparative genomics and ontology information. Finally, we provide updated information about our methods for data access and resources for user training.
Subjects
Databases, Genetic , Genomics , Animals , Chordata/genetics , Genetic Variation , Humans , Internet , Mice , Molecular Sequence Annotation , Phenotype , Rats
ABSTRACT
The Ensembl project (http://www.ensembl.org) provides genome information for sequenced chordate genomes with a particular focus on human, mouse, zebrafish and rat. Our resources include evidence-based gene sets for all supported species; large-scale whole genome multiple species alignments across vertebrates and clade-specific alignments for eutherian mammals, primates, birds and fish; variation data resources for 17 species and regulation annotations based on ENCODE and other data sets. Ensembl data are accessible through the genome browser at http://www.ensembl.org and through other tools and programmatic interfaces.
Subjects
Databases, Genetic , Genomics , Animals , Gene Expression Regulation , Genetic Variation , Humans , Internet , Mice , Molecular Sequence Annotation , Rats , Software , Zebrafish/genetics
ABSTRACT
Genome-wide association studies, DNA sequencing studies, and other genomic studies are finding an increasing number of genetic variants associated with clinical phenotypes that may be useful in developing diagnostic, preventive, and treatment strategies for individual patients. However, few variants have been integrated into routine clinical practice. The reasons for this are several, but two of the most significant are limited evidence about the clinical implications of the variants and a lack of a comprehensive knowledge base that captures genetic variants, their phenotypic associations, and other pertinent phenotypic information that is openly accessible to clinical groups attempting to interpret sequencing data. As the field of medicine begins to incorporate genome-scale analysis into clinical care, approaches need to be developed for collecting and characterizing data on the clinical implications of variants, developing consensus on their actionability, and making this information available for clinical use. The National Human Genome Research Institute (NHGRI) and the Wellcome Trust thus convened a workshop to consider the processes and resources needed to: (1) identify clinically valid genetic variants; (2) decide whether they are actionable and what the action should be; and (3) provide this information for clinical use. This commentary outlines the key discussion points and recommendations from the workshop.
Subjects
Genetic Variation/genetics , Medical Informatics/methods , Phenotype , Precision Medicine/methods , Education , Humans , Information Dissemination/methods , National Human Genome Research Institute (U.S.) , Precision Medicine/trends , United States
ABSTRACT
The Ensembl project (http://www.ensembl.org) provides genome resources for chordate genomes with a particular focus on human genome data as well as data for key model organisms such as mouse, rat and zebrafish. Five additional species were added in the last year including gibbon (Nomascus leucogenys) and Tasmanian devil (Sarcophilus harrisii) bringing the total number of supported species to 61 as of Ensembl release 64 (September 2011). Of these, 55 species appear on the main Ensembl website and six species are provided on the Ensembl preview site (Pre!Ensembl; http://pre.ensembl.org) with preliminary support. The past year has also seen improvements across the project.
Subjects
Databases, Genetic , Genomics , Animals , Gene Expression Regulation , Genetic Variation , Humans , Mice , Molecular Sequence Annotation , Rats
ABSTRACT
The Ensembl project (http://www.ensembl.org) seeks to enable genomic science by providing high quality, integrated annotation on chordate and selected eukaryotic genomes within a consistent and accessible infrastructure. All supported species include comprehensive, evidence-based gene annotations and a selected set of genomes includes additional data focused on variation, comparative, evolutionary, functional and regulatory annotation. The most advanced resources are provided for key species including human, mouse, rat and zebrafish reflecting the popularity and importance of these species in biomedical research. As of Ensembl release 59 (August 2010), 56 species are supported of which 5 have been added in the past year. Since our previous report, we have substantially improved the presentation and integration of both data of disease relevance and the regulatory state of different cell types.
Subjects
Databases, Genetic , Genomics , Animals , Genetic Variation , Humans , Mice , Molecular Sequence Annotation , Rats , Regulatory Sequences, Nucleic Acid , Software , Zebrafish/genetics
ABSTRACT
SUMMARY: Dalliance is a new genome viewer which offers a high level of interactivity while running within a web browser. All data is fetched using the established distributed annotation system (DAS) protocol, making it easy to customize the browser and add extra data. AVAILABILITY AND IMPLEMENTATION: Dalliance runs entirely within your web browser, and relies on existing DAS server infrastructure. Browsers for several mammalian genomes are available at http://www.biodalliance.org/, and the use of DAS means you can add your own data to these browsers. In addition, the source code (Javascript) is available under the BSD license, and is straightforward to install on your own web server and embed within other documents.
Subjects
Computational Biology/methods , Genomics/methods , Internet , Software , Computer Communication Networks , Computer Graphics , Genome , Programming Languages
ABSTRACT
MOTIVATION: Short sequence motifs are an important class of models in molecular biology, used most commonly for describing transcription factor binding site specificity patterns. High-throughput methods have recently been developed for detecting regulatory factor binding sites in vivo and in vitro, and consequently high-quality binding site motif data are becoming available for an increasing number of organisms and regulatory factors. Development of intuitive tools for the study of sequence motifs is therefore important. iMotifs is a graphical motif analysis environment that allows visualization of annotated sequence motifs and scored motif hits in sequences. It also offers motif inference with the sensitive NestedMICA algorithm, as well as overrepresentation and pairwise motif matching capabilities. All of the analysis functionality is provided without the need to convert between file formats or learn different command line interfaces. The application includes a bundled and graphically integrated version of the NestedMICA motif inference suite that has no outside dependencies. Problems associated with local deployment of software are therefore avoided. AVAILABILITY: iMotifs is licensed under the GNU Lesser General Public License v2.0 (LGPL 2.0). The software and its source code are available at http://wiki.github.com/mz2/imotifs and can be run on Mac OS X Leopard (Intel/PowerPC). We also provide a cross-platform (Linux, OS X, Windows) LGPL 2.0 licensed library libxms for the Perl, Ruby, R and Objective-C programming languages for input and output of XMS formatted annotated sequence motif set files. CONTACT: matias.piipari@gmail.com; imotifs@googlegroups.com.
Subjects
Computational Biology/methods , Sequence Analysis, DNA/methods , Software , Algorithms , Base Sequence , Binding Sites , Molecular Sequence Data , Sequence Alignment , User-Computer Interface
ABSTRACT
The Structural Classification of Proteins (SCOP) database is a comprehensive ordering of all proteins of known structure, according to their evolutionary and structural relationships. The SCOP hierarchy comprises the following levels: Species, Protein, Family, Superfamily, Fold and Class. While keeping the original classification scheme intact, we have changed the production of SCOP in order to cope with a rapid growth of new structural data and to facilitate the discovery of new protein relationships. We describe ongoing developments and new features implemented in SCOP. A new update protocol supports batch classification of new protein structures by their detected relationships at Family and Superfamily levels in contrast to our previous sequential handling of new structural data by release date. We introduce pre-SCOP, a preview of the SCOP developmental version that enables earlier access to the information on new relationships. We also discuss the impact of worldwide Structural Genomics initiatives, which are producing new protein structures at an increasing rate, on the rates of discovery and growth of protein families and superfamilies. SCOP can be accessed at http://scop.mrc-lmb.cam.ac.uk/scop.
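The six-level hierarchy described above can be pictured as an ordered list of ranks, with each classified domain carrying one label per rank. The sketch below is only an illustration of that idea: the function `deepest_shared_level` and the example records are assumptions made here, not part of SCOP or its distribution files.

```python
# SCOP ranks ordered from the broadest (Class) to the most specific (Species).
SCOP_LEVELS = ("Class", "Fold", "Superfamily", "Family", "Protein", "Species")


def deepest_shared_level(a: dict, b: dict):
    """Return the most specific rank at which two classified domains still
    carry the same label, or None if they already differ at Class."""
    shared = None
    for level in SCOP_LEVELS:
        if a[level] != b[level]:
            break
        shared = level
    return shared


# Illustrative records; labels are examples chosen for this sketch.
myoglobin = {
    "Class": "All alpha proteins", "Fold": "Globin-like",
    "Superfamily": "Globin-like", "Family": "Globins",
    "Protein": "Myoglobin", "Species": "Sperm whale",
}
hemoglobin = {
    "Class": "All alpha proteins", "Fold": "Globin-like",
    "Superfamily": "Globin-like", "Family": "Globins",
    "Protein": "Hemoglobin, alpha-chain", "Species": "Human",
}

if __name__ == "__main__":
    print(deepest_shared_level(myoglobin, hemoglobin))
```

Under this representation, a batch update that groups new structures by detected Family or Superfamily relationships amounts to grouping records on the corresponding key.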
Subjects
Databases, Protein , Protein Structure, Tertiary , Proteins/classification , Databases, Protein/trends , Evolution, Molecular , Genomics , Internet , Proteins/genetics
ABSTRACT
MOTIVATION: The advent of sequencing and structural genomics projects has provided a dramatic boost in the number of uncharacterized protein structures and sequences. Consequently, many computational tools have been developed to help elucidate protein function. However, such services are spread throughout the world, often with standalone web pages. Integration of these methods is needed and so far this has not been possible as there was no common vocabulary available that could be used as a standard language. RESULTS: The Protein Feature Ontology has been developed to provide a structured controlled vocabulary for features on a protein sequence or structure and comprises approximately 100 positional terms, now integrated into the Sequence Ontology (SO) and 40 non-positional terms which describe features relating to the whole-protein sequence. In addition, post-translational modifications are described by using a pre-existing ontology, the Protein Modification Ontology (MOD). This ontology is being used to integrate over 150 distinct annotations provided by the BioSapiens Network of Excellence, a consortium comprising 19 partner sites in Europe. AVAILABILITY: The Protein Feature Ontology can be browsed by accessing the ontology lookup service at the European Bioinformatics Institute (http://www.ebi.ac.uk/ontology-lookup/browse.do?ontName=BS).
Subjects
Computational Biology/methods , Proteins/chemistry , Software , Vocabulary, Controlled , Databases, Protein , Internet , Proteins/metabolism , Proteome/genetics
ABSTRACT
With the increasing amount of structural data, the number of homologous protein structures bearing topological irregularities is steadily growing. These include proteins with circular permutations, segment-swapping, context-dependent folding or chameleon sequences that can adopt alternative secondary structures. Their non-trivial structural relationships are readily identified during expert analysis but their automatic identification using the existing computational tools still remains difficult or impossible. Such non-trivial cases of protein relationships are known to pose a problem to multiple alignment algorithms and to impede comparative modeling studies. They support a new emerging concept of an evolutionarily changeable protein fold, which creates practical difficulties for the hierarchical classifications of protein structures. To facilitate the understanding of, and to provide a comprehensive annotation of, proteins with such non-trivial structural relationships we have created SISYPHUS (Σίσυφος, "crafty" in Greek), a compendium to the SCOP database. The SISYPHUS database contains a collection of manually curated structural alignments and their inter-relationships. The multiple alignments are constructed for protein structural regions that range from oligomeric biological units, or individual domains, to fragments of different size. The SISYPHUS multiple alignments are displayed with SPICE, a browser that provides an integrated view of protein sequences, structures and their annotations. The database is available from http://sisyphus.mrc-cpe.cam.ac.uk.
Subjects
Databases, Protein , Structural Homology, Protein , Internet , Models, Molecular , Protein Folding , User-Computer Interface
ABSTRACT
BACKGROUND: The Distributed Annotation System (DAS) is a widely adopted protocol for dynamically integrating a wide range of biological data from geographically diverse sources. DAS continues to expand its applicability and evolve in response to new challenges facing integrative bioinformatics. RESULTS: Here we describe the various infrastructure components of DAS and present a new extended version of the DAS specification. Version 1.53E incorporates several recent developments, including its extension to serve new data types and an ontology for protein features. CONCLUSION: Our extensions to the DAS protocol have facilitated the integration of new data types, and our improvements to the existing DAS infrastructure have addressed recent challenges. The steadily increasing number of available data sources demonstrates further adoption of the DAS protocol.
Subjects
Database Management Systems , Databases, Genetic , Information Storage and Retrieval/methods , Computational Biology/methods , Systems Integration
ABSTRACT
A key step in understanding gene regulation is to identify the repertoire of transcription factor binding motifs (TFBMs) that form the building blocks of promoters and other regulatory elements. Identifying these experimentally is very laborious, and the number of TFBMs discovered remains relatively small, especially when compared with the hundreds of transcription factor genes predicted in metazoan genomes. We have used a recently developed statistical motif discovery approach, NestedMICA, to detect candidate TFBMs from a large set of Drosophila melanogaster promoter regions. Of the 120 motifs inferred in our initial analysis, 25 were statistically significant matches to previously reported motifs, while 87 appeared to be novel. Analysis of sequence conservation and motif positioning suggested that the great majority of these discovered motifs are predictive of functional elements in the genome. Many motifs showed associations with specific patterns of gene expression in the D. melanogaster embryo, and we were able to obtain confident annotation of expression patterns for 25 of our motifs, including eight of the novel motifs. The motifs are available through Tiffin, a new database of DNA sequence motifs. We have discovered many new motifs that are overrepresented in D. melanogaster promoter regions, and offer several independent lines of evidence that these are novel TFBMs. Our motif dictionary provides a solid foundation for further investigation of regulatory elements in Drosophila, and demonstrates techniques that should be applicable in other species. We suggest that further improvements in computational motif discovery should narrow the gap between the set of known motifs and the total number of transcription factors in metazoan genomes.
Subjects
Chromosome Mapping/methods , Drosophila Proteins/genetics , Drosophila melanogaster/genetics , Promoter Regions, Genetic/genetics , Sequence Analysis, DNA/methods , Transcription Factors/genetics , Amino Acid Motifs , Animals , Base Sequence , Binding Sites , Drosophila melanogaster/embryology , Molecular Sequence Data , Protein Binding
ABSTRACT
BACKGROUND: The Distributed Annotation System (DAS) is a network protocol for exchanging biological data. It is frequently used to share annotations of genomes and protein sequences. RESULTS: Here we present several extensions to the current DAS 1.5 protocol. These provide new commands to share alignments and three-dimensional molecular structure data, add the possibility of registering and discovering DAS servers, and establish a convention for providing different types of data plots. We present examples of web sites and applications that use the new extensions. We operate a public registry of DAS sources, which now includes entries for more than 250 distinct sources. CONCLUSION: Our DAS extensions are essential for the management of the growing number of services and exchange of diverse biological data sets. In addition, the extensions allow new types of applications to be developed and scientific questions to be addressed. The registry of DAS sources is available at http://www.dasregistry.org.
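DAS commands are plain HTTP requests, so a client mostly has to assemble the right URL. The following is a minimal sketch of building a request for the core DAS 1.5 `features` command over a sequence segment; the server address and source name are hypothetical placeholders, and the function `das_features_url` is an illustrative helper, not part of any DAS client library.

```python
def das_features_url(server: str, source: str, reference: str,
                     start: int, stop: int) -> str:
    """Build a DAS `features` request for one segment.

    The segment is written as reference:start,stop with 1-based,
    inclusive coordinates.
    """
    if start < 1 or stop < start:
        raise ValueError("expected 1 <= start <= stop")
    base = server.rstrip("/")
    return f"{base}/das/{source}/features?segment={reference}:{start},{stop}"


if __name__ == "__main__":
    # Hypothetical server and data source names, for illustration only.
    print(das_features_url("http://das.example.org", "my_source", "1", 1000, 2000))
```

The server answers such a request with an XML document describing the features in the segment; parsing that response is left out of this sketch.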
Subjects
Computational Biology/methods , Database Management Systems , Databases, Genetic , Information Storage and Retrieval/methods , Internet , Sequence Analysis/methods , User-Computer Interface , Algorithms , Chromosome Mapping/methods , Systems Integration
ABSTRACT
NestedMICA is a new, scalable, pattern-discovery system for finding transcription factor binding sites and similar motifs in biological sequences. Like several previous methods, NestedMICA tackles this problem by optimizing a probabilistic mixture model to fit a set of sequences. However, the use of a newly developed inference strategy called Nested Sampling means NestedMICA is able to find optimal solutions without the need for a problematic initialization or seeding step. We investigate the performance of NestedMICA in a range of scenarios, on synthetic data and a well-characterized set of muscle regulatory regions, and compare it with the popular MEME program. We show that the new method is significantly more sensitive than MEME: in one case, it successfully extracted a target motif from background sequence four times longer than could be handled by the existing program. It also performs robustly on synthetic sequences containing multiple significant motifs. When tested on a real set of regulatory sequences, NestedMICA produced motifs which were good predictors for all five abundant classes of annotated binding sites.
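To make concrete what it means for a motif to predict binding sites, the sketch below scans a sequence with a position weight matrix and reports the best-scoring window. This is a generic log-odds scan, not the NestedMICA inference algorithm or its Nested Sampling strategy; the uniform background, the pseudocount and the toy count matrix are assumptions made for illustration.

```python
import math

BASES = "ACGT"


def log_odds_matrix(counts, background=0.25, pseudocount=1.0):
    """Turn per-column base counts into log2 odds against a uniform background."""
    matrix = []
    for column in counts:
        total = sum(column.get(b, 0) for b in BASES) + 4 * pseudocount
        matrix.append({
            b: math.log2(((column.get(b, 0) + pseudocount) / total) / background)
            for b in BASES
        })
    return matrix


def best_hit(sequence, matrix):
    """Return (offset, score) of the highest-scoring window, or None if the
    sequence is shorter than the motif. Assumes the sequence contains only A, C, G, T."""
    width = len(matrix)
    best = None
    for i in range(len(sequence) - width + 1):
        window = sequence[i:i + width]
        score = sum(col[base] for col, base in zip(matrix, window))
        if best is None or score > best[1]:
            best = (i, score)
    return best


# Toy count matrix for a TATA-like motif (illustrative, not real data).
toy_counts = [{"T": 10}, {"A": 10}, {"T": 10}, {"A": 10}]

if __name__ == "__main__":
    print(best_hit("GGGTATAGGG", log_odds_matrix(toy_counts)))
```

In practice the matrix would come from a motif inference tool and the hits would be compared with annotated binding sites.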
Subjects
Computational Biology/methods , Regulatory Sequences, Nucleic Acid , Sequence Analysis, DNA/methods , Software , Transcription Factors/metabolism , Base Sequence , Bayes Theorem , Binding Sites , Humans , Markov Chains , Myogenic Regulatory Factors/metabolism
ABSTRACT
The UK government has recently recognised the need to improve mental health services in the country. Electronic health records provide a rich source of patient data which could help policymakers to better understand the needs of service users. The main objective of this study is to unveil statistics of diagnoses recorded in the Case Register of the South London and Maudsley NHS Foundation Trust, one of the largest mental health providers in the UK and Europe, serving a source population of over 1.2 million people residing in south London. Based on over 500,000 diagnoses recorded in ICD10 codes for a cohort of approximately 200,000 mental health patients, we established the frequency rate of each diagnosis (the ratio of the number of patients for whom a diagnosis has ever been recorded to the number of patients in the entire population who have made contact with mental disorders). We also investigated differences in diagnosis prevalence between subgroups of patients stratified by gender and ethnicity. The most common diagnoses in the considered population were (recurrent) depression (ICD10 codes F32-33; 16.4% of patients), reaction to severe stress and adjustment disorders (F43; 7.1%), mental/behavioural disorders due to use of alcohol (F10; 6.9%), and schizophrenia (F20; 5.6%). We also found many diagnoses which were more likely to be recorded in patients of a certain gender or ethnicity. For example, mood (affective) disorders (F31-F39); neurotic, stress-related and somatoform disorders (F40-F48, except F42); and eating disorders (F50) were more likely to be found in records of female patients, while males were more likely to be diagnosed with mental/behavioural disorders due to psychoactive substance use (F10-F19). Furthermore, mental/behavioural disorders due to use of alcohol and opioids were more likely to be recorded in patients of white ethnicity, and disorders due to use of cannabinoids in those of black ethnicity.
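The frequency rate defined above counts each patient at most once per diagnosis, however many times that diagnosis was recorded. A minimal sketch of the calculation follows, assuming the input is a list of (patient, ICD10 code) pairs and that every cohort member appears in it at least once; the function name and the toy records are illustrative and do not come from the Case Register.

```python
from collections import defaultdict


def frequency_rates(records):
    """Map each diagnosis code to the share of patients who ever received it.

    `records` is an iterable of (patient_id, icd10_code) pairs; repeated
    recordings of the same code for one patient are counted once.
    """
    patients = set()
    patients_by_code = defaultdict(set)
    for patient_id, code in records:
        patients.add(patient_id)
        patients_by_code[code].add(patient_id)
    total = len(patients)
    return {code: len(ids) / total for code, ids in patients_by_code.items()}


# Toy data: four patients, one of them with F32 recorded twice.
toy_records = [
    ("p1", "F32"), ("p1", "F32"),
    ("p2", "F32"), ("p2", "F10"),
    ("p3", "F20"),
    ("p4", "F43"),
]

if __name__ == "__main__":
    for code, rate in sorted(frequency_rates(toy_records).items()):
        print(code, f"{rate:.1%}")
```

Because one patient can hold several diagnoses, the rates need not sum to 100%. Stratifying by gender or ethnicity would apply the same calculation within each subgroup.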
Subjects
Electronic Health Records/statistics & numerical data , Mental Disorders/diagnosis , Mental Health/statistics & numerical data , Registries/statistics & numerical data , Female , Humans , Male
ABSTRACT
Neurodegenerative disorders such as Parkinson's and Alzheimer's disease are devastating and costly illnesses, a source of major global burden. In order to provide successful interventions for patients and reduce costs, both causes and pathological processes need to be understood. The ApiNATOMY project aims to contribute to our understanding of neurodegenerative disorders by manually curating and abstracting data from the vast body of literature amassed on these illnesses. As curation is labour-intensive, we aimed to speed up the process by automatically highlighting those parts of the PDF document of primary importance to the curator. Using techniques similar to those of summarisation, we developed an algorithm that relies on linguistic, semantic and spatial features. Employing this algorithm on a test set manually corrected for tool imprecision, we achieved a macro F1-measure of 0.51, which is an increase of 132% compared to the best bag-of-words baseline model. A user-based evaluation was also conducted to assess the usefulness of the methodology on 40 unseen publications, which revealed that in 85% of cases all highlighted sentences are relevant to the curation task and in about 65% of cases the highlights are sufficient to support the knowledge curation task without needing to consult the full text. In conclusion, we believe that these are promising results for a step in automating the recognition of curation-relevant sentences. Refining our approach to pre-digest papers will lead to faster processing and cost reduction in the curation process. Database URL: https://github.com/KHP-Informatics/NapEasy.
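The macro F1-measure averages the per-class F1 scores with equal weight, so the rarer "highlight" class counts as much as the majority class. The sketch below shows the calculation on toy labels and also works out the baseline score implied by the reported 132% increase. The label names and toy data are assumptions, and the implied baseline is arithmetic derived here from the two reported numbers, not a figure stated in the abstract.

```python
def f1(tp: int, fp: int, fn: int) -> float:
    """F1 from true positives, false positives and false negatives."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def macro_f1(gold, predicted) -> float:
    """Unweighted mean of per-class F1 over all labels seen in either list."""
    labels = sorted(set(gold) | set(predicted))
    scores = []
    for label in labels:
        tp = sum(g == label and p == label for g, p in zip(gold, predicted))
        fp = sum(g != label and p == label for g, p in zip(gold, predicted))
        fn = sum(g == label and p != label for g, p in zip(gold, predicted))
        scores.append(f1(tp, fp, fn))
    return sum(scores) / len(scores)


# Toy sentence labels: "hl" = should be highlighted, "no" = should not.
gold = ["hl", "hl", "no", "no", "no", "no"]
pred = ["hl", "no", "no", "no", "no", "hl"]

# Baseline implied by the abstract: 0.51 is 132% above it.
reported_f1 = 0.51
implied_baseline = reported_f1 / (1 + 1.32)

if __name__ == "__main__":
    print(f"macro F1 on toy labels: {macro_f1(gold, pred):.3f}")
    print(f"implied baseline macro F1: {implied_baseline:.2f}")
```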
Subjects
Alzheimer Disease , Data Curation/methods , Data Mining/methods , Parkinson Disease , Alzheimer Disease/genetics , Alzheimer Disease/metabolism , Animals , Data Curation/standards , Data Mining/standards , Humans , Parkinson Disease/genetics , Parkinson Disease/metabolism
ABSTRACT
The number of people affected by mental illness is on the increase and with it the burden on health and social care use, as well as the loss of both productivity and quality-adjusted life-years. Natural language processing of electronic health records is increasingly used to study mental health conditions and risk behaviours on a large scale. However, narrative notes written by clinicians do not capture first-hand the patients' own experiences, and only record cross-sectional, professional impressions at the point of care. Social media platforms have become a source of 'in the moment' daily exchange, with topics including well-being and mental health. In this study, we analysed posts from the social media platform Reddit and developed classifiers to recognise and classify posts related to mental illness according to 11 disorder themes. Using a neural network and deep learning approach, we could automatically recognise mental illness-related posts in our balanced dataset with an accuracy of 91.08% and select the correct theme with a weighted average accuracy of 71.37%. We believe that these results are a first step in developing methods to characterise large amounts of user-generated content that could support content curation and targeted interventions.