ABSTRACT
Chemical substructure search is a critical task in medicinal chemistry and small-molecule drug discovery, enabling the retrieval of molecules from databases based on specific chemical features. While systems exist for this purpose, the challenge of efficient and swift searching persists, particularly as data storage migrates to the cloud, introducing new complexities. This study provides a comprehensive analysis of chemical substructure searches, showcasing the benefits of graphics processing unit-accelerated fingerprint screening. The research highlights strategies for optimizing performance, making significant advancements in substructure searching, a pivotal aspect of drug discovery and molecular research. The accessible and scalable nature of the proposed approach makes it a valuable resource for scientists aiming to enhance their substructure search capabilities.
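The fingerprint screening step mentioned above can be illustrated with a minimal CPU-only sketch. It is not the GPU implementation described in the study: the hashing scheme, the feature strings and the names `fingerprint` and `screen` are assumptions made for illustration. The idea shown is the general one: a molecule can contain the query substructure only if every bit set in the query fingerprint is also set in the molecule fingerprint, so a cheap bitwise test discards most of the database before any exact matching is attempted.

```python
from typing import Dict, List


def fingerprint(features: List[str], n_bits: int = 64) -> int:
    """Fold feature strings into an n_bits-wide bitmask (toy hashing scheme)."""
    fp = 0
    for feature in features:
        # Deterministic toy hash: sum of character codes modulo the bit width.
        fp |= 1 << (sum(ord(c) for c in feature) % n_bits)
    return fp


def screen(query_fp: int, library: Dict[str, int]) -> List[str]:
    """Keep molecules whose fingerprint contains every bit set in the query."""
    return [name for name, fp in library.items() if query_fp & fp == query_fp]


# Hypothetical molecules described by hypothetical feature strings.
library = {
    "mol_a": fingerprint(["c1ccccc1", "C=O", "N"]),
    "mol_b": fingerprint(["c1ccccc1", "Cl"]),
    "mol_c": fingerprint(["C=O", "N"]),
}
query = fingerprint(["C=O", "N"])
candidates = screen(query, library)
print(candidates)
```

Molecules that pass the screen are only candidates. Because different features can hash to the same bit, each survivor still has to be confirmed by an exact substructure match.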
Subjects
Computer Graphics , Drug Discovery , Drug Discovery/methods , Databases, Chemical , Software
ABSTRACT
BACKGROUND: Data integration to build a biomedical knowledge graph is a challenging task. Data sources and publications use multiple disease ontologies, each with its own hierarchy. A common task is to map between ontologies, find disease clusters and finally build a representation of the chosen disease area. There is a shortage of published resources and tools to facilitate interactive, efficient and flexible cross-referencing and analysis of the multiple disease ontologies commonly found in data sources and research. RESULTS: Our results are represented as a knowledge graph solution that uses disease ontology cross-references and facilitates switching between ontology hierarchies for data integration and other tasks. CONCLUSIONS: Grakn core with pre-installed "Disease ontologies for knowledge graphs" facilitates building a biomedical knowledge graph and provides an elegant solution to the multiple disease ontologies problem.
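The use of cross-references to find disease clusters across ontologies can be sketched in a few lines. This is an illustration only and does not use Grakn or the authors' schema: the identifiers (`ONT_A:001` and so on) are placeholders, and `cluster_by_xrefs` is a hypothetical helper. It treats each cross-reference as an edge and groups terms into connected components with a union-find structure.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Set, Tuple


def cluster_by_xrefs(xrefs: Iterable[Tuple[str, str]]) -> List[Set[str]]:
    """Group ontology terms that are linked, directly or transitively, by xrefs."""
    parent: Dict[str, str] = {}

    def find(term: str) -> str:
        # Register unseen terms as their own root, then walk up with path halving.
        parent.setdefault(term, term)
        while parent[term] != term:
            parent[term] = parent[parent[term]]
            term = parent[term]
        return term

    for left, right in xrefs:
        root_left, root_right = find(left), find(right)
        if root_left != root_right:
            parent[root_left] = root_right

    groups: Dict[str, Set[str]] = defaultdict(set)
    for term in list(parent):
        groups[find(term)].add(term)
    # Sort clusters by their sorted members so the output order is stable.
    return sorted(groups.values(), key=lambda members: sorted(members))


# Placeholder cross-references between three hypothetical ontologies.
xrefs = [
    ("ONT_A:001", "ONT_B:17"),
    ("ONT_B:17", "ONT_C:9"),
    ("ONT_A:002", "ONT_C:4"),
]
clusters = cluster_by_xrefs(xrefs)
print(clusters)
```

Transitive grouping is a deliberate simplification: real cross-references vary in precision, so a production mapping would probably need to distinguish exact from broader or narrower matches.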
Subjects
Biological Ontologies , Ethnicity , Humans , Information Storage and Retrieval , Knowledge , Pattern Recognition, Automated
ABSTRACT
Genome sequencing projects are discovering millions of genetic variants in humans, and interpretation of their functional effects is essential for understanding the genetic basis of variation in human traits. Here we report sequencing and deep analysis of messenger RNA and microRNA from lymphoblastoid cell lines of 462 individuals from the 1000 Genomes Project, the first uniformly processed high-throughput RNA-sequencing data from multiple human populations with high-quality genome sequences. We discover extremely widespread genetic variation affecting the regulation of most genes, with transcript structure and expression level variation being equally common but genetically largely independent. Our characterization of causal regulatory variation sheds light on the cellular mechanisms of regulatory and loss-of-function variation, and allows us to infer putative causal variants for dozens of disease-associated loci. Altogether, this study provides a deep understanding of the cellular mechanisms of transcriptome variation and of the landscape of functional variants in the human genome.
Subjects
Genetic Variation/genetics , Genome, Human/genetics , High-Throughput Nucleotide Sequencing , Sequence Analysis, RNA , Transcriptome/genetics , Alleles , Cell Line, Transformed , Exons/genetics , Gene Expression Profiling , Humans , Polymorphism, Single Nucleotide/genetics , Quantitative Trait Loci/genetics , RNA, Messenger/analysis , RNA, Messenger/genetics
ABSTRACT
The Animal Research: Reporting of In Vivo Experiments (ARRIVE) guidelines were developed to address the lack of reproducibility in biomedical animal studies and improve the communication of research findings. While intended to guide the preparation of peer-reviewed manuscripts, the principles of transparent reporting are also fundamental for in vivo databases. Here, we describe the benefits and challenges of applying the guidelines for the International Mouse Phenotyping Consortium (IMPC), whose goal is to produce and phenotype 20,000 knockout mouse strains in a reproducible manner across ten research centres. In addition to ensuring the transparency and reproducibility of the IMPC, the solutions to the challenges of applying the ARRIVE guidelines in the context of IMPC will provide a resource to help guide similar initiatives in the future.
Subjects
Animal Experimentation/standards , Databases as Topic , Guidelines as Topic , Phenotype , Animals , Mice
ABSTRACT
The ArrayExpress Archive of Functional Genomics Data (http://www.ebi.ac.uk/arrayexpress) is an international functional genomics database at the European Bioinformatics Institute (EMBL-EBI) recommended by most journals as a repository for data supporting peer-reviewed publications. It contains data from over 7000 public sequencing and 42,000 array-based studies comprising over 1.5 million assays in total. The proportion of sequencing-based submissions has grown significantly over the last few years and has doubled in the last 18 months, whilst the rate of microarray submissions is growing slightly. All data in ArrayExpress are available in the MAGE-TAB format, which allows robust linking to data analysis and visualization tools and standardized analysis. The main development over the last two years has been the release of a new data submission tool Annotare, which has reduced the average submission time almost 3-fold. In the near future, Annotare will become the only submission route into ArrayExpress, alongside MAGE-TAB format-based pipelines. ArrayExpress is a stable and highly accessed resource. Our future tasks include automation of data flows and further integration with other EMBL-EBI resources for the representation of multi-omics data.
Subjects
Databases, Genetic , Gene Expression Profiling , Oligonucleotide Array Sequence Analysis , Genomics , High-Throughput Nucleotide Sequencing , Internet , Software
ABSTRACT
The International Mouse Phenotyping Consortium (IMPC) web portal (http://www.mousephenotype.org) provides the biomedical community with a unified point of access to mutant mice and a rich collection of related emerging and existing mouse phenotype data. IMPC mouse clinics worldwide follow rigorous, highly structured and standardized protocols for the experimentation, collection and dissemination of data. Dedicated 'data wranglers' work with each phenotyping center to collate the data and perform quality control. An automated statistical analysis pipeline has been developed to identify knockout strains with a significant change in the phenotype parameters. Annotation with biomedical ontologies allows biologists and clinicians to easily find mouse strains with phenotypic traits relevant to their research. Data integration with other resources will provide insights into mammalian gene function and human disease. As phenotype data become available for every gene in the mouse, the IMPC web portal will become an invaluable tool for researchers studying the genetic contributions to human diseases.
Subjects
Databases, Genetic , Mice, Knockout , Phenotype , Animals , Biological Ontologies , Internet , Mice
ABSTRACT
The ArrayExpress Archive of Functional Genomics Data (http://www.ebi.ac.uk/arrayexpress) is one of three international functional genomics public data repositories, alongside the Gene Expression Omnibus at NCBI and the DDBJ Omics Archive, supporting peer-reviewed publications. It accepts data generated by sequencing or array-based technologies and currently contains data from almost a million assays, from over 30 000 experiments. The proportion of sequencing-based submissions has grown significantly over the last 2 years and has reached, in 2012, 15% of all new data. All data are available from ArrayExpress in MAGE-TAB format, which allows robust linking to data analysis and visualization tools, including Bioconductor and GenomeSpace. Additionally, R objects, for microarray data, and binary alignment format files, for sequencing data, have been generated for a significant proportion of ArrayExpress data.
Subjects
Databases, Genetic , Genomics , Microarray Analysis , Databases, Genetic/statistics & numerical data , Databases, Genetic/trends , High-Throughput Nucleotide Sequencing , Internet , Software , User-Computer Interface
ABSTRACT
Gene Expression Atlas (http://www.ebi.ac.uk/gxa) is an added-value database providing information about gene expression in different cell types, organism parts, developmental stages, disease states, sample treatments and other biological/experimental conditions. The content of this database derives from curation, re-annotation and statistical analysis of selected data from the ArrayExpress Archive and the European Nucleotide Archive. A simple interface allows the user to query for differential gene expression either by gene names or attributes or by biological conditions, e.g. diseases, organism parts or cell types. Since our previous report we have made 20 monthly releases; as of Release 11.08 (August 2011), the database supports 19 species and contains expression data measured for 19,014 biological conditions in 136,551 assays from 5598 independent studies.
Subjects
Databases, Genetic , Gene Expression Profiling , Oligonucleotide Array Sequence Analysis , Atlases as Topic , Genomics , Humans , MicroRNAs/metabolism , Molecular Sequence Annotation , Sequence Analysis, RNA , User-Computer Interface
ABSTRACT
Machine Learning (ML) and Artificial Intelligence (AI) have become an integral part of the drug discovery and development value chain. Many teams in the pharmaceutical industry nevertheless report the challenges associated with the timely, cost effective and meaningful delivery of ML and AI powered solutions for their scientists. We sought to better understand what these challenges were and how to overcome them by performing an industry wide assessment of the practices in AI and Machine Learning. Here we report results of the systematic business analysis of the personas in the modern pharmaceutical discovery enterprise in relation to their work with the AI and ML technologies. We identify 23 common business problems that individuals in these roles face when they encounter AI and ML technologies at work, and describe best practices (Good Machine Learning Practices) that address these issues.
Subjects
Drug Discovery , Drug Industry , Machine Learning , Humans , Artificial Intelligence
ABSTRACT
MOTIVATION: Spreadsheet-like tabular formats are ever more popular in the biomedical field as a means of experimental reporting. The problem of converting the graph of an experimental workflow into a table-based representation occurs in many such formats and is not easy to solve. RESULTS: We describe graph2tab, a library that implements methods to realise such a conversion in a size-optimised way. Our solution is generic and can be adapted to specific cases of data exporters or data converters that need to be implemented. AVAILABILITY AND IMPLEMENTATION: The library source code and documentation are available at http://github.com/ISA-tools/graph2tab.
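To make the graph-to-table problem concrete, the sketch below flattens a small workflow graph into rows by listing every path from a source node to a terminal node. It is a naive baseline, not the graph2tab algorithm or its API: it makes no attempt at the size optimisation the abstract describes, and the node names and the `paths_to_rows` function are hypothetical.

```python
from typing import Dict, List


def paths_to_rows(graph: Dict[str, List[str]]) -> List[List[str]]:
    """Emit one table row per source-to-sink path of an acyclic workflow graph."""
    targets = {node for successors in graph.values() for node in successors}
    sources = [node for node in graph if node not in targets]
    rows: List[List[str]] = []

    def walk(node: str, prefix: List[str]) -> None:
        path = prefix + [node]
        successors = graph.get(node, [])
        if not successors:
            # A node with no outgoing edges ends the row.
            rows.append(path)
        for successor in successors:
            walk(successor, path)

    for source in sources:
        walk(source, [])
    return rows


# Hypothetical workflow: two samples pooled into one extract, assayed twice.
workflow = {
    "sample_1": ["extract_1"],
    "sample_2": ["extract_1"],
    "extract_1": ["assay_1", "assay_2"],
}
rows = paths_to_rows(workflow)
for row in rows:
    print("\t".join(row))
```

Two samples and two assays already produce four rows, each repeating the shared extract. This multiplication of rows at every branching point is the growth that a size-optimised conversion aims to limit.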
Subjects
Computer Graphics , Programming Languages , Workflow , Computational Biology/methods , Databases, Factual , Oligonucleotide Array Sequence Analysis
ABSTRACT
The ArrayExpress Archive (http://www.ebi.ac.uk/arrayexpress) is one of the three international public repositories of functional genomics data supporting publications. It includes data generated by sequencing or array-based technologies. Data are submitted by users and imported directly from the NCBI Gene Expression Omnibus. The ArrayExpress Archive is closely integrated with the Gene Expression Atlas and the sequence databases at the European Bioinformatics Institute. Advanced queries provided via ontology enabled interfaces include queries based on technology and sample attributes such as disease, cell types and anatomy.
Subjects
Databases, Genetic , Gene Expression Profiling , Genomics , High-Throughput Nucleotide Sequencing , Oligonucleotide Array Sequence Analysis , Gene Expression
ABSTRACT
MOTIVATION: There exist few simple and easily accessible methods to integrate ontologies programmatically in the R environment. We present ontoCAT, an R package to access ontologies in widely used standard formats, stored locally in the filesystem or available online. The ontoCAT package supports a number of traversal and search functions on a single ontology, as well as searching for ontology terms across multiple ontologies and in major ontology repositories. AVAILABILITY: The package and sources are freely available in Bioconductor starting from version 2.8: http://bioconductor.org/help/bioc-views/release/bioc/html/ontoCAT.html or via the OntoCAT website http://www.ontocat.org/wiki/r. CONTACT: natalja@ebi.ac.uk.
Subjects
Software , Vocabulary, Controlled , Terminology as Topic
ABSTRACT
BACKGROUND: Ontologies have become an essential asset in the bioinformatics toolbox and a number of ontology access resources are now available, for example, the EBI Ontology Lookup Service (OLS) and the NCBO BioPortal. However, these resources differ substantially in mode, ease of access, and ontology content. This makes it relatively difficult to access each ontology source separately and map its contents to research data, and much of this effort is being replicated across different research groups. RESULTS: OntoCAT provides a seamless programming interface to query heterogeneous ontology resources including OLS and BioPortal, as well as user-specified local OWL and OBO files. Each resource is wrapped behind easy-to-learn Java, Bioconductor/R and REST web service commands enabling reuse and integration of ontology software efforts despite variation in technologies. It is also available as a stand-alone MOLGENIS database and a Google App Engine application. CONCLUSIONS: OntoCAT provides a robust, configurable solution for accessing ontology terms specified locally and from remote services, is available as a stand-alone tool and has been tested thoroughly in the ArrayExpress, MOLGENIS, EFO and Gen2Phen phenotype use cases. AVAILABILITY: http://www.ontocat.org.
Subjects
Computational Biology/methods , Software , Vocabulary , Databases, Factual , Humans , Programming Languages , User-Computer Interface , Vocabulary, Controlled
ABSTRACT
UNLABELLED: SIMBioMS is a web-based open source software system for managing data and information in biomedical studies. It provides a solution for the collection, storage, management and retrieval of information about research subjects and biomedical samples, as well as experimental data obtained using a range of high-throughput technologies, including gene expression, genotyping, proteomics and metabonomics. The system can easily be customized and has proven to be successful in several large-scale multi-site collaborative projects. It is compatible with emerging functional genomics data standards and provides data import and export in accepted standard formats. Protocols for transferring data to durable archives at the European Bioinformatics Institute have been implemented. AVAILABILITY: The source code, documentation and initialization scripts are available at http://simbioms.org.
Subjects
Computational Biology/methods , Database Management Systems , Information Management/methods , Information Storage and Retrieval/methods , Software , Databases, Factual
ABSTRACT
Finding early disease markers using non-invasive and widely available methods is essential to develop a successful therapy for Alzheimer's Disease. Few studies to date have examined urine, the most readily available biofluid. Here we report the largest study to date using comprehensive metabolic phenotyping platforms (NMR spectroscopy and UHPLC-MS) to probe the urinary metabolome in depth in people with Alzheimer's Disease and Mild Cognitive Impairment. Feature reduction was performed using metabolomic Quantitative Trait Loci (mQTLs), resulting in a list of metabolites associated with the genetic variants. This approach improves the accuracy with which disease states are identified and provides a route to a plausible mechanistic link to pathological processes. Using these mQTLs we built a Random Forests model, which not only correctly discriminates between people with Alzheimer's Disease and age-matched controls, but also between individuals with Mild Cognitive Impairment who were later diagnosed with Alzheimer's Disease and those who were not. Further annotation of top-ranking metabolic features nominated by the trained model revealed the involvement of cholesterol-derived metabolites and small molecules that were linked to Alzheimer's pathology in previous studies.
Subjects
Alzheimer Disease/genetics , Alzheimer Disease/metabolism , Phenotype , Aged , Aged, 80 and over , Alzheimer Disease/urine , Biomarkers/urine , Cognitive Dysfunction/genetics , Cognitive Dysfunction/metabolism , Cognitive Dysfunction/urine , Female , Humans , Male , Metabolomics/methods , Quantitative Trait Loci
ABSTRACT
MOTIVATION: Current computational methods for the prediction of function from structure are restricted to the detection of similarities and subsequent transfer of functional annotation. In a significant minority of cases, global sequence or structural (fold) similarities do not provide clues about protein function. In these cases, one alternative is to detect local binding site similarities. These may still reflect more distant evolutionary relationships as well as unique physico-chemical constraints necessary for binding similar ligands, thus helping pinpoint the function. In the present work, we ask the following question: is it possible to discriminate within a dataset of non-homologous proteins those that bind similar ligands based on their binding site similarities? METHODS: We implement a graph-matching-based method for the detection of 3D atomic similarities, introducing some simplifications that allow us to extend its applicability to the analysis of large all-atom binding site models. This method, called IsoCleft, does not require atoms to be connected either in sequence or space. We apply the method to a cognate-ligand bound dataset of non-homologous proteins. We define a family of binding site models with decreasing knowledge about the identity of the ligand-interacting atoms to uncouple the questions of predicting the location of the binding site and detecting binding site similarities. Furthermore, we calculate the individual contributions of binding site size, chemical composition and geometry to prediction performance. RESULTS: We find that it is possible to discriminate between different ligand-binding sites. In other words, there is a certain uniqueness in the set of atoms that are in contact with specific ligand scaffolds. This uniqueness is restricted to the atoms in close proximity to the ligand, in which case size and chemical composition alone are sufficient to discriminate binding sites.
Discrimination ability decreases with decreasing knowledge about the identity of the ligand-interacting binding site atoms. The decrease is quite abrupt when considering size and chemical composition alone, but much slower when including geometry. We also observe that certain ligands are easier to discriminate. Interestingly, the subset of binding site atoms belonging to highly conserved residues is not sufficient to discriminate binding sites, implying that convergently evolved binding sites arrived at dissimilar solutions. AVAILABILITY: IsoCleft can be obtained from the authors.
Subjects
Algorithms , Models, Chemical , Models, Molecular , Proteins/chemistry , Proteins/ultrastructure , Sequence Alignment/methods , Sequence Analysis, Protein/methods , Amino Acid Sequence , Binding Sites , Computer Simulation , Discriminant Analysis , Molecular Sequence Data , Protein Binding , Protein Conformation , Sequence Homology
ABSTRACT
Anaplastic meningioma is a rare and aggressive brain tumor characterised by intractable recurrences and dismal outcomes. Here, we present an integrated analysis of the whole genome, transcriptome and methylation profiles of primary and recurrent anaplastic meningioma. A key finding was the delineation of distinct molecular subgroups that were associated with diametrically opposed survival outcomes. Relative to lower grade meningiomas, anaplastic tumors harbored frequent driver mutations in SWI/SNF complex genes, which were confined to the poor prognosis subgroup. Aggressive disease was further characterised by transcriptional evidence of increased PRC2 activity, stemness and epithelial-to-mesenchymal transition. Our analyses discern biologically distinct variants of anaplastic meningioma with prognostic and therapeutic significance.
Subjects
Gene Expression Regulation, Neoplastic , Meningeal Neoplasms/genetics , Meningioma/genetics , Neoplasm Recurrence, Local/genetics , Transcriptome/genetics , Aged , DNA Methylation/genetics , Disease Progression , Female , Gene Expression Profiling , Genomics/methods , Humans , Male , Meningeal Neoplasms/mortality , Meningeal Neoplasms/pathology , Meningeal Neoplasms/surgery , Meningioma/mortality , Meningioma/pathology , Meningioma/surgery , Middle Aged , Neoplasm Grading , Neoplasm Recurrence, Local/mortality , Neoplasm Recurrence, Local/pathology , Prognosis , Survival Analysis , Whole Genome Sequencing
ABSTRACT
The role of sex in biomedical studies has often been overlooked, despite evidence of sexually dimorphic effects in some biological studies. Here, we used high-throughput phenotype data from 14,250 wildtype and 40,192 mutant mice (representing 2,186 knockout lines), analysed for up to 234 traits, and found a large proportion of mammalian traits both in wildtype and mutants are influenced by sex. This result has implications for interpreting disease phenotypes in animal models and humans.
Subjects
Mammals/physiology , Quantitative Trait, Heritable , Sex Characteristics , Animals , Body Weight , Female , Genes, Modifier , Genotype , Mice , Phenotype
ABSTRACT
The developmental and physiological complexity of the auditory system is likely reflected in the underlying set of genes involved in auditory function. In humans, over 150 non-syndromic loci have been identified, and there are more than 400 human genetic syndromes with a hearing loss component. Over 100 non-syndromic hearing loss genes have been identified in mouse and human, but we remain ignorant of the full extent of the genetic landscape involved in auditory dysfunction. As part of the International Mouse Phenotyping Consortium, we undertook a hearing loss screen in a cohort of 3006 mouse knockout strains. In total, we identify 67 candidate hearing loss genes. We detect known hearing loss genes, but the vast majority, 52, of the candidate genes were novel. Our analysis reveals a large and unexplored genetic landscape involved with auditory function. The full extent of the genetic basis for hearing impairment is unknown. Here, as part of the International Mouse Phenotyping Consortium, the authors perform a hearing loss screen in 3006 mouse knockout strains and identify 52 new candidate genes for genetic hearing loss.
Subjects
Hearing Loss/genetics , Protein Interaction Maps/genetics , Animals , Datasets as Topic , Genetic Testing , Hearing Loss/epidemiology , Hearing Tests , Mice , Mice, Knockout , Phenotype
ABSTRACT
The lack of reproducibility in animal phenotyping experiments is a growing concern among the biomedical community. One contributing factor is the inadequate description of statistical analysis methods, which prevents researchers from replicating results even when the original data are provided. Here we present PhenStat, a freely available R package that provides a variety of statistical methods for the identification of phenotypic associations. The methods have been developed for high-throughput phenotyping pipelines implemented across various experimental designs, with an emphasis on managing temporal variation. PhenStat is targeted to two user groups: small-scale users who wish to interact with and test data from large resources, and large-scale users who require an automated statistical analysis pipeline. The software provides guidance to the user for selecting appropriate analysis methods based on the dataset and is designed to allow for additions and modifications as needed. The package was tested on mouse and rat data and is used by the International Mouse Phenotyping Consortium (IMPC). By providing raw data and the version of PhenStat used, resources like the IMPC give users the ability to replicate and explore results within their own computing environment.