Results 1 - 20 of 20
1.
Comput Biol Med ; 177: 108632, 2024 Jul.
Article in English | MEDLINE | ID: mdl-38788373

ABSTRACT

Machine Learning (ML) and Artificial Intelligence (AI) have become an integral part of the drug discovery and development value chain. Many teams in the pharmaceutical industry nevertheless report challenges associated with the timely, cost-effective and meaningful delivery of ML- and AI-powered solutions for their scientists. We sought to better understand what these challenges were and how to overcome them by performing an industry-wide assessment of the practices in AI and Machine Learning. Here we report the results of a systematic business analysis of the personas in the modern pharmaceutical discovery enterprise in relation to their work with AI and ML technologies. We identify 23 common business problems that individuals in these roles face when they encounter AI and ML technologies at work, and describe best practices (Good Machine Learning Practices) that address these issues.


Subject(s)
Drug Discovery , Drug Industry , Machine Learning , Humans , Artificial Intelligence
2.
BMC Bioinformatics ; 22(1): 377, 2021 Jul 21.
Article in English | MEDLINE | ID: mdl-34289807

ABSTRACT

BACKGROUND: Data integration to build a biomedical knowledge graph is a challenging task. There are multiple disease ontologies used in data sources and publications, each having its hierarchy. A common task is to map between ontologies, find disease clusters and finally build a representation of the chosen disease area. There is a shortage of published resources and tools to facilitate interactive, efficient and flexible cross-referencing and analysis of multiple disease ontologies commonly found in data sources and research. RESULTS: Our results are represented as a knowledge graph solution that uses disease ontology cross-references and facilitates switching between ontology hierarchies for data integration and other tasks. CONCLUSIONS: Grakn core with pre-installed "Disease ontologies for knowledge graphs" facilitates the biomedical knowledge graph build and provides an elegant solution for the multiple disease ontologies problem.


Subject(s)
Biological Ontologies , Ethnicity , Humans , Information Storage and Retrieval , Knowledge , Pattern Recognition, Automated
3.
Sci Rep ; 10(1): 21745, 2020 12 10.
Article in English | MEDLINE | ID: mdl-33303834

ABSTRACT

Finding early disease markers using non-invasive and widely available methods is essential to develop a successful therapy for Alzheimer's Disease. Few studies to date have examined urine, the most readily available biofluid. Here we report the largest study to date using comprehensive metabolic phenotyping platforms (NMR spectroscopy and UHPLC-MS) to probe the urinary metabolome in-depth in people with Alzheimer's Disease and Mild Cognitive Impairment. Feature reduction was performed using metabolomic Quantitative Trait Loci, resulting in the list of metabolites associated with the genetic variants. This approach helps accuracy in identification of disease states and provides a route to a plausible mechanistic link to pathological processes. Using these mQTLs we built a Random Forests model, which not only correctly discriminates between people with Alzheimer's Disease and age-matched controls, but also between individuals with Mild Cognitive Impairment who were later diagnosed with Alzheimer's Disease and those who were not. Further annotation of top-ranking metabolic features nominated by the trained model revealed the involvement of cholesterol-derived metabolites and small-molecules that were linked to Alzheimer's pathology in previous studies.


Subject(s)
Alzheimer Disease/genetics , Alzheimer Disease/metabolism , Phenotype , Aged , Aged, 80 and over , Alzheimer Disease/urine , Biomarkers/urine , Cognitive Dysfunction/genetics , Cognitive Dysfunction/metabolism , Cognitive Dysfunction/urine , Female , Humans , Male , Metabolomics/methods , Quantitative Trait Loci
4.
Sci Rep ; 8(1): 13537, 2018 09 10.
Article in English | MEDLINE | ID: mdl-30202034

ABSTRACT

Anaplastic meningioma is a rare and aggressive brain tumor characterised by intractable recurrences and dismal outcomes. Here, we present an integrated analysis of the whole genome, transcriptome and methylation profiles of primary and recurrent anaplastic meningioma. A key finding was the delineation of distinct molecular subgroups that were associated with diametrically opposed survival outcomes. Relative to lower grade meningiomas, anaplastic tumors harbored frequent driver mutations in SWI/SNF complex genes, which were confined to the poor prognosis subgroup. Aggressive disease was further characterised by transcriptional evidence of increased PRC2 activity, stemness and epithelial-to-mesenchymal transition. Our analyses discern biologically distinct variants of anaplastic meningioma with prognostic and therapeutic significance.


Subject(s)
Gene Expression Regulation, Neoplastic , Meningeal Neoplasms/genetics , Meningioma/genetics , Neoplasm Recurrence, Local/genetics , Transcriptome/genetics , Aged , DNA Methylation/genetics , Disease Progression , Female , Gene Expression Profiling , Genomics/methods , Humans , Male , Meningeal Neoplasms/mortality , Meningeal Neoplasms/pathology , Meningeal Neoplasms/surgery , Meningioma/mortality , Meningioma/pathology , Meningioma/surgery , Middle Aged , Neoplasm Grading , Neoplasm Recurrence, Local/mortality , Neoplasm Recurrence, Local/pathology , Prognosis , Survival Analysis , Whole Genome Sequencing
5.
Nat Commun ; 8(1): 886, 2017 10 12.
Article in English | MEDLINE | ID: mdl-29026089

ABSTRACT

The developmental and physiological complexity of the auditory system is likely reflected in the underlying set of genes involved in auditory function. In humans, over 150 non-syndromic loci have been identified, and there are more than 400 human genetic syndromes with a hearing loss component. Over 100 non-syndromic hearing loss genes have been identified in mouse and human, but we remain ignorant of the full extent of the genetic landscape involved in auditory dysfunction. As part of the International Mouse Phenotyping Consortium, we undertook a hearing loss screen in a cohort of 3006 mouse knockout strains. In total, we identify 67 candidate hearing loss genes. We detect known hearing loss genes, but the vast majority, 52, of the candidate genes were novel. Our analysis reveals a large and unexplored genetic landscape involved with auditory function. The full extent of the genetic basis for hearing impairment is unknown. Here, as part of the International Mouse Phenotyping Consortium, the authors perform a hearing loss screen in 3006 mouse knockout strains and identify 52 new candidate genes for genetic hearing loss.


Subject(s)
Hearing Loss/genetics , Protein Interaction Maps/genetics , Animals , Datasets as Topic , Genetic Testing , Hearing Loss/epidemiology , Hearing Tests , Mice , Mice, Knockout , Phenotype
6.
Nat Commun ; 8: 15475, 2017 06 26.
Article in English | MEDLINE | ID: mdl-28650954

ABSTRACT

The role of sex in biomedical studies has often been overlooked, despite evidence of sexually dimorphic effects in some biological studies. Here, we used high-throughput phenotype data from 14,250 wildtype and 40,192 mutant mice (representing 2,186 knockout lines), analysed for up to 234 traits, and found a large proportion of mammalian traits both in wildtype and mutants are influenced by sex. This result has implications for interpreting disease phenotypes in animal models and humans.


Subject(s)
Mammals/physiology , Quantitative Trait, Heritable , Sex Characteristics , Animals , Body Weight , Female , Genes, Modifier , Genotype , Mice , Phenotype
7.
PLoS One ; 10(7): e0131274, 2015.
Article in English | MEDLINE | ID: mdl-26147094

ABSTRACT

The lack of reproducibility with animal phenotyping experiments is a growing concern among the biomedical community. One contributing factor is the inadequate description of statistical analysis methods that prevents researchers from replicating results even when the original data are provided. Here we present PhenStat, a freely available R package that provides a variety of statistical methods for the identification of phenotypic associations. The methods have been developed for high throughput phenotyping pipelines implemented across various experimental designs with an emphasis on managing temporal variation. PhenStat is targeted to two user groups: small-scale users who wish to interact and test data from large resources and large-scale users who require an automated statistical analysis pipeline. The software provides guidance to the user for selecting appropriate analysis methods based on the dataset and is designed to allow for additions and modifications as needed. The package was tested on mouse and rat data and is used by the International Mouse Phenotyping Consortium (IMPC). By providing raw data and the version of PhenStat used, resources like the IMPC give users the ability to replicate and explore results within their own computing environment.


Subject(s)
High-Throughput Screening Assays/standards , Phenotype , Reproducibility of Results , Software , Animals , Datasets as Topic/standards , Datasets as Topic/statistics & numerical data , Female , High-Throughput Screening Assays/methods , High-Throughput Screening Assays/statistics & numerical data , Linear Models , Male , Mice , Rats , Reference Standards
8.
PLoS Biol ; 13(5): e1002151, 2015 May.
Article in English | MEDLINE | ID: mdl-25992600

ABSTRACT

The Animal Research: Reporting of In Vivo Experiments (ARRIVE) guidelines were developed to address the lack of reproducibility in biomedical animal studies and improve the communication of research findings. While intended to guide the preparation of peer-reviewed manuscripts, the principles of transparent reporting are also fundamental for in vivo databases. Here, we describe the benefits and challenges of applying the guidelines for the International Mouse Phenotyping Consortium (IMPC), whose goal is to produce and phenotype 20,000 knockout mouse strains in a reproducible manner across ten research centres. In addition to ensuring the transparency and reproducibility of the IMPC, the solutions to the challenges of applying the ARRIVE guidelines in the context of IMPC will provide a resource to help guide similar initiatives in the future.


Subject(s)
Animal Experimentation/standards , Databases as Topic , Guidelines as Topic , Phenotype , Animals , Mice
9.
Nucleic Acids Res ; 43(Database issue): D1113-6, 2015 Jan.
Article in English | MEDLINE | ID: mdl-25361974

ABSTRACT

The ArrayExpress Archive of Functional Genomics Data (http://www.ebi.ac.uk/arrayexpress) is an international functional genomics database at the European Bioinformatics Institute (EMBL-EBI) recommended by most journals as a repository for data supporting peer-reviewed publications. It contains data from over 7000 public sequencing and 42,000 array-based studies comprising over 1.5 million assays in total. The proportion of sequencing-based submissions has grown significantly over the last few years and has doubled in the last 18 months, whilst the rate of microarray submissions is growing slightly. All data in ArrayExpress are available in the MAGE-TAB format, which allows robust linking to data analysis and visualization tools and standardized analysis. The main development over the last two years has been the release of a new data submission tool Annotare, which has reduced the average submission time almost 3-fold. In the near future, Annotare will become the only submission route into ArrayExpress, alongside MAGE-TAB format-based pipelines. ArrayExpress is a stable and highly accessed resource. Our future tasks include automation of data flows and further integration with other EMBL-EBI resources for the representation of multi-omics data.


Subject(s)
Databases, Genetic , Gene Expression Profiling , Oligonucleotide Array Sequence Analysis , Genomics , High-Throughput Nucleotide Sequencing , Internet , Software
10.
Nucleic Acids Res ; 42(Database issue): D802-9, 2014 Jan.
Article in English | MEDLINE | ID: mdl-24194600

ABSTRACT

The International Mouse Phenotyping Consortium (IMPC) web portal (http://www.mousephenotype.org) provides the biomedical community with a unified point of access to mutant mice and a rich collection of related emerging and existing mouse phenotype data. IMPC mouse clinics worldwide follow rigorous, highly structured and standardized protocols for the experimentation, collection and dissemination of data. Dedicated 'data wranglers' work with each phenotyping center to collate data and perform quality control of data. An automated statistical analysis pipeline has been developed to identify knockout strains with a significant change in the phenotype parameters. Annotation with biomedical ontologies allows biologists and clinicians to easily find mouse strains with phenotypic traits relevant to their research. Data integration with other resources will provide insights into mammalian gene function and human disease. As phenotype data become available for every gene in the mouse, the IMPC web portal will become an invaluable tool for researchers studying the genetic contributions of genes to human diseases.


Subject(s)
Databases, Genetic , Mice, Knockout , Phenotype , Animals , Biological Ontologies , Internet , Mice
11.
Nature ; 501(7468): 506-11, 2013 Sep 26.
Article in English | MEDLINE | ID: mdl-24037378

ABSTRACT

Genome sequencing projects are discovering millions of genetic variants in humans, and interpretation of their functional effects is essential for understanding the genetic basis of variation in human traits. Here we report sequencing and deep analysis of messenger RNA and microRNA from lymphoblastoid cell lines of 462 individuals from the 1000 Genomes Project, the first uniformly processed high-throughput RNA-sequencing data from multiple human populations with high-quality genome sequences. We discover extremely widespread genetic variation affecting the regulation of most genes, with transcript structure and expression level variation being equally common but genetically largely independent. Our characterization of causal regulatory variation sheds light on the cellular mechanisms of regulatory and loss-of-function variation, and allows us to infer putative causal variants for dozens of disease-associated loci. Altogether, this study provides a deep understanding of the cellular mechanisms of transcriptome variation and of the landscape of functional variants in the human genome.


Subject(s)
Genetic Variation/genetics , Genome, Human/genetics , High-Throughput Nucleotide Sequencing , Sequence Analysis, RNA , Transcriptome/genetics , Alleles , Cell Line, Transformed , Exons/genetics , Gene Expression Profiling , Humans , Polymorphism, Single Nucleotide/genetics , Quantitative Trait Loci/genetics , RNA, Messenger/analysis , RNA, Messenger/genetics
12.
Nucleic Acids Res ; 41(Database issue): D987-90, 2013 Jan.
Article in English | MEDLINE | ID: mdl-23193272

ABSTRACT

The ArrayExpress Archive of Functional Genomics Data (http://www.ebi.ac.uk/arrayexpress) is one of three international functional genomics public data repositories, alongside the Gene Expression Omnibus at NCBI and the DDBJ Omics Archive, supporting peer-reviewed publications. It accepts data generated by sequencing or array-based technologies and currently contains data from almost a million assays, from over 30 000 experiments. The proportion of sequencing-based submissions has grown significantly over the last 2 years and has reached, in 2012, 15% of all new data. All data are available from ArrayExpress in MAGE-TAB format, which allows robust linking to data analysis and visualization tools, including Bioconductor and GenomeSpace. Additionally, R objects, for microarray data, and binary alignment format files, for sequencing data, have been generated for a significant proportion of ArrayExpress data.


Subject(s)
Databases, Genetic , Genomics , Microarray Analysis , Databases, Genetic/statistics & numerical data , Databases, Genetic/trends , High-Throughput Nucleotide Sequencing , Internet , Software , User-Computer Interface
13.
F1000Res ; 2: 117, 2013.
Article in English | MEDLINE | ID: mdl-24555058

ABSTRACT

IsoCleft Finder is a web-based tool for the detection of local geometric and chemical similarities between potential small-molecule binding cavities and a non-redundant dataset of ligand-bound known small-molecule binding-sites. The non-redundant dataset developed as part of this study is composed of 7339 entries representing unique Pfam/PDB-ligand (hetero group code) combinations with known levels of cognate ligand similarity. The query cavity can be uploaded by the user or detected automatically by the system using existing PDB entries as well as user-provided structures in PDB format. In all cases, the user can refine the definition of the cavity interactively via a browser-based Jmol 3D molecular visualization interface. Furthermore, users can restrict the search to a subset of the dataset using a cognate-similarity threshold. Local structural similarities are detected using the IsoCleft software and ranked according to two criteria (number of atoms in common and Tanimoto score of local structural similarity) and the associated Z-score and p-value measures of statistical significance. The results, including predicted ligands, target proteins, similarity scores, number of atoms in common, etc., are shown in a powerful interactive graphical interface. This interface permits the visualization of target ligands superimposed on the query cavity and additionally provides a table of pairwise ligand topological similarities. Similarities between top scoring ligands serve as an additional tool to judge the quality of the results obtained. We present several examples where IsoCleft Finder provides useful functional information. IsoCleft Finder results are complementary to existing approaches for the prediction of protein function from structure, rational drug design and x-ray crystallography. IsoCleft Finder can be found at: http://bcb.med.usherbrooke.ca/isocleftfinder.

14.
Bioinformatics ; 28(12): 1665-7, 2012 Jun 15.
Article in English | MEDLINE | ID: mdl-22556367

ABSTRACT

MOTIVATIONS: Spreadsheet-like tabular formats are ever more popular in the biomedical field as a means for experimental reporting. The problem of converting the graph of an experimental workflow into a table-based representation occurs in many such formats and is not easy to solve. RESULTS: We describe graph2tab, a library that implements methods to realise such a conversion in a size-optimised way. Our solution is generic and can be adapted to specific cases of data exporters or data converters that need to be implemented. AVAILABILITY AND IMPLEMENTATION: The library source code and documentation are available at http://github.com/ISA-tools/graph2tab.


Subject(s)
Computer Graphics , Programming Languages , Workflow , Computational Biology/methods , Databases, Factual , Oligonucleotide Array Sequence Analysis
15.
Nucleic Acids Res ; 40(Database issue): D1077-81, 2012 Jan.
Article in English | MEDLINE | ID: mdl-22064864

ABSTRACT

Gene Expression Atlas (http://www.ebi.ac.uk/gxa) is an added-value database providing information about gene expression in different cell types, organism parts, developmental stages, disease states, sample treatments and other biological/experimental conditions. The content of this database derives from curation, re-annotation and statistical analysis of selected data from the ArrayExpress Archive and the European Nucleotide Archive. A simple interface allows the user to query for differential gene expression either by gene names or attributes or by biological conditions, e.g. diseases, organism parts or cell types. Since our previous report we made 20 monthly releases and, as of Release 11.08 (August 2011), the database supports 19 species and contains expression data measured for 19,014 biological conditions in 136,551 assays from 5598 independent studies.


Subject(s)
Databases, Genetic , Gene Expression Profiling , Oligonucleotide Array Sequence Analysis , Atlases as Topic , Genomics , Humans , MicroRNAs/metabolism , Molecular Sequence Annotation , Sequence Analysis, RNA , User-Computer Interface
16.
Bioinformatics ; 27(17): 2468-70, 2011 Sep 01.
Article in English | MEDLINE | ID: mdl-21697126

ABSTRACT

MOTIVATION: There exist few simple and easily accessible methods to integrate ontologies programmatically in the R environment. We present ontoCAT, an R package to access ontologies in widely used standard formats, stored locally in the filesystem or available online. The ontoCAT package supports a number of traversal and search functions on a single ontology, as well as searching for ontology terms across multiple ontologies and in major ontology repositories. AVAILABILITY: The package and sources are freely available in Bioconductor starting from version 2.8: http://bioconductor.org/help/bioc-views/release/bioc/html/ontoCAT.html or via the OntoCAT website http://www.ontocat.org/wiki/r. CONTACT: natalja@ebi.ac.uk.


Subject(s)
Software , Vocabulary, Controlled , Terminology as Topic
17.
BMC Bioinformatics ; 12: 218, 2011 May 29.
Article in English | MEDLINE | ID: mdl-21619703

ABSTRACT

BACKGROUND: Ontologies have become an essential asset in the bioinformatics toolbox and a number of ontology access resources are now available, for example, the EBI Ontology Lookup Service (OLS) and the NCBO BioPortal. However, these resources differ substantially in mode, ease of access, and ontology content. This makes it relatively difficult to access each ontology source separately and map their contents to research data, and much of this effort is being replicated across different research groups. RESULTS: OntoCAT provides a seamless programming interface to query heterogeneous ontology resources including OLS and BioPortal, as well as user-specified local OWL and OBO files. Each resource is wrapped behind easy to learn Java, Bioconductor/R and REST web service commands enabling reuse and integration of ontology software efforts despite variation in technologies. It is also available as a stand-alone MOLGENIS database and a Google App Engine application. CONCLUSIONS: OntoCAT provides a robust, configurable solution for accessing ontology terms specified locally and from remote services, is available as a stand-alone tool and has been tested thoroughly in the ArrayExpress, MOLGENIS, EFO and Gen2Phen phenotype use cases. AVAILABILITY: http://www.ontocat.org.


Subject(s)
Computational Biology/methods , Software , Vocabulary , Databases, Factual , Humans , Programming Languages , User-Computer Interface , Vocabulary, Controlled
18.
Nucleic Acids Res ; 39(Database issue): D1002-4, 2011 Jan.
Article in English | MEDLINE | ID: mdl-21071405

ABSTRACT

The ArrayExpress Archive (http://www.ebi.ac.uk/arrayexpress) is one of the three international public repositories of functional genomics data supporting publications. It includes data generated by sequencing or array-based technologies. Data are submitted by users and imported directly from the NCBI Gene Expression Omnibus. The ArrayExpress Archive is closely integrated with the Gene Expression Atlas and the sequence databases at the European Bioinformatics Institute. Advanced queries provided via ontology enabled interfaces include queries based on technology and sample attributes such as disease, cell types and anatomy.


Subject(s)
Databases, Genetic , Gene Expression Profiling , Genomics , High-Throughput Nucleotide Sequencing , Oligonucleotide Array Sequence Analysis , Gene Expression
19.
Bioinformatics ; 25(20): 2768-9, 2009 Oct 15.
Article in English | MEDLINE | ID: mdl-19633095

ABSTRACT

UNLABELLED: SIMBioMS is a web-based open source software system for managing data and information in biomedical studies. It provides a solution for the collection, storage, management and retrieval of information about research subjects and biomedical samples, as well as experimental data obtained using a range of high-throughput technologies, including gene expression, genotyping, proteomics and metabonomics. The system can easily be customized and has proven to be successful in several large-scale multi-site collaborative projects. It is compatible with emerging functional genomics data standards and provides data import and export in accepted standard formats. Protocols for transferring data to durable archives at the European Bioinformatics Institute have been implemented. AVAILABILITY: The source code, documentation and initialization scripts are available at http://simbioms.org.


Subject(s)
Computational Biology/methods , Database Management Systems , Information Management/methods , Information Storage and Retrieval/methods , Software , Databases, Factual
20.
Bioinformatics ; 24(16): i105-11, 2008 Aug 15.
Article in English | MEDLINE | ID: mdl-18689810

ABSTRACT

MOTIVATION: Current computational methods for the prediction of function from structure are restricted to the detection of similarities and subsequent transfer of functional annotation. In a significant minority of cases, global sequence or structural (fold) similarities do not provide clues about protein function. In these cases, one alternative is to detect local binding site similarities. These may still reflect more distant evolutionary relationships as well as unique physico-chemical constraints necessary for binding similar ligands, thus helping pinpoint the function. In the present work, we ask the following question: is it possible to discriminate within a dataset of non-homologous proteins those that bind similar ligands based on their binding site similarities? METHODS: We implement a graph-matching-based method for the detection of 3D atomic similarities introducing some simplifications that allow us to extend its applicability to the analysis of large all-atom binding site models. This method, called IsoCleft, does not require atoms to be connected either in sequence or space. We apply the method to a cognate-ligand bound dataset of non-homologous proteins. We define a family of binding site models with decreasing knowledge about the identity of the ligand-interacting atoms to uncouple the questions of predicting the location of the binding site and detecting binding site similarities. Furthermore, we calculate the individual contributions of binding site size, chemical composition and geometry to prediction performance. RESULTS: We find that it is possible to discriminate between different ligand-binding sites. In other words, there is a certain uniqueness in the set of atoms that are in contact with specific ligand scaffolds. This uniqueness is restricted to the atoms in close proximity of the ligand, in which case size and chemical composition alone are sufficient to discriminate binding sites. Discrimination ability decreases with decreasing knowledge about the identity of the ligand-interacting binding site atoms. The decrease is quite abrupt when considering size and chemical composition alone, but much slower when including geometry. We also observe that certain ligands are easier to discriminate. Interestingly, the subset of binding site atoms belonging to highly conserved residues is not sufficient to discriminate binding sites, implying that convergently evolved binding sites arrived at dissimilar solutions. AVAILABILITY: IsoCleft can be obtained from the authors.


Subject(s)
Algorithms , Models, Chemical , Models, Molecular , Proteins/chemistry , Proteins/ultrastructure , Sequence Alignment/methods , Sequence Analysis, Protein/methods , Amino Acid Sequence , Binding Sites , Computer Simulation , Discriminant Analysis , Molecular Sequence Data , Protein Binding , Protein Conformation , Sequence Homology