Results 1 - 20 of 29
1.
J Cheminform ; 7: 54, 2015.
Article in English | MEDLINE | ID: mdl-26579214

ABSTRACT

BACKGROUND: A wide range of chemical compound databases are currently available for pharmaceutical research. To retrieve compound information, including structures, researchers can query these chemical databases using non-systematic identifiers. These are source-dependent identifiers (e.g., brand names, generic names), which are usually assigned to the compound at the point of registration. The correctness of non-systematic identifiers (i.e., whether an identifier matches the associated structure) can only be assessed manually, which is cumbersome, but it is possible to automatically check their ambiguity (i.e., whether an identifier matches more than one structure). In this study we have quantified the ambiguity of non-systematic identifiers within and between eight widely used chemical databases. We also studied the effect of chemical structure standardization on reducing the ambiguity of non-systematic identifiers. RESULTS: The ambiguity of non-systematic identifiers within databases varied from 0.1 to 15.2 % (median 2.5 %). Standardization reduced the ambiguity only to a small extent for most databases. A wide range of ambiguity existed for non-systematic identifiers that are shared between databases (17.7-60.2 %, median of 40.3 %). Removing stereochemistry information provided the largest reduction in ambiguity across databases (median reduction 13.7 percentage points). CONCLUSIONS: Ambiguity of non-systematic identifiers within chemical databases is generally low, but ambiguity of non-systematic identifiers that are shared between databases is high. Chemical structure standardization reduces the ambiguity to a limited extent. Our findings can help to improve database integration, curation, and maintenance.
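
The ambiguity measure defined in this abstract (an identifier that matches more than one structure) can be illustrated with a minimal Python sketch. The function name, the case-insensitive name matching, and the toy records are assumptions made for illustration; they are not taken from the study.

```python
from collections import defaultdict


def ambiguity_percent(records):
    """Return the percentage of identifiers linked to more than one structure.

    `records` is an iterable of (identifier, structure_key) pairs, where
    structure_key is any canonical structure representation.
    """
    structures_by_name = defaultdict(set)
    for name, structure_key in records:
        # Case-insensitive, whitespace-trimmed matching is an assumption
        # of this sketch, not a rule stated in the abstract.
        structures_by_name[name.strip().lower()].add(structure_key)
    if not structures_by_name:
        return 0.0
    ambiguous = sum(1 for keys in structures_by_name.values() if len(keys) > 1)
    return 100.0 * ambiguous / len(structures_by_name)


# Hypothetical toy records; the structure keys are placeholders.
records = [
    ("aspirin", "STRUCT-A"),
    ("Aspirin", "STRUCT-A"),
    ("ibuprofen", "STRUCT-B"),
    ("ibuprofen", "STRUCT-C"),
    ("caffeine", "STRUCT-D"),
    ("paracetamol", "STRUCT-E"),
]
print(ambiguity_percent(records))  # one of four distinct names is ambiguous
```

Re-running the same function after replacing each structure key with a standardized or stereochemistry-free key would show how much a given normalization step reduces ambiguity.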

2.
PLoS One ; 9(9): e107477, 2014.
Article in English | MEDLINE | ID: mdl-25268232

ABSTRACT

Exploring the chemical and biological space covered by patent applications is crucial in early-stage medicinal chemistry activities. Patent analysis can provide understanding of compound prior art, novelty checking, validation of biological assays, and identification of new starting points for chemical exploration. Extracting chemical and biological entities from patents through manual extraction by expert curators can take a substantial amount of time and resources. Text mining methods can help to ease this process. To validate the performance of such methods, a manually annotated patent corpus is essential. In this study we have produced a large gold standard chemical patent corpus. We developed annotation guidelines and selected 200 full patents from the World Intellectual Property Organization, United States Patent and Trademark Office, and European Patent Office. The patents were pre-annotated automatically and made available to four independent annotator groups each consisting of two to ten annotators. The annotators marked chemicals in different subclasses, diseases, targets, and modes of action. Spelling mistakes and spurious line breaks due to optical character recognition errors were also annotated. A subset of 47 patents was annotated by at least three annotator groups, from which harmonized annotations and inter-annotator agreement scores were derived. One group annotated the full set. The patent corpus includes 400,125 annotations for the full set and 36,537 annotations for the harmonized set. All patents and annotated entities are publicly available at www.biosemantics.org.


Subjects
Data Mining/standards, Benchmarking, Data Curation, Natural Language Processing, Patents as Topic, Controlled Vocabulary
3.
J Cheminform ; 6(1): 42, 2014 Dec.
Article in English | MEDLINE | ID: mdl-25264458

ABSTRACT

BACKGROUND: The design of chemical libraries, an early step in agrochemical discovery programs, is frequently addressed by means of qualitative physicochemical and/or topological rule-based methods. The aim of this study is to develop quantitative estimates of herbicide- (QEH), insecticide- (QEI), fungicide- (QEF), and, finally, pesticide-likeness (QEP). In the assessment of these definitions, we relied on the concept of desirability functions. RESULTS: We found a simple function, shared by the three classes of pesticides, parameterized particularly for six easy-to-compute, independent and interpretable molecular properties: molecular weight, logP, number of hydrogen bond acceptors, number of hydrogen bond donors, number of rotatable bonds and number of aromatic rings. Subsequently, we describe the scoring of each pesticide class by the corresponding quantitative estimate. In a comparative study, we assessed the performance of the scoring functions using extensive datasets of patented pesticides. CONCLUSIONS: The hereby-established quantitative assessment has the ability to rank compounds whether they fail well-established pesticide-likeness rules or not, and offers an efficient way to prioritize (class-specific) pesticides. These findings are valuable for the efficient estimation of pesticide-likeness of vast chemical libraries in the field of agrochemical discovery. Graphical Abstract: Quantitative models for pesticide-likeness were derived using the concept of desirability functions parameterized for six easy-to-compute, independent and interpretable molecular properties: molecular weight, logP, number of hydrogen bond acceptors, number of hydrogen bond donors, number of rotatable bonds and number of aromatic rings.

4.
Mol Inform ; 33(5): 332-42, 2014 May.
Article in English | MEDLINE | ID: mdl-27485890

ABSTRACT

In the pharmaceutical industry, efficiently mining pharmacological data from the rapidly increasing scientific literature is crucial for many aspects of the drug discovery process, such as target validation and tool compound selection. A quick and reliable way is needed to collect literature assertions of selected compounds' biological and pharmacological effects in order to assist the hypothesis generation and decision-making of drug developers. INFUSIS, the text mining system presented here, extracts data on chemical compounds from PubMed abstracts. It involves an extensive use of customized natural language processing in addition to a co-occurrence analysis. As a proof-of-concept study, INFUSIS was used to search in abstract texts for several obesity/diabetes related pharmacological effects of the compounds included in a compound dictionary. The system extracts assertions regarding the pharmacological effects of each given compound and scores them by relevance. For each selected pharmacological effect, the highest scoring assertions in 100 abstracts were manually evaluated, i.e. 800 abstracts in total. The overall accuracy for the inferred assertions was over 90 percent.
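
The co-occurrence step mentioned in this abstract can be sketched in Python as follows. This is not the INFUSIS implementation, which the abstract says relies mainly on customized natural language processing; it only counts sentences in which a dictionary compound and an effect term appear together. The function name, the naive sentence splitter, the substring matching, and the placeholder names `drugx` and `drugy` are all assumptions.

```python
import re
from collections import Counter


def cooccurrence_counts(abstracts, compounds, effects):
    """Count sentences in which a compound term and an effect term co-occur."""
    counts = Counter()
    for text in abstracts:
        # Naive sentence split on terminal punctuation followed by whitespace.
        for sentence in re.split(r"(?<=[.!?])\s+", text):
            lowered = sentence.lower()
            for compound in compounds:
                if compound.lower() not in lowered:
                    continue
                for effect in effects:
                    if effect.lower() in lowered:
                        counts[(compound, effect)] += 1
    return counts


# Made-up sentences with placeholder compound names, for illustration only.
abstracts = [
    "Drugx changed insulin sensitivity in the model. Body weight was not measured.",
    "Drugx was tested for body weight effects. Drugy was also tested for body weight effects.",
]
counts = cooccurrence_counts(
    abstracts,
    compounds=["drugx", "drugy"],
    effects=["insulin sensitivity", "body weight"],
)
for pair, n in sorted(counts.items()):
    print(pair, n)
```

Raw co-occurrence counts say nothing about the direction or truth of an assertion, which is presumably why the system described here adds language processing and relevance scoring on top.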

5.
Nat Rev Drug Discov ; 12(12): 948-62, 2013 Dec.
Article in English | MEDLINE | ID: mdl-24287782

ABSTRACT

The 'quality' of small-molecule drug candidates, encompassing aspects including their potency, selectivity and ADMET (absorption, distribution, metabolism, excretion and toxicity) characteristics, is a key factor influencing the chances of success in clinical trials. Importantly, such characteristics are under the control of chemists during the identification and optimization of lead compounds. Here, we discuss the application of computational methods, particularly quantitative structure-activity relationships (QSARs), in guiding the selection of higher-quality drug candidates, as well as cultural factors that may have affected their use and impact.


Subjects
Drug Compounding/standards, Chemical Models, Pharmaceutical Preparations/chemistry, Quantitative Structure-Activity Relationship, Animals, Forecasting, Humans
6.
PLoS One ; 8(10): e77142, 2013.
Article in English | MEDLINE | ID: mdl-24204758

ABSTRACT

The statistics of drug development output and declining yield of approved medicines have been the subject of many recent reviews. However, assessing research productivity that feeds development is more difficult. Here we utilise an extensive database of structure-activity relationships extracted from papers and patents. We have used this database to analyse published compounds cumulatively linked to nearly 4000 protein target identifiers from multiple species over the last 20 years. The compound output increases up to 2005 followed by a decline that parallels a fall in pharmaceutical patenting. Counts of protein targets have plateaued but not fallen. We extended these results by exploring compounds and targets for one large pharmaceutical company. In addition, we examined collective time course data for six individual protease targets, including average molecular weight of the compounds. We also tracked the PubMed profile of these targets to detect signals related to changes in compound output. Our results show that research compound output had decreased 35% by 2012. The major causative factor is likely to be a contraction in the global research base due to mergers and acquisitions across the pharmaceutical industry. However, this does not rule out an increasing stringency of compound quality filtration and/or patenting cost control. The number of proteins mapped to compounds on a yearly basis shows less decline, indicating the cumulative published target capacity of global research is being sustained in the region of 300 proteins for large companies. The tracking of six individual targets shows uniquely detailed patterns not discernible from cumulative snapshots. These are interpretable in terms of events related to validation and de-risking of targets that produce detectable follow-on surges in patenting. Further analysis of the type we present here can provide unique insights into the process of drug discovery based on the data it actually generates.


Subjects
Drug Discovery/statistics & numerical data, Investigational Drugs/chemical synthesis, Proteins/metabolism, Cost Control, Factual Databases, Pharmaceutical Databases, Drug Discovery/economics, Drug Discovery/trends, Drug Industry, Investigational Drugs/pharmacology, Efficiency, Humans, Patents as Topic, Proteins/agonists, Proteins/antagonists & inhibitors, PubMed, Research Design, Structure-Activity Relationship, Time Factors
7.
Mol Inform ; 32(11-12): 881-897, 2013 Dec.
Article in English | MEDLINE | ID: mdl-24533037

ABSTRACT

ChEMBL, DrugBank, Human Metabolome Database and the Therapeutic Target Database are resources of curated chemistry-to-protein relationships widely used in the chemogenomic arena. In this work we have extended an earlier analysis (PMID 22821596) by comparing chemistry and protein target content between 2010 and 2013. For the former, details are presented for overlaps and differences, statistics of stereochemistry as well as stereo representation and MW profiles between the four databases. For 2013 our results indicate quality improvements, major expansion, increased achiral structures and changes in MW distributions. An orthogonal comparison of chemical content with different sources inside PubChem highlights further interpretable differences. Expansion of protein content by UniProt IDs is also recorded for 2013 and Gene Ontology comparisons for human-only sets indicate differences. These emphasise the expanding complementarity of chemistry-to-protein relationships between sources, although different criteria are used for their capture.

8.
J Cheminform ; 4(1): 35, 2012 Dec 13.
Article in English | MEDLINE | ID: mdl-23237381

ABSTRACT

BACKGROUND: Correctness of structures and associated metadata within public and commercial chemical databases greatly impacts drug discovery research activities such as quantitative structure-property relationships modelling and compound novelty checking. MOL files, SMILES notations, IUPAC names, and InChI strings are ubiquitous file formats and systematic identifiers for chemical structures. While interchangeable for many cheminformatics purposes there have been no studies on the inconsistency of these structure identifiers due to various approaches for data integration, including the use of different software and different rules for structure standardisation. We have investigated the consistency of systematic identifiers of small molecules within and between some of the commonly used chemical resources, with and without structure standardisation. RESULTS: The consistency between systematic chemical identifiers and their corresponding MOL representation varies greatly between data sources (37.2%-98.5%). We observed the lowest overall consistency for MOL-IUPAC names. Disregarding stereochemistry increases the consistency (84.8% to 99.9%). A wide variation in consistency also exists between MOL representations of compounds linked via cross-references (25.8% to 93.7%). Removing stereochemistry improved the consistency (47.6% to 95.6%). CONCLUSIONS: We have shown that considerable inconsistency exists in structural representation and systematic chemical identifiers within and between databases. This can have a great influence especially when merging data and if systematic identifiers are used as a key index for structure integration or cross-querying several databases. Regenerating systematic identifiers starting from their MOL representation and applying well-defined and documented chemistry standardisation rules to all compounds prior to creating them can dramatically increase internal consistency.
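
To illustrate how a consistency percentage can change when stereochemistry is disregarded, here is a minimal Python sketch. It uses InChIKeys as a stand-in for the structure comparison, which is an assumption: the abstract does not say which comparison key the authors used. A standard InChIKey has the form of a 14-character block, a 10-character block, and a final character; the first block hashes the molecular skeleton, while stereochemistry and isotopic information are hashed into the second block, so comparing first blocks only also ignores isotopes. The keys below are hypothetical placeholders, not real compounds.

```python
def skeleton_block(inchikey):
    """Return the first (14-character) block of an InChIKey."""
    return inchikey.split("-")[0]


def consistency_percent(pairs, ignore_stereo=False):
    """Percentage of (key_a, key_b) pairs whose keys agree."""
    if not pairs:
        return 0.0
    if ignore_stereo:
        # Compare only the skeleton block of each key.
        matches = sum(skeleton_block(a) == skeleton_block(b) for a, b in pairs)
    else:
        matches = sum(a == b for a, b in pairs)
    return 100.0 * matches / len(pairs)


# Hypothetical placeholder keys in InChIKey layout (14-10-1 characters).
pairs = [
    ("AAAAAAAAAAAAAA-BBBBBBBBSA-N", "AAAAAAAAAAAAAA-BBBBBBBBSA-N"),  # identical
    ("CCCCCCCCCCCCCC-DDDDDDDDSA-N", "CCCCCCCCCCCCCC-EEEEEEEESA-N"),  # second block differs
    ("FFFFFFFFFFFFFF-GGGGGGGGSA-N", "HHHHHHHHHHHHHH-GGGGGGGGSA-N"),  # skeleton differs
    ("IIIIIIIIIIIIII-JJJJJJJJSA-N", "IIIIIIIIIIIIII-JJJJJJJJSA-N"),  # identical
]
print(consistency_percent(pairs))
print(consistency_percent(pairs, ignore_stereo=True))
```

In practice each key would be regenerated from the MOL representation and from the stored identifier with a cheminformatics toolkit before comparison; that step is omitted here.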

9.
Methods Mol Biol ; 910: 145-64, 2012.
Article in English | MEDLINE | ID: mdl-22821596

ABSTRACT

Databases that provide links between bioactive compounds and their protein targets are increasingly important in drug discovery and chemical biology. They join the expanding universes of cheminformatics via chemical structures on the one hand and bioinformatics via sequences on the other. However, it is difficult to assess the relative utility of databases without the explicit comparison of content. We have exemplified an approach to this by comparing resources that each has a different focus on bioactive chemistry (ChEMBL, DrugBank, Human Metabolome Database, and Therapeutic Target Database) both at the chemical structure and protein levels. We compared the compound sets at different representational stringencies using NCI/CADD Structure Identifiers. The overlap and uniqueness in chemical content can be broadly interpreted in the context of different data capture strategies. However, we recorded apparent anomalies, such as many compounds-in-common between the metabolite and drug databases. We also compared the content of sequences mapped to the compounds via their UniProt protein identifiers. While these were also generally interpretable in the context of individual databases we discerned differences in coverage and the types of supporting data used. For example, the target concept is applied differently between DrugBank and the Therapeutic Target Database. In ChEMBL it encompasses a broader range of mappings from chemical biology and species orthologue cross-screening in addition to drug targets per se. Our analysis should assist users not only in exploiting the synergies between these four high-value resources but also in assessing the utility of other databases at the interface of chemistry and biology.


Subjects
Computational Biology, Chemical Databases, Protein Databases, Molecular Targeted Therapy, Humans, Structure-Activity Relationship
10.
J Chem Inf Model ; 52(6): 1480-9, 2012 Jun 25.
Article in English | MEDLINE | ID: mdl-22639789

ABSTRACT

Patent specifications are one of many information sources needed to progress drug discovery projects. Understanding compound prior art and novelty checking, validation of biological assays, and identification of new starting points for chemical explorations are a few areas where patent analysis is an important component. Cheminformatics methods can be used to facilitate the identification of so-called key compounds in patent specifications. Such methods, relying on structural information extracted from documents by expert curation or text mining, can complement or in some cases replace the traditional manual approach of searching for clues in the text. This paper describes and compares three different methods for the automatic prediction of key compounds in patent specifications using structural information alone. For this data set, the cluster seed analysis described by Hattori et al. (Hattori, K.; Wakabayashi, H.; Tamaki, K. Predicting key example compounds in competitors' patent applications using structural information alone. J. Chem. Inf. Model. 2008, 48, 135-142) is superior in terms of prediction accuracy with 26 out of 48 drugs (54%) correctly predicted from their corresponding patents. Nevertheless, the two new methods, based on frequency of R-groups (FOG) and maximum common substructure (MCS) similarity measures, show significant advantages due to their inherent ability to visualize relevant structural features. The results of the FOG method can be enhanced by manual selection of the scaffolds used in the analysis. Finally, a successful example of applying FOG analysis for designing potent ATP-competitive AXL kinase inhibitors with improved properties is described.


Subjects
Drug Discovery, Molecular Structure, Patents as Topic
11.
Nat Chem ; 4(2): 90-8, 2012 Jan 24.
Article in English | MEDLINE | ID: mdl-22270643

ABSTRACT

Drug-likeness is a key consideration when selecting compounds during the early stages of drug discovery. However, evaluation of drug-likeness in absolute terms does not reflect adequately the whole spectrum of compound quality. More worryingly, widely used rules may inadvertently foster undesirable molecular property inflation as they permit the encroachment of rule-compliant compounds towards their boundaries. We propose a measure of drug-likeness based on the concept of desirability called the quantitative estimate of drug-likeness (QED). The empirical rationale of QED reflects the underlying distribution of molecular properties. QED is intuitive, transparent, straightforward to implement in many practical settings and allows compounds to be ranked by their relative merit. We extended the utility of QED by applying it to the problem of molecular target druggability assessment by prioritizing a large set of published bioactive compounds. The measure may also capture the abstract notion of aesthetics in medicinal chemistry.
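
The abstract does not state the formula. A standard way to combine individual desirability scores into one value, and the form generally associated with QED, is their (optionally weighted) geometric mean. The Python sketch below shows only that combination step; the function name and the example scores are hypothetical, and the property-specific desirability functions that would produce such scores are not reproduced here.

```python
import math


def combined_desirability(desirabilities, weights=None):
    """Weighted geometric mean of per-property desirability scores in [0, 1]."""
    if not desirabilities:
        raise ValueError("at least one desirability score is required")
    if weights is None:
        weights = [1.0] * len(desirabilities)
    if len(weights) != len(desirabilities):
        raise ValueError("weights and desirabilities must have equal length")
    if any(d < 0.0 or d > 1.0 for d in desirabilities):
        raise ValueError("desirability scores must lie in [0, 1]")
    if any(w <= 0.0 for w in weights):
        raise ValueError("weights must be positive")
    if any(d == 0.0 for d in desirabilities):
        # A single fully undesirable property drives the geometric mean to zero.
        return 0.0
    log_sum = sum(w * math.log(d) for w, d in zip(weights, desirabilities))
    return math.exp(log_sum / sum(weights))


# Hypothetical desirability scores for several properties of one compound.
scores = [0.9, 0.8, 0.95, 0.7, 0.85, 0.6]
print(round(combined_desirability(scores), 3))
```

Because the result is continuous, compounds can be ranked by relative merit rather than sorted into pass and fail groups, which matches the motivation given in the abstract.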


Subjects
Pharmaceutical Preparations/chemistry, Empirical Research
12.
Mol Inform ; 31(8): 555-568, 2012 Aug.
Article in English | MEDLINE | ID: mdl-23308082

ABSTRACT

The access and use of large-scale structure-activity relationships (SAR) is increasing as the range of targets and availability of bioactive compound-to-protein mappings expands. However, effective exploitation requires merging and normalisation of activity data, mappings to target classifications as well as visual display of chemical structure relationships. This work describes the development of the application "SARConnect" to address these issues. We discuss options for delivery and analysis of large-scale SAR data together with a set of use-cases to illustrate the design choices and utility. The main activity sources of ChEMBL [1], GOSTAR [2] and AstraZeneca's internal system IBIS had already been integrated in Chemistry Connect [3]. For target relationships we selected human UniProtKB/Swiss-Prot [4] as our primary source of a heuristic target classification. Similarly, to explore chemical relationships we combined several methods for framework and scaffold analysis into a unified, hierarchical classification where ease of navigation was the primary goal. An application was built on TIBCO Spotfire to retrieve data for visual display. Consequently, users can explore relationships between target, activity and structure across internal, external and commercial sources that encompass approximately 3 million compounds, 2000 human proteins and 10 million activity values. Examples showing the utility of the application are given.

13.
J Chem Inf Model ; 52(1): 51-62, 2012 Jan 23.
Article in English | MEDLINE | ID: mdl-22148717

ABSTRACT

The text mining of patents of pharmaceutical interest poses a number of unique challenges not encountered in other fields of text mining. Unlike fields, such as bioinformatics, where the number of terms of interest is enumerable and essentially static, systematic chemical nomenclature can describe an infinite number of molecules. Hence, the dictionary- and ontology-based techniques that are commonly used for gene names, diseases, species, etc., have limited utility when searching for novel therapeutic compounds in patents. Additionally, the length and the composition of IUPAC-like names make them more susceptible to typographic problems: OCR failures, human spelling errors, and hyphenation and line breaking issues. This work describes a novel technique, called CaffeineFix, designed to efficiently identify chemical names in free text, even in the presence of typographical errors. Corrected chemical names are generated as input for name-to-structure software. This forms a preprocessing pass, independent of the name-to-structure software used, and is shown to greatly improve the results of chemical text mining in our study.


Subjects
Computational Biology/methods, Chemical Dictionaries as Topic, Natural Language Processing, Software, Data Mining, Factual Databases, Patents as Topic
14.
Drug Discov Today ; 16(23-24): 1019-30, 2011 Dec.
Article in English | MEDLINE | ID: mdl-22024215

ABSTRACT

The increase in drug research output from patent applications, together with the expansion of public data collections, such as ChEMBL and PubChem BioAssay, has made it essential for pharmaceutical companies to integrate both internal and external 'SAR estate'. The AstraZeneca response has been the development of an enterprise application, Chemistry Connect, containing 45 million unique chemical structures from 18 internal and external data sources. It includes merged compound-to-assay-to-result-to-target relationships extracted from patents, papers and internal data. Users can explore connections between these by searching using drug names or synonyms, chemical structures, patent numbers and target protein identifiers at a scale not previously available.


Subjects
Factual Databases, Drug Discovery/methods, Pharmaceutical Preparations/chemistry, Pharmacology, Computational Biology/methods, Humans, Structure-Activity Relationship
15.
J Cheminform ; 3(1): 14, 2011 May 13.
Article in English | MEDLINE | ID: mdl-21569515

ABSTRACT

BACKGROUND: Since the classic Hopkins and Groom druggable genome review in 2002, there have been a number of publications updating both the hypothetical and successful human drug target statistics. However, listings of research targets that define the area between these two extremes are sparse because of the challenges of collating published information at the necessary scale. We have addressed this by interrogating databases, populated by expert curation, of bioactivity data extracted from patents and journal papers over the last 30 years. RESULTS: From a subset of just over 27,000 documents we have extracted a set of compound-to-target relationships for biochemical in vitro binding-type assay data for 1,736 human proteins and 1,654 gene identifiers. These are linked to 1,671,951 compound records derived from 823,179 unique chemical structures. The distribution showed a compounds-per-target average of 964 with a maximum of 42,869 (Factor Xa). The list includes non-targets, failed targets and cross-screening targets. The top-278 most actively pursued targets cover 90% of the compounds. We further investigated target ranking by determining the number of molecular frameworks and scaffolds. These were compared to the compound counts as alternative measures of chemical diversity on a per-target basis. CONCLUSIONS: The compounds-per-protein listing generated in this work (provided as a supplementary file) represents the major proportion of the human drug target landscape defined by published data. We supplemented the simple ranking by the number of compounds assayed with additional rankings by molecular topology. These showed significant differences and provide complementary assessments of chemical tractability.
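
Statistics of the kind reported in this abstract (average compounds per target, and how many top-ranked targets account for a given share of compounds) can be computed as in the Python sketch below. The function names, the ranking by compound count, the counting of coverage over unique compounds, and the toy data are assumptions; the abstract does not specify how the 90% coverage figure was derived.

```python
def mean_compounds_per_target(compounds_by_target):
    """Average number of compounds linked to each target."""
    if not compounds_by_target:
        return 0.0
    total = sum(len(ids) for ids in compounds_by_target.values())
    return total / len(compounds_by_target)


def targets_needed_for_coverage(compounds_by_target, percent=90):
    """Number of top-ranked targets whose compounds reach `percent` coverage.

    Targets are ranked by compound count (largest first); coverage is
    measured over unique compounds, so shared compounds count once.
    """
    all_compounds = set().union(*compounds_by_target.values())
    if not all_compounds:
        return 0
    ranked = sorted(
        compounds_by_target,
        key=lambda target: len(compounds_by_target[target]),
        reverse=True,
    )
    covered = set()
    for n, target in enumerate(ranked, start=1):
        covered |= compounds_by_target[target]
        # Integer arithmetic avoids floating-point threshold surprises.
        if len(covered) * 100 >= percent * len(all_compounds):
            return n
    return len(ranked)


# Hypothetical toy mapping of target name to compound identifiers.
compounds_by_target = {
    "target_1": {1, 2, 3, 4, 5, 6},
    "target_2": {5, 6, 7, 8},
    "target_3": {9},
    "target_4": {10},
}
print(mean_compounds_per_target(compounds_by_target))
print(targets_needed_for_coverage(compounds_by_target, percent=90))
```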

16.
J Med Chem ; 53(21): 7709-14, 2010 Nov 11.
Article in English | MEDLINE | ID: mdl-20942392

ABSTRACT

There is a strong interest in drug discovery and development to advance the understanding of pharmacological promiscuity. Improved understanding of how a molecular structure is related to promiscuity could help to reduce the attrition of compounds in the drug discovery process. For this purpose, a descriptor is introduced that describes the structural complexity of a compound based on the size of its molecular framework (MF) in relation to its overall size. It is defined as the fraction of the size of the molecular framework versus the size of the whole molecule (f(MF)). It is demonstrated that promiscuity correlates with f(MF) for large f(MF) values. The observed correlation is not due to lipophilicity. To provide further explanation of this observation, it was found that the number of terminal ring systems in a compound is correlated with promiscuity. The analysis presented here might help medicinal chemists to improve the selectivity for compounds in drug discovery projects.
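
The descriptor f(MF) defined in this abstract is a simple ratio, sketched below in Python. Measuring size as a heavy-atom count is an assumption, since the abstract only says "size", and the function name is hypothetical. Deriving the molecular framework itself requires a cheminformatics toolkit and is not shown.

```python
def framework_fraction(framework_heavy_atoms, total_heavy_atoms):
    """Fraction of a molecule's heavy atoms that belong to its molecular framework."""
    if total_heavy_atoms <= 0:
        raise ValueError("the molecule must contain at least one heavy atom")
    if not 0 <= framework_heavy_atoms <= total_heavy_atoms:
        raise ValueError("framework size must lie between 0 and the molecule size")
    return framework_heavy_atoms / total_heavy_atoms


# Toluene: the framework is the benzene ring (6 heavy atoms) out of 7 in total.
print(framework_fraction(6, 7))
# An acyclic molecule has no ring framework, so the fraction is zero.
print(framework_fraction(0, 5))
```

Values close to 1 describe molecules that consist almost entirely of ring systems and linkers, which is the range where the abstract reports a correlation with promiscuity.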


Subjects
Molecular Structure, Pharmaceutical Preparations/chemistry, Factual Databases, Drug Discovery, Quantitative Structure-Activity Relationship
17.
Bioorg Med Chem Lett ; 19(24): 6943-7, 2009 Dec 15.
Article in English | MEDLINE | ID: mdl-19879759

ABSTRACT

We performed a comparison of several simple physicochemical properties between marketed drugs, clinical candidates and bioactive compounds using commercially available databases (GVKBIO, Hyderabad, India). In contrast to previous studies this comparison was performed at the individual target level. Confirming earlier studies this shows that marketed drugs have, on average and taken as a single set, lower physicochemical property values than the corresponding clinical candidates and bioactive compounds but that there is considerable variation between drug targets. This work complements earlier studies by using a much larger annotated dataset and confirms that there is a shift in physicochemical properties for targets with launched drugs and clinical candidates compared to bioactive compounds.


Subjects
Biological Products/chemistry, Marketing, Factual Databases, Preclinical Drug Evaluation
18.
J Med Chem ; 52(7): 1953-62, 2009 Apr 09.
Article in English | MEDLINE | ID: mdl-19265440

ABSTRACT

Natural products (NPs) are a rich source of novel compound classes and new drugs. In the present study we have used the chemical space navigation tool ChemGPS-NP to evaluate the chemical space occupancy by NPs and bioactive medicinal chemistry compounds from the database WOMBAT. The two sets differ notably in coverage of chemical space, and tangible leadlike NPs were found to cover regions of chemical space that lack representation in WOMBAT. Property based similarity calculations were performed to identify NP neighbors of approved drugs. Several of the NPs revealed by this method were confirmed to exhibit the same activity as their drug neighbors. The identification of leads from a NP starting point may prove a useful strategy for drug discovery in the search for novel leads with unique properties.


Subjects
Biological Products, Drug Discovery/methods, Pharmaceutical Preparations, Biological Products/chemistry, Computer Graphics, Factual Databases, Drug Design, Molecular Structure, Pharmaceutical Preparations/chemistry, Structure-Activity Relationship
19.
J Comput Aided Mol Des ; 23(4): 253-9, 2009 Apr.
Article in English | MEDLINE | ID: mdl-19082743

ABSTRACT

The Internet has become a central source for information, tools, and services facilitating the work of medicinal chemists and drug discoverers worldwide. In this paper we introduce a web-based public tool, ChemGPS-NP(Web) (http://chemgps.bmc.uu.se), for comprehensive chemical space navigation and exploration in terms of global mapping onto a consistent, eight-dimensional map over structure-derived physico-chemical characteristics. ChemGPS-NP(Web) can assist in compound selection and prioritization; property description and interpretation; cluster analysis and neighbourhood mapping; as well as comparison and characterization of large compound datasets. By using ChemGPS-NP(Web), researchers can analyze and compare chemical libraries in a consistent manner. In this study it is demonstrated how ChemGPS-NP(Web) can assist in interpreting results from two large datasets tested for activity in biological assays for pyruvate kinase and Bcl-2 family related protein interactions, respectively. Furthermore, a more than 30-year-old suggestion of "chemical similarity" between the natural pigments betalains and muscaflavins is tested.


Subjects
Computational Biology/methods, Drug Discovery/methods, Internet, Molecular Models, Software, Antineoplastic Agents/chemistry, Antineoplastic Agents/pharmacology, Apoptosis Regulatory Proteins/antagonists & inhibitors, Apoptosis Regulatory Proteins/chemistry, Apoptosis Regulatory Proteins/metabolism, Bcl-2-Like Protein 11, Betalains/chemistry, Factual Databases, Enzyme Inhibitors/chemistry, Flavins/chemistry, Humans, Membrane Proteins/antagonists & inhibitors, Membrane Proteins/chemistry, Membrane Proteins/metabolism, Protein Binding/drug effects, Proto-Oncogene Proteins/antagonists & inhibitors, Proto-Oncogene Proteins/chemistry, Proto-Oncogene Proteins/metabolism, Proto-Oncogene Proteins c-bcl-2/antagonists & inhibitors, Proto-Oncogene Proteins c-bcl-2/chemistry, Proto-Oncogene Proteins c-bcl-2/metabolism, Pyruvate Kinase/antagonists & inhibitors, Software Design, User-Computer Interface
20.
J Cheminform ; 1(1): 10, 2009 Jul 06.
Article in English | MEDLINE | ID: mdl-20298516

ABSTRACT

BACKGROUND: Since 2004 public cheminformatic databases and their collective functionality for exploring relationships between compounds, protein sequences, literature and assay data have advanced dramatically. In parallel, commercial sources that extract and curate such relationships from journals and patents have also been expanding. This work updates a previous comparative study of databases chosen because of their bioactive content, availability of downloads and facility to select informative subsets. RESULTS: Where they could be calculated, extracted compounds-per-journal article were in the range of 12 to 19 but compound-per-protein counts increased with document numbers. Chemical structure filtration to facilitate standardised comparisons typically reduced source counts by between 5% and 30%. The pair-wise overlaps between 23 databases and subsets were determined, as well as changes between 2006 and 2008. While all compound sets have increased, PubChem has doubled to 14.2 million. The 2008 comparison matrix shows not only overlap but also unique content across all sources. Many of the detailed differences could be attributed to individual strategies for data selection and extraction. While there was a big increase in patent-derived structures entering PubChem since 2006, GVKBIO contains over 0.8 million unique structures from this source. Venn diagrams showed extensive overlap between compounds extracted by independent expert curation from journals by GVKBIO, WOMBAT (both commercial) and BindingDB (public) but each included unique content. In contrast, the approved drug collections from GVKBIO, MDDR (commercial) and DrugBank (public) showed surprisingly low overlap. Aggregating all commercial sources established that while 1 million compounds overlapped with PubChem 1.2 million did not. CONCLUSION: On the basis of chemical structure content per se public sources have covered an increasing proportion of commercial databases over the last two years. However, commercial products included in this study provide links between compounds and information from patents and journals at a larger scale than current public efforts. They also continue to capture a significant proportion of unique content. Our results thus demonstrate not only an encouraging overall expansion of data-supported bioactive chemical space but also that both commercial and public sources are complementary for its exploration.
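
A pair-wise overlap matrix and per-source unique content, as described in this abstract, reduce to set operations once every structure has been converted to a common key. The Python sketch below assumes that conversion has already happened; the function names, source names, and keys are hypothetical placeholders.

```python
from itertools import combinations


def pairwise_overlaps(sources):
    """Map each unordered pair of source names to its shared structure count."""
    return {
        (a, b): len(sources[a] & sources[b])
        for a, b in combinations(sorted(sources), 2)
    }


def unique_content(sources):
    """Map each source name to the structures found in no other source."""
    unique = {}
    for name, keys in sources.items():
        others = set().union(*(v for k, v in sources.items() if k != name))
        unique[name] = keys - others
    return unique


# Hypothetical sources holding standardized structure keys.
sources = {
    "source_a": {"K1", "K2", "K3", "K4"},
    "source_b": {"K3", "K4", "K5"},
    "source_c": {"K4", "K6"},
}
for pair, shared in pairwise_overlaps(sources).items():
    print(pair, shared)
for name, keys in unique_content(sources).items():
    print(name, sorted(keys))
```

The counts depend heavily on the standardization applied before comparison, which is consistent with the abstract's note that structure filtration reduced source counts by between 5% and 30%.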
