2.
Methods Mol Biol ; 2212: 17-35, 2021.
Article En | MEDLINE | ID: mdl-33733347

We present SNPInt-GPU, a software package providing several methods for statistical epistasis testing. SNPInt-GPU supports GPU acceleration via the Nvidia CUDA framework but can also be used without GPU hardware. The software implements logistic regression (as in PLINK epistasis testing), BOOST, log-linear regression, mutual information (MI), and information gain (IG) for pairwise testing, as well as mutual information and information gain for third-order tests. Optionally, r² scores for linkage disequilibrium (LD) testing can be calculated on the fly. SNPInt-GPU is publicly available on GitHub. The software requires a Linux-based operating system and the CUDA libraries. This chapter provides detailed installation and usage instructions as well as examples of basic preliminary quality control and analysis of results.


Algorithms , Data Curation/statistics & numerical data , Epistasis, Genetic , Software , Entropy , Humans , Linkage Disequilibrium , Logistic Models , Quality Control
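The pairwise mutual-information (MI) test mentioned in this entry can be sketched in a few lines. The genotype encoding and function names below are illustrative only and are not part of SNPInt-GPU's actual interface:

```python
from collections import Counter
from math import log2

def mutual_information(pairs):
    """MI (in bits) between a joint genotype code and a phenotype label.

    `pairs` is a list of (genotype_code, phenotype) tuples; the code for
    SNPs A and B can be 3*a + b for additive genotypes a, b in {0, 1, 2}.
    """
    n = len(pairs)
    joint = Counter(pairs)
    g_marg = Counter(g for g, _ in pairs)   # genotype marginal counts
    p_marg = Counter(p for _, p in pairs)   # phenotype marginal counts
    mi = 0.0
    for (g, p), c in joint.items():
        p_joint = c / n
        # p(g,p) * log2( p(g,p) / (p(g) * p(p)) )
        mi += p_joint * log2(p_joint * n * n / (g_marg[g] * p_marg[p]))
    return mi

# Toy example: the genotype pair perfectly predicts case/control status,
# so MI equals the entropy of the balanced phenotype (1 bit).
data = [(0, 0), (0, 0), (4, 1), (4, 1)]
print(mutual_information(data))  # → 1.0
```

SNPInt-GPU evaluates this statistic for every SNP pair on the GPU; the sketch only shows the per-pair arithmetic.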
3.
Nucleic Acids Res ; 49(D1): D1507-D1514, 2021 01 08.
Article En | MEDLINE | ID: mdl-33180112

Europe PMC (https://europepmc.org) is a database of research articles, including peer reviewed full text articles and abstracts, and preprints - all freely available for use via website, APIs and bulk download. This article outlines new developments since 2017 where work has focussed on three key areas: (i) Europe PMC has added to its core content to include life science preprint abstracts and a special collection of full text of COVID-19-related preprints. Europe PMC is unique as an aggregator of biomedical preprints alongside peer-reviewed articles, with over 180 000 preprints available to search. (ii) Europe PMC has significantly expanded its links to content related to the publications, such as links to Unpaywall, providing wider access to full text, preprint peer-review platforms, all major curated data resources in the life sciences, and experimental protocols. The redesigned Europe PMC website features the PubMed abstract and corresponding PMC full text merged into one article page; there is more evident and user-friendly navigation within articles and to related content, plus a figure browse feature. (iii) The expanded annotations platform offers ∼1.3 billion text mined biological terms and concepts sourced from 10 providers and over 40 global data resources.


Biological Science Disciplines/statistics & numerical data , COVID-19/prevention & control , Data Curation/statistics & numerical data , Data Mining/statistics & numerical data , Databases, Factual/statistics & numerical data , PubMed , SARS-CoV-2/isolation & purification , Biological Science Disciplines/methods , Biomedical Research/methods , Biomedical Research/statistics & numerical data , COVID-19/epidemiology , COVID-19/virology , Data Curation/methods , Data Mining/methods , Epidemics , Europe , Humans , Internet , SARS-CoV-2/physiology
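The Europe PMC APIs mentioned above are reachable over plain REST. A minimal sketch of building a search URL follows; the endpoint and the `SRC:PPR` preprint filter reflect the service's documented query syntax, but treat the specifics as assumptions to check against the current API documentation:

```python
from urllib.parse import urlencode

# Europe PMC REST search endpoint.
BASE = "https://www.ebi.ac.uk/europepmc/webservices/rest/search"

def search_url(query, page_size=25):
    """Build a Europe PMC search URL that returns JSON results."""
    params = {"query": query, "format": "json", "pageSize": page_size}
    return BASE + "?" + urlencode(params)

# SRC:PPR restricts results to preprints in Europe PMC's query syntax.
url = search_url('SRC:PPR AND "long covid"')
print(url)

# Fetching and parsing is then a standard-library one-liner, e.g.:
#   import json, urllib.request
#   hits = json.load(urllib.request.urlopen(url))["resultList"]["result"]
```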
4.
Nucleic Acids Res ; 49(D1): D1534-D1540, 2021 01 08.
Article En | MEDLINE | ID: mdl-33166392

Since the outbreak of the current pandemic in 2020, there has been a rapid growth of published articles on COVID-19 and SARS-CoV-2, with about 10,000 new articles added each month. This is causing an increasingly serious information overload, making it difficult for scientists, healthcare professionals and the general public to remain up to date on the latest SARS-CoV-2 and COVID-19 research. Hence, we developed LitCovid (https://www.ncbi.nlm.nih.gov/research/coronavirus/), a curated literature hub, to track up-to-date scientific information in PubMed. LitCovid is updated daily with newly identified relevant articles organized into curated categories. To support manual curation, advanced machine-learning and deep-learning algorithms have been developed, evaluated and integrated into the curation workflow. To the best of our knowledge, LitCovid is the first-of-its-kind COVID-19-specific literature resource, with all of its collected articles and curated data freely available. Since its release, LitCovid has been widely used, with millions of accesses by users worldwide for various information needs, such as evidence synthesis, drug discovery and text and data mining, among others.


COVID-19/prevention & control , Data Curation/statistics & numerical data , Data Mining/statistics & numerical data , Databases, Factual , PubMed/statistics & numerical data , SARS-CoV-2/isolation & purification , COVID-19/epidemiology , COVID-19/virology , Data Curation/methods , Data Mining/methods , Humans , Internet , Machine Learning , Pandemics , Publications/statistics & numerical data , SARS-CoV-2/physiology
6.
Molecules ; 24(8)2019 Apr 23.
Article En | MEDLINE | ID: mdl-31018579

The Toxicology in the 21st Century (Tox21) project seeks to develop and test methods for high-throughput examination of the effect certain chemical compounds have on biological systems. Although primary and toxicity assay data were readily available for multiple reporter-gene-modified cell lines, extensive annotation and curation were required to improve these datasets with respect to how FAIR (Findable, Accessible, Interoperable, and Reusable) they are. In this study, we fully annotated the Tox21 published data with relevant and accepted controlled vocabularies. After removing unreliable data points, we aggregated the results and created three sets of signatures reflecting activity in the reporter gene assays, cytotoxicity, and selective reporter gene activity, respectively. We benchmarked these signatures using the chemical structures of the tested compounds and obtained generally high receiver operating characteristic (ROC) scores, suggesting good quality and utility of these signatures and the underlying data. We analyzed the results to identify promiscuous individual compounds and chemotypes for the three signature categories and interpreted the results to illustrate the utility and re-usability of the datasets. With this study, we aimed to demonstrate the importance of data standards in reporting screening results and high-quality annotations to enable re-use and interpretation of these data. To improve the data with respect to all FAIR criteria, all assay annotations, cleaned and aggregate datasets, and signatures were made available as standardized dataset packages (Aggregated Tox21 bioactivity data, 2019).


Data Curation/statistics & numerical data , Gene Expression Regulation/drug effects , Metadata/standards , Pharmacogenetics/methods , Toxicology/methods , Xenobiotics/toxicity , Benchmarking , Datasets as Topic , Gene Expression Profiling , Genes, Reporter , High-Throughput Screening Assays/standards , Humans , Xenobiotics/chemistry , Xenobiotics/classification
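The ROC benchmarking step described above reduces to computing an AUC per signature. A self-contained sketch using the rank-sum identity; the toy labels and scores are invented, not Tox21 data:

```python
def roc_auc(labels, scores):
    """Area under the ROC curve via the rank-sum (Mann-Whitney) identity:
    AUC = P(score of a random active > score of a random inactive),
    counting ties as one half."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Toy benchmark: scores from a hypothetical structure-based model
# against a binary activity signature.
labels = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.8, 0.4, 0.5, 0.3, 0.1]
print(roc_auc(labels, scores))  # → 8/9 ≈ 0.889
```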
7.
Genet Epidemiol ; 43(4): 356-364, 2019 06.
Article En | MEDLINE | ID: mdl-30657194

When interpreting genome-wide association peaks, it is common to annotate each peak by searching for genes with plausible relationships to the trait. However, "all that glitters is not gold"; one might interpret apparent patterns in the data as plausible even when the peak is a false positive. Accordingly, we sought to see how human annotators interpreted association results containing a mixture of peaks from both the original trait and a genetically uncorrelated "synthetic" trait. Two of us prepared a mix of original and synthetic peaks of three significance categories from five different scans, along with relevant literature search results, and then we all annotated these regions. Three annotators also scored the strength of evidence connecting each peak to the scanned trait and the likelihood of further studying that region. While annotators found original peaks to have stronger evidence (Bonferroni-corrected p = 0.017) and a higher likelihood of further study (Bonferroni-corrected p = 0.006) than synthetic peaks, annotators often made convincing connections between the synthetic peaks and the original trait, finding these connections 55% of the time. These results show that it is not difficult for annotators to make convincing connections between synthetic association signals and genes found in those regions.


Data Curation , Data Interpretation, Statistical , False Positive Reactions , Genome-Wide Association Study/statistics & numerical data , Data Curation/methods , Data Curation/standards , Data Curation/statistics & numerical data , Deception , Genome-Wide Association Study/standards , Humans , Phenotype , Polymorphism, Single Nucleotide
8.
PLoS Comput Biol ; 14(8): e1006390, 2018 08.
Article En | MEDLINE | ID: mdl-30102703

Manually curating biomedical knowledge from publications is necessary to build a knowledge-based service that provides highly precise and organized information to users. The process of retrieving relevant publications for curation, also known as document triage, is usually carried out by querying and reading articles in PubMed. However, this query-based method often yields unsatisfactory precision and recall, and it is difficult to manually generate optimal queries. To address this, we propose a machine-learning-assisted triage method. We collect previously curated publications from two databases, UniProtKB/Swiss-Prot and the NHGRI-EBI GWAS Catalog, and use them as a gold-standard dataset for training deep-learning models based on convolutional neural networks. We then use the trained models to classify and rank new publications for curation. For evaluation, we apply our method to the real-world manual curation processes of UniProtKB/Swiss-Prot and the GWAS Catalog. We demonstrate that our machine-assisted triage method outperforms the current query-based triage methods, improves efficiency, and enriches curated content. Our method achieves precision 1.81 and 2.99 times higher than that of the current query-based triage methods of UniProtKB/Swiss-Prot and the GWAS Catalog, respectively, without compromising recall. In fact, our method retrieves many additional relevant publications that the query-based method of UniProtKB/Swiss-Prot could not find. As these results show, our machine-learning-based method can make the triage process more efficient and is being implemented in production so that human curators can focus on more challenging tasks to improve the quality of knowledge bases.


Data Curation/methods , Information Storage and Retrieval/methods , Data Curation/statistics & numerical data , Databases, Genetic , Databases, Protein , Deep Learning , Genomics , Knowledge Bases , Machine Learning , Publications
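The triage idea above (score each incoming publication with a trained model, then rank so curators see the most likely relevant papers first) can be illustrated without a CNN. The sketch below substitutes a deliberately simple smoothed log-odds scorer for the paper's deep-learning models, and the example documents are invented:

```python
from collections import Counter
import math

def train_scores(docs, labels, k=1.0):
    """Per-token log-odds of relevance with add-k smoothing.
    A simple stand-in for a learned classifier: the triage step
    is the same, score then rank."""
    rel, irr = Counter(), Counter()
    for doc, y in zip(docs, labels):
        (rel if y else irr).update(doc.lower().split())
    vocab = set(rel) | set(irr)
    n_rel = sum(rel.values()) + k * len(vocab)
    n_irr = sum(irr.values()) + k * len(vocab)
    return {w: math.log((rel[w] + k) / n_rel) - math.log((irr[w] + k) / n_irr)
            for w in vocab}

def rank(new_docs, weights):
    """Rank unseen documents by summed token log-odds, best first."""
    score = lambda d: sum(weights.get(w, 0.0) for w in d.lower().split())
    return sorted(new_docs, key=score, reverse=True)

weights = train_scores(["protein variant curated function",
                        "football match report"], [1, 0])
ranked = rank(["novel protein function assay", "match highlights"], weights)
print(ranked[0])  # → the protein abstract ranks first
```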
9.
Bioinformatics ; 33(21): 3454-3460, 2017 Nov 01.
Article En | MEDLINE | ID: mdl-29036270

MOTIVATION: Biological knowledgebases, such as UniProtKB/Swiss-Prot, constitute an essential component of daily scientific research by offering distilled, summarized and computable knowledge extracted from the literature by expert curators. While knowledgebases play an increasingly important role in the scientific community, their ability to keep up with the growth of biomedical literature is under scrutiny. Using UniProtKB/Swiss-Prot as a case study, we address this concern via multiple literature triage approaches. RESULTS: With the assistance of the PubTator text-mining tool, we tagged more than 10 000 articles to assess the ratio of papers relevant for curation. We first show that curators read and evaluate many more papers than they curate, and that measuring the number of curated publications is insufficient to provide a complete picture as demonstrated by the fact that 8000-10 000 papers are curated in UniProt each year while curators evaluate 50 000-70 000 papers per year. We show that 90% of the papers in PubMed are out of the scope of UniProt, that a maximum of 2-3% of the papers indexed in PubMed each year are relevant for UniProt curation, and that, despite appearances, expert curation in UniProt is scalable. AVAILABILITY AND IMPLEMENTATION: UniProt is freely available at http://www.uniprot.org/. CONTACT: sylvain.poux@sib.swiss. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.


Data Curation , Databases, Protein , Data Curation/statistics & numerical data , Data Mining , Databases, Protein/statistics & numerical data , Humans , Knowledge Bases , PubMed/statistics & numerical data , Review Literature as Topic , Statistics as Topic
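The scalability figures quoted in this abstract imply that only a fraction of the papers curators evaluate end up curated; a quick back-of-the-envelope check:

```python
# Figures from the abstract: 8,000-10,000 papers curated per year,
# 50,000-70,000 papers evaluated per year by UniProt curators.
curated = (8_000, 10_000)
evaluated = (50_000, 70_000)

lo = curated[0] / evaluated[1]   # most pessimistic bound
hi = curated[1] / evaluated[0]   # most optimistic bound
print(f"{lo:.0%} - {hi:.0%} of evaluated papers are curated")  # → 11% - 20%
```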
10.
Appl Neuropsychol Adult ; 22(6): 399-406, 2015.
Article En | MEDLINE | ID: mdl-25785544

Davis, Axelrod, McHugh, Hanks, and Millis (2013) documented that in a battery of 25 tests, producing 15, 10, and 5 abnormal scores at 1, 1.5, and 2 standard deviations below the norm-referenced mean, respectively, and an overall test battery mean (OTBM) of T ≤ 38 accurately identifies performance invalidity. However, the generalizability of these findings to other samples and test batteries remains unclear. This study evaluated the use of abnormal scores and the OTBM as performance validity measures in a different sample administered a 25-test battery that minimally overlapped with Davis et al.'s. Archival analysis of 48 examinees with mild traumatic brain injury seen for medico-legal purposes was conducted. Producing 18 or more, 7 or more, and 5 or more abnormal scores at 1, 1.5, and 2 standard deviations below the norm-referenced mean, respectively, and an OTBM of T ≤ 40 most accurately classified examinees; however, using Davis et al.'s proposed cutoffs in the current sample maintained specificity at or near acceptable levels. Due to convergence across studies, producing ≥5 abnormal scores at 2 standard deviations below the norm-referenced mean is the most appropriate cutoff for clinical implementation; however, for batteries consisting of a different number of tests than 25, an OTBM of T ≤ 38 is more appropriate.


Brain Injuries/complications , Cognition Disorders/diagnosis , Cognition Disorders/etiology , Neuropsychological Tests , Adult , Data Curation/statistics & numerical data , Disability Evaluation , Female , Humans , Male , Middle Aged , Psychometrics , Reference Values , Reproducibility of Results , Sensitivity and Specificity
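The decision rule this study evaluates (count abnormal scores at three deviation levels and check the overall test battery mean) can be sketched directly. The cutoffs follow the abstract's best-performing values for this sample; the toy battery is invented:

```python
# Cutoffs reported for this sample: >=18, >=7 and >=5 abnormal scores at
# 1, 1.5 and 2 SDs below the norm-referenced mean, plus OTBM of T <= 40.
CUTOFFS = {1: 18, 1.5: 7, 2: 5}

def invalidity_flags(t_scores, otbm_cut=40):
    """Return abnormal-score counts per SD level, the overall test
    battery mean (OTBM), and which cutoffs were met. For T scores
    (mean 50, SD 10), "abnormal at k SD" means T < 50 - 10*k."""
    counts = {sd: sum(t < 50 - 10 * sd for t in t_scores) for sd in CUTOFFS}
    otbm = sum(t_scores) / len(t_scores)
    hits = [sd for sd, cut in CUTOFFS.items() if counts[sd] >= cut]
    if otbm <= otbm_cut:
        hits.append("OTBM")
    return counts, otbm, hits

# Toy battery of 25 T scores with several very low scores:
battery = [28] * 6 + [33] * 3 + [45] * 16
counts, otbm, hits = invalidity_flags(battery)
print(counts, round(otbm, 2), hits)
```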
12.
Stud Health Technol Inform ; 205: 116-20, 2014.
Article En | MEDLINE | ID: mdl-25160157

Evaluation and validation have become a crucial problem for the development of semantic resources. We developed Ci4SeR, a graphical user interface to optimize curation work (not taking structural aspects into account), suitable for any type of resource with lightweight description logic. We tested it on OntoADR, an ontology of adverse drug reactions. A single curator reviewed 326 terms (1020 axioms) in an estimated 120 hours (2.71 concepts and 8.5 axioms reviewed per hour) and added 1874 new axioms (15.6 axioms per hour). Compared with previous manual endeavours, the interface increases the rate of concept review by 68% and of axiom addition by 486%. A wider use of Ci4SeR would help the curation of semantic resources and improve the completeness of knowledge modelling.


Adverse Drug Reaction Reporting Systems/statistics & numerical data , Data Curation/statistics & numerical data , Electronic Health Records/statistics & numerical data , Medical Record Linkage/methods , Semantics , Software , User-Computer Interface , Data Curation/methods , France , Information Storage and Retrieval/methods , Information Storage and Retrieval/statistics & numerical data , Natural Language Processing , Software Design , Vocabulary, Controlled
13.
Stud Health Technol Inform ; 205: 599-603, 2014.
Article En | MEDLINE | ID: mdl-25160256

Clinicians need historical information that does not change over time, as well as information from the notes of others, to inform their documentation; to save time, they cut and paste, since that is a feature of many conventional EHRs. Copy and paste meets clinicians' needs but has associated downsides, including errors. As part of a study of clinicians using an innovative system that gives them complete control over information selection and arrangement, two used note splitting to meet needs that are sometimes met through cut and paste; four others used text insertion (partial note sections) to address related needs. The purpose of this study is to enhance understanding of the note-splitting and text-insertion phenomena by describing the processes, the resulting creations, and the associated clinician rationales. Mixed methods included a think-aloud protocol and analysis of user interface creations and time sequences.


Documentation/methods , Electronic Health Records , Information Storage and Retrieval/methods , Needs Assessment , Utilization Review , Word Processing/statistics & numerical data , Writing , Data Curation/methods , Data Curation/statistics & numerical data , Documentation/statistics & numerical data , Information Storage and Retrieval/statistics & numerical data , Practice Patterns, Physicians'/statistics & numerical data , User-Computer Interface
14.
Comput Methods Programs Biomed ; 117(2): 104-13, 2014 Nov.
Article En | MEDLINE | ID: mdl-25168774

Studies in the health domain have shown that health websites provide imperfect information and give recommendations that are not up to date with the recent literature, even when their last-modified dates are quite recent. In this paper, we propose a framework that automatically assesses the timeliness of health website content using evidence-based medicine. Our aim is to assess the accordance of website content with the current literature and its timeliness, disregarding the update time stated on the website. The proposed method is based on automatic term recognition, relevance feedback, and information retrieval techniques to generate time-aware structured queries. We tested the framework on diabetes health websites archived by Archive-It between 2006 and 2013, using the American Diabetes Association's (ADA) guidelines. The results showed that the proposed framework achieves 65% and 77% accuracy in detecting the timeliness of web content according to years and pre-determined time intervals, respectively. Information seekers and website owners may benefit from the proposed framework in finding relevant and up-to-date diabetes websites.


Clinical Trials as Topic/statistics & numerical data , Data Curation/statistics & numerical data , Data Mining/methods , Diabetes Mellitus , Natural Language Processing , Periodicals as Topic/statistics & numerical data , Social Media/statistics & numerical data , Clinical Trials as Topic/classification , Data Curation/classification , Evidence-Based Medicine , Humans , Periodicals as Topic/classification , Social Media/classification , Time Factors
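The core idea (judge a page's timeliness by how well its terminology matches dated guideline editions, rather than trusting its last-modified stamp) can be caricatured in a few lines. The guideline term sets below are invented placeholders, not ADA content, and the matching is far cruder than the paper's time-aware structured queries:

```python
# Hypothetical term sets extracted from two dated guideline editions.
guideline_terms = {
    2008: {"metformin", "sulfonylurea", "a1c"},
    2013: {"metformin", "glp-1", "sglt2", "a1c"},
}

def estimate_year(page_terms):
    """Assign the guideline year whose term set best overlaps the page."""
    overlap = lambda y: len(page_terms & guideline_terms[y]) / len(guideline_terms[y])
    return max(guideline_terms, key=overlap)

print(estimate_year({"metformin", "sglt2", "a1c"}))  # → 2013
```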
17.
Stud Health Technol Inform ; 192: 1196, 2013.
Article En | MEDLINE | ID: mdl-23920970

OBJECTIVE: To quantitatively describe (1) differences between search results obtained at consecutive time points with the PubMed and OvidSP literature search interfaces over a five-day interval, and (2) the migration of citations through different subsets, to estimate the timeliness of OvidSP. METHODS: PubMed identifiers (PMIDs) of the following subsets were retrieved from PubMed and OvidSP simultaneously (within 8 h) on 11 days in March and April 2010, including 5 consecutive days: as supplied by publisher, in process, PubMed not MEDLINE, and OLDMEDLINE. Search results were compared via difference and intersection sets. The migration of citations at the individual level was determined by comparing corresponding sets over several days. RESULTS: The "in process" set was stable at about 446,000-452,000 citations; a small fraction, up to 3% of the total, fell into the PubMed-only and OvidSP-only subsets. About 96% of the ca. 10,500 citations in the OvidSP-only subset migrated out of the "in process" subset within 2 days. The OvidSP database is updated within a period of two days.


Data Curation/statistics & numerical data , Database Management Systems/statistics & numerical data , Information Storage and Retrieval/statistics & numerical data , Natural Language Processing , Periodicals as Topic/statistics & numerical data , PubMed/statistics & numerical data , Search Engine/statistics & numerical data , Abstracting and Indexing , Time Factors
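The subset comparisons in this study are plain set operations on PMIDs; a sketch with invented identifiers:

```python
# PMIDs retrieved from each interface on the same day (toy values).
pubmed_day1 = {101, 102, 103, 104}
ovidsp_day1 = {102, 103, 104, 105}

pubmed_only = pubmed_day1 - ovidsp_day1   # difference set: {101}
ovidsp_only = ovidsp_day1 - pubmed_day1   # difference set: {105}
both = pubmed_day1 & ovidsp_day1          # intersection set

# Migration: a citation leaving the "in process" subset between two days.
in_process_day1 = {201, 202, 203}
in_process_day2 = {202, 203}
migrated = in_process_day1 - in_process_day2  # {201}

print(pubmed_only, ovidsp_only, sorted(both), migrated)
```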
18.
Stud Health Technol Inform ; 192: 1001, 2013.
Article En | MEDLINE | ID: mdl-23920775

Formats for data storage on personal computers vary by manufacturer and model for personal health-monitoring devices such as blood-pressure and body-composition meters. In contrast, the data format of images from digital cameras is unified into the JPEG format with an Exif area and is already familiar to many users. We have devised a method that stores health data in a JPEG file. Health data is stored in the JPEG Exif area in an HL7 format. There is, however, a capacity limit of 64 KB for the Exif area. The aim of this study is to examine how much health data can actually be stored in the Exif area. We found that even with combined data from multiple devices, it was possible to store over a month of health data in a single JPEG file, and using multiple JPEG files simply overcomes this limit. We believe that this method will help people more easily handle health data regardless of the various device models they use.


Computer Graphics/statistics & numerical data , Computer Graphics/standards , Data Compression/statistics & numerical data , Data Compression/standards , Electronic Health Records/standards , Information Storage and Retrieval/statistics & numerical data , Information Storage and Retrieval/standards , Data Curation/standards , Data Curation/statistics & numerical data , Health Level Seven/standards
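The 64 KB Exif limit quoted above invites a quick capacity estimate; the per-record size and readings-per-day below are assumptions for illustration, not figures from the study:

```python
# Rough capacity estimate under the 64 KB Exif limit.
EXIF_LIMIT = 64 * 1024          # bytes available in the Exif area
record_bytes = 400              # assumed size of one HL7-encoded reading
readings_per_day = 4            # e.g. two BP + two body-composition readings

days = EXIF_LIMIT // (record_bytes * readings_per_day)
print(days)  # → 40 days fit in one JPEG; extra files lift the limit
```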
19.
Stud Health Technol Inform ; 192: 1021, 2013.
Article En | MEDLINE | ID: mdl-23920795

Standard Japanese electronic medical record (EMR) systems are associated with major shortcomings. For example, they do not assure lifelong readability of records, because each document requires its own viewing software, a system that is difficult to maintain over long periods of time. It can also be difficult for users to comprehend a patient's clinical history, because different classes of documents can only be accessed from their own windows. To address these problems, we developed a document-based electronic medical record that aggregates all documents for a patient in a PDF or DocuWorks format. We call this system the Document Archiving and Communication System (DACS). There are two types of viewers in the DACS: the Matrix View, which provides a timeline of a patient's history, and the Tree View, which stores the documents in hierarchical document classes. We placed 2,734 document classes into 11 categories. A total of 223,972 documents were entered per month, and the DACS viewer was used 268,644 times per month to assess patients' clinical histories.


Data Curation/statistics & numerical data , Electronic Health Records/statistics & numerical data , Hospital Communication Systems/statistics & numerical data , Information Storage and Retrieval/statistics & numerical data , Meaningful Use/statistics & numerical data , Utilization Review , Japan
...