
Data-driven information extraction and enrichment of molecular profiling data for cancer cell lines.

Smith, Ellery; Paloots, Rahel; Giagkos, Dimitris; Baudis, Michael; Stockinger, Kurt.

Bioinform Adv ; 4(1): vbae045, 2024.

Article En | MEDLINE | ID: mdl-38560553

Motivation: With the proliferation of research means and computational methodologies, published biomedical literature is growing exponentially in numbers and volume. Cancer cell lines are frequently used models in biological and medical research that are currently applied for a wide range of purposes, from studies of cellular mechanisms to drug development, which has led to a wealth of related data and publications. Sifting through large quantities of text to gather relevant information on cell lines of interest is tedious and extremely slow when performed by humans. Hence, novel computational information extraction and correlation mechanisms are required to boost meaningful knowledge extraction. Results: In this work, we present the design, implementation, and application of a novel data extraction and exploration system. This system extracts deep semantic relations between textual entities from scientific literature to enrich existing structured clinical data concerning cancer cell lines. We introduce a new public data exploration portal, which enables automatic linking of genomic copy number variants plots with ranked, related entities such as affected genes. Each relation is accompanied by literature-derived evidences, allowing for deep, yet rapid, literature search, using existing structured data as a springboard. Availability and implementation: Our system is publicly available on the web at https://cancercelllines.org.

cancercelllines.org - a novel resource for genomic variants in cancer cell lines.

Paloots, Rahel; Baudis, Michael.

Database (Oxford) ; 2024, 2024 Apr 30.

Article En | MEDLINE | ID: mdl-38687868

Cancer cell lines are an important component in biological and medical research, enabling studies of cellular mechanisms as well as the development and testing of pharmaceuticals. Genomic alterations in cancer cell lines are widely studied as models for oncogenetic events and are represented in a wide range of primary resources. We have created a comprehensive, curated knowledge resource, cancercelllines.org, with the aim to enable easy access to genomic profiling data in cancer cell lines, curated from a variety of resources and integrating both copy number and single nucleotide variants data. We have gathered over 5600 copy number profiles as well as single nucleotide variant annotations for 16 000 cell lines and provide these data with mappings to the GRCh38 reference genome. Both genomic variations and associated curated metadata can be queried through the GA4GH Beacon v2 Application Programming Interface (API) and a graphical user interface with extensive data retrieval enabled using GA4GH data schemas under a permissive licensing scheme. Database URL: https://cancercelllines.org.

Databases, Genetic , Genomics , Neoplasms , Humans , Cell Line, Tumor , Neoplasms/genetics , Genomics/methods , DNA Copy Number Variations/genetics , User-Computer Interface , Polymorphism, Single Nucleotide

Beacon v2 and Beacon networks: A "lingua franca" for federated data discovery in biomedical genomics, and beyond.

Rambla, Jordi; Baudis, Michael; Ariosa, Roberto; Beck, Tim; Fromont, Lauren A; Navarro, Arcadi; Paloots, Rahel; Rueda, Manuel; Saunders, Gary; Singh, Babita; Spalding, John D; Törnroos, Juha; Vasallo, Claudia; Veal, Colin D; Brookes, Anthony J.

Hum Mutat ; 43(6): 791-799, 2022 06.

Article En | MEDLINE | ID: mdl-35297548

Beacon is a basic data discovery protocol issued by the Global Alliance for Genomics and Health (GA4GH). The main goal addressed by version 1 of the Beacon protocol was to test the feasibility of broadly sharing human genomic data, through providing simple "yes" or "no" responses to queries about the presence of a given variant in datasets hosted by Beacon providers. The popularity of this concept has fostered the design of a version 2, that better serves real-world requirements and addresses the needs of clinical genomics research and healthcare, as assessed by several contributing projects and organizations. Particularly, rare disease genetics and cancer research will benefit from new case level and genomic variant level requests and the enabling of richer phenotype and clinical queries as well as support for fuzzy searches. Beacon is designed as a "lingua franca" to bridge data collections hosted in software solutions with different and rich interfaces. Beacon version 2 works alongside popular standards like Phenopackets, OMOP, or FHIR, allowing implementing consortia to return matches in beacon responses and provide a handover to their preferred data exchange format. The protocol is being explored by other research domains and is being tested in several international projects.

Genomics , Information Dissemination , Humans , Information Dissemination/methods , Phenotype , Rare Diseases , Software
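
The "yes" or "no" discovery model that the abstract above attributes to version 1 of the Beacon protocol can be illustrated with a minimal sketch. This is not the GA4GH API or its schema: the `Variant` class, the `beacon_v1_style_query` function, the dataset names, and all coordinates are hypothetical and exist only to show the idea of answering a presence query without returning the underlying records.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Variant:
    """Hypothetical, simplified variant key (not the Beacon schema)."""
    reference_name: str   # chromosome label, e.g. "1"
    start: int            # position of the variant
    reference_bases: str
    alternate_bases: str


def beacon_v1_style_query(datasets: dict[str, set[Variant]], query: Variant) -> dict:
    """Report only whether the variant is present, never the records themselves."""
    per_dataset = {name: query in variants for name, variants in datasets.items()}
    return {"exists": any(per_dataset.values()), "datasets": per_dataset}


# Toy, made-up data for illustration only.
datasets = {
    "toy_cell_lines": {Variant("1", 1000, "C", "T")},
    "toy_tumors": set(),
}

hit = beacon_v1_style_query(datasets, Variant("1", 1000, "C", "T"))
miss = beacon_v1_style_query(datasets, Variant("2", 2000, "A", "G"))
print(hit["exists"], miss["exists"])
```

The response deliberately carries only booleans, which mirrors the feasibility goal described for version 1; the richer case-level and phenotype queries mentioned for version 2 are outside the scope of this sketch.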

The Progenetix oncogenomic resource in 2021.

Huang, Qingyao; Carrio-Cordo, Paula; Gao, Bo; Paloots, Rahel; Baudis, Michael.

Database (Oxford) ; 2021, 2021 07 17.

Article En | MEDLINE | ID: mdl-34272855

In cancer, copy number aberrations (CNAs) represent a type of nearly ubiquitous and frequently extensive structural genome variations. To disentangle the molecular mechanisms underlying tumorigenesis as well as identify and characterize molecular subtypes, the comparative and meta-analysis of large genomic variant collections can be of immense importance. Over the last decades, cancer genomic profiling projects have resulted in a large amount of somatic genome variation profiles, however segregated in a multitude of individual studies and datasets. The Progenetix project, initiated in 2001, curates individual cancer CNA profiles and associated metadata from published oncogenomic studies and data repositories with the aim to empower integrative analyses spanning all different cancer biologies. During the last few years, the fields of genomics and cancer research have seen significant advancement in terms of molecular genetics technology, disease concepts, data standard harmonization as well as data availability, in an increasingly structured and systematic manner. For the Progenetix resource, continuous data integration, curation and maintenance have resulted in the most comprehensive representation of cancer genome CNA profiling data with 138 663 (including 115 357 tumor) copy number variation (CNV) profiles. In this article, we report a 4.5-fold increase in sample number since 2013, improvements in data quality, ontology representation with a CNV landscape summary over 51 distinctive National Cancer Institute Thesaurus cancer terms as well as updates in database schemas, and data access including new web front-end and programmatic data access. Database URL: progenetix.org.

DNA Copy Number Variations , Neoplasms , DNA Copy Number Variations/genetics , Genome , Genomics , Humans , Neoplasms/genetics