ABSTRACT
Hypersensitivity reactions to drugs are often unpredictable and can be life-threatening, underscoring the need to understand their underlying mechanisms and risk factors. The extent to which germline genetic variation influences the risk of commonly reported drug allergies, such as penicillin allergy, remains largely unknown. We extracted data from the electronic health records of more than 600,000 participants from the UK, Estonian, and Vanderbilt University Medical Center's BioVU biobanks to study the role of genetic variation in the occurrence of self-reported penicillin hypersensitivity reactions. We used imputed SNP-to-HLA typing data from these cohorts to further fine-map the human leukocyte antigen (HLA) association and replicated our results in 23andMe's research cohort, involving a total of 1.12 million individuals. Genome-wide meta-analysis of penicillin allergy revealed two loci, including one located in the HLA region on chromosome 6. This signal was further fine-mapped to the HLA-B∗55:01 allele (OR 1.41, 95% CI 1.33-1.49, p = 2.04 × 10^-31) and confirmed by independent replication in 23andMe's research cohort (OR 1.30, 95% CI 1.25-1.34, p = 1.00 × 10^-47). The lead SNP was also associated with lower lymphocyte counts, and in silico follow-up suggests a potential effect of HLA-B∗55:01 on T lymphocytes. We also observed a significant hit in PTPN22, and the GWAS results correlated with the genetics of rheumatoid arthritis and psoriasis. We present robust evidence for the role of an allele of the major histocompatibility complex (MHC) I gene HLA-B in the occurrence of penicillin allergy.
Subject(s)
Arthritis, Rheumatoid/genetics, Drug Hypersensitivity/genetics, HLA-B Antigens/genetics, Polymorphism, Single Nucleotide, Protein Tyrosine Phosphatase, Non-Receptor Type 22/genetics, Psoriasis/genetics, Adult, Alleles, Arthritis, Rheumatoid/complications, Arthritis, Rheumatoid/immunology, Chromosomes, Human, Pair 6/chemistry, Drug Hypersensitivity/complications, Drug Hypersensitivity/etiology, Drug Hypersensitivity/immunology, Electronic Health Records, Europe, Female, Gene Expression, Genetic Loci, Genetic Predisposition to Disease, Genome, Human, Genome-Wide Association Study, HLA-B Antigens/immunology, Histocompatibility Testing, Humans, Male, Penicillins/adverse effects, Protein Tyrosine Phosphatase, Non-Receptor Type 22/immunology, Psoriasis/complications, Psoriasis/immunology, Self Report, T-Lymphocytes/immunology, T-Lymphocytes/pathology, United States
ABSTRACT
MOTIVATION: One of the main goals of large-scale methylation studies is to detect differentially methylated loci. One way is to approach this problem site-wise, i.e. to find differentially methylated positions (DMPs). However, methylation has been shown to be regulated over longer genomic regions, so it is more desirable to identify differentially methylated regions (DMRs) instead of DMPs. New high-coverage arrays, such as Illumina's 450k platform, make this possible at a reasonable cost. Few tools exist for DMR identification from this type of data, and there is no standard approach. RESULTS: We propose a novel method, seqlm, for DMR identification that detects region boundaries according to the minimum description length (MDL) principle, essentially solving the problem of model selection. The significance of the regions is established using linear mixed models. Using both simulated and large publicly available methylation datasets, we compare the performance of seqlm to alternative approaches and demonstrate that it is both more sensitive and more specific than competing methods. This is achieved with minimal parameter tuning and, surprisingly, the quickest running time of all the methods tried. Finally, we show that the regional differential methylation patterns identified on sparse array data are confirmed by higher-resolution sequencing approaches. AVAILABILITY AND IMPLEMENTATION: The methods have been implemented in the R package seqlm, available through GitHub: https://github.com/raivokolde/seqlm CONTACT: rkolde@gmail.com SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
Subject(s)
DNA Methylation, Datasets as Topic, Genome, Genomics, High-Throughput Nucleotide Sequencing
ABSTRACT
MOTIVATION: The increased availability of various genotyping techniques has initiated a race to find genetic markers that can be used in diagnostics and personalized medicine. Although many genetic risk factors are known, the key causes of common diseases with complex inheritance patterns remain unknown. Identifying such complex traits requires a targeted study over a large collection of data. Ideally, such studies bring together data from many biobanks. However, data aggregation on such a large scale raises many privacy issues. RESULTS: We show how to conduct such studies without violating the privacy of individual donors and without leaking the data to third parties. The presented solution has provable security guarantees. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
Subject(s)
Genome-Wide Association Study, Privacy, Case-Control Studies, Data Interpretation, Statistical, Genotyping Techniques, Humans
ABSTRACT
BACKGROUND: Biological data acquisition is raising new challenges in both data analysis and data handling. Not only is it proving hard to analyze the data at the rate it is generated today, but simply reading and transferring data files can be prohibitively slow due to their size. This primarily concerns logistics within and between data centers, but it is also important for workstation users in the analysis phase. Common usage patterns, such as comparing and transferring files, are proving computationally expensive and are tying down shared resources. RESULTS: We present an efficient method for calculating file uniqueness for large scientific data files that takes less computational effort than existing techniques. This method, called Probabilistic Fast File Fingerprinting (PFFF), exploits the variation present in biological data and computes file fingerprints by sampling randomly from the file instead of reading it in full. Consequently, it has a flat performance characteristic, correlated with data variation rather than file size. We demonstrate that probabilistic fingerprinting can be as reliable as existing hashing techniques, with a provably negligible risk of collisions. We measure the performance of the algorithm on a number of data storage and access technologies, identifying its strengths as well as its limitations. CONCLUSIONS: Probabilistic fingerprinting may significantly reduce the use of computational resources when comparing very large files. Utilisation of probabilistic fingerprinting techniques can increase the speed of common file-related workflows, both in the data center and for workbench analysis. The algorithm is implemented in an open-source tool named pfff, available both as a command-line tool and as a C library. The tool can be downloaded from http://biit.cs.ut.ee/pfff.
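The random-sampling idea can be sketched in a few lines of Python. This is an illustrative sketch, not the published pfff implementation: the function name, sample count, shared-seed scheme, and choice of SHA-256 are assumptions made for the example.

```python
import hashlib
import os
import random

def pfff_fingerprint(path, n_samples=1024, seed=0):
    """Fingerprint a file by hashing a bounded number of randomly sampled
    bytes plus the file length, so the cost stays flat as files grow."""
    size = os.path.getsize(path)
    # A shared, fixed seed makes all parties sample identical offsets,
    # so identical files always yield identical fingerprints.
    rng = random.Random(seed)
    offsets = sorted(rng.randrange(size) for _ in range(min(n_samples, size)))
    h = hashlib.sha256()
    h.update(size.to_bytes(8, "big"))  # include length to catch truncation
    with open(path, "rb") as f:
        for off in offsets:
            f.seek(off)
            h.update(f.read(1))
    return h.hexdigest()
```

Because only a fixed number of bytes is read, the work per file is roughly constant regardless of file size, which is the flat performance characteristic described above.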
Subject(s)
Algorithms, Electronic Data Processing/methods, Information Storage and Retrieval/methods, Software
ABSTRACT
MOTIVATION: Continued progress in technological platforms, the availability of many published experimental datasets, and the variety of statistical methods for analyzing those data make it possible to approach the same research question with several methods simultaneously. To get the best out of all these alternatives, we need to integrate their results in an unbiased manner. Prioritized gene lists are a common way of presenting results in genomic data analysis, so rank aggregation methods can serve as a useful and general solution for the integration task. RESULTS: Standard rank aggregation methods are often ill-suited for biological settings, where the gene lists are inherently noisy. As a remedy, we propose a novel robust rank aggregation (RRA) method. Our method detects genes that are ranked consistently better than expected under the null hypothesis of uncorrelated inputs and assigns a significance score to each gene. The underlying probabilistic model makes the algorithm parameter-free and robust to outliers, noise, and errors. The significance scores also provide a rigorous way to keep only the statistically relevant genes in the final list. These properties make our approach compelling for many settings. AVAILABILITY: The method is implemented as the GNU R package RobustRankAggreg, freely available at the Comprehensive R Archive Network: http://cran.r-project.org/.
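The order-statistics idea behind RRA can be sketched as follows. This is a simplified illustration rather than the RobustRankAggreg implementation: for each k it evaluates the probability that the k-th smallest of n uniform(0,1) normalized ranks is at most the observed value (computed here as the equivalent binomial tail), takes the minimum over k, and applies a Bonferroni correction.

```python
from math import comb

def rra_score(norm_ranks):
    """Simplified RRA rho score for one gene, given its normalized ranks
    (rank / list length) across n input lists.  Small scores mean the gene
    is ranked consistently better than expected under uncorrelated inputs."""
    r = sorted(norm_ranks)
    n = len(r)
    best = 1.0
    for k, rk in enumerate(r, start=1):
        # P(at least k of n uniform(0,1) variables fall below rk)
        tail = sum(comb(n, j) * rk**j * (1 - rk)**(n - j)
                   for j in range(k, n + 1))
        best = min(best, tail)
    return min(1.0, best * n)  # Bonferroni correction over the n choices of k
```

A gene near the top of every input list (e.g. normalized ranks 0.01, 0.02, 0.03) receives a very small score, while a gene with middling ranks scores near 1, which is how the significance cutoff removes irrelevant genes from the final list.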
Subject(s)
Algorithms, Computational Biology/methods, Genomics, Animals, Gene Expression Profiling, Gene Knockdown Techniques, Humans, Meta-Analysis as Topic, Mice, Stem Cells/metabolism, Yeasts/genetics
ABSTRACT
Neural network language models, such as BERT, can be used for information extraction from unstructured free-text medical records. These models can be pre-trained on a large corpus to learn the language and characteristics of the relevant domain and then fine-tuned with labeled data for a specific task. We propose a pipeline that uses human-in-the-loop labeling to create annotated data for Estonian healthcare information extraction. This method is particularly useful for low-resource languages and is more accessible to practitioners in the medical field than rule-based methods such as regular expressions.
Subject(s)
Information Storage and Retrieval, Natural Language Processing, Humans, Neural Networks, Computer, Language, Health Facilities
ABSTRACT
BACKGROUND: Ischemic stroke (IS) is a major health risk without generally usable, effective measures of primary prevention. Early warning signals that are easy to detect and widely available can save lives. Estonia has one nationwide Electronic Health Record (EHR) database for storing the medical information of patients from hospitals and primary care providers. METHODS: We extracted structured and unstructured data from the EHRs of participants of the Estonian Biobank (EstBB) and evaluated different formats of input data to understand how this continuously growing dataset should be prepared for best prediction. The utility of the EHR database for finding blood- and urine-based biomarkers for IS was demonstrated by applying different analytical and machine learning (ML) methods. RESULTS: Several early trends in common clinical laboratory parameters (red blood cell indices, lymphocyte/neutrophil ratio, etc.) were established for IS prediction. The developed ML models predicted the future occurrence of IS with very high accuracy, and Random Forest proved to be the most applicable method for EHR data. CONCLUSIONS: We conclude that the EHR database and the risk factors uncovered are valuable resources for screening the population for risk of IS, as well as for constructing disease risk scores and refining IS prediction models by ML.
Subject(s)
Electronic Health Records, Ischemic Stroke, Humans, Estonia/epidemiology, Risk Factors, Biomarkers
ABSTRACT
Objective: To describe a reusable process for transforming electronic health record (EHR), claims, and prescription data into the Observational Medical Outcomes Partnership (OMOP) Common Data Model (CDM), together with the challenges faced and solutions implemented. Materials and Methods: We used Estonian national health databases that store almost all residents' claims, prescriptions, and EHR records. To develop and demonstrate the transformation process of Estonian health data to OMOP CDM, we used a 10% random sample of the Estonian population (n = 150 824 patients) from 2012 to 2019 (the MAITT dataset). For the sample, complete information from all three databases was converted to OMOP CDM version 5.3. Validation was performed using open-source tools. Results: In total, we transformed over 100 million entries to standard concepts using standard OMOP vocabularies, with an average mapping rate of 95%. For conditions, observations, drugs, and measurements, the mapping rate was over 90%. In most cases, SNOMED Clinical Terms was used as the target vocabulary. Discussion: During the transformation process we encountered several challenges, which are described in detail with concrete examples and solutions. Conclusion: For a representative 10% random sample, we successfully transferred complete records from three national health databases to OMOP CDM and created a reusable transformation process. Our work helps future researchers transform linked databases into OMOP CDM more efficiently, ultimately leading to better real-world evidence.
ABSTRACT
Objective: To develop a framework for identifying temporal clinical event trajectories from Observational Medical Outcomes Partnership-formatted observational healthcare data. Materials and Methods: A four-step framework based on significant temporal event pair detection is described and implemented as an open-source R package. It is applied to a population-based Estonian dataset, first to replicate a large Danish population-based study and second, as an example, to conduct a disease trajectory detection study for type 2 diabetes patients in the Estonian and Dutch databases. Results: As a proof of concept, we apply the methods to the Estonian database and provide a detailed breakdown of our findings. All Estonian population-based event pairs are shown. We compare the event pairs identified in Estonia with the Danish and Dutch data and discuss the causes of the differences. The overlap in the results was only 2.4%, which highlights the need to run similar studies in different populations. Conclusions: For the first time, a complete software package is available for detecting disease trajectories in health data.
ABSTRACT
We improve the quality of cryptographically privacy-preserving genome-wide association studies by correctly handling population stratification: the inherent genetic differences between patient groups, e.g., people with different ancestries. Our approach is to use principal component analysis to reduce the dimensionality of the problem, yielding fewer spurious correlations between traits of interest and positions in the genome. While this approach is commonplace in practical genomic analysis, it has not previously been used in a privacy-preserving setting. In this paper, we tackle principal component analysis with cryptographically secure multi-party computation and present an implementation and experimental results showing the performance of the approach.
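Setting the cryptography aside, the stratification correction itself can be sketched in plain NumPy: compute the top principal components of the genotype matrix and project them out of both the SNP and the trait before testing for association. The function names and the choice of two components are illustrative, not taken from the paper.

```python
import numpy as np

def genotype_pcs(G, k=2):
    """Top-k principal-component scores of an (individuals x SNPs) matrix."""
    Gc = G - G.mean(axis=0)             # center each SNP column
    U, S, _ = np.linalg.svd(Gc, full_matrices=False)
    return U[:, :k] * S[:k]             # PC scores per individual

def pc_adjusted_corr(g, y, pcs):
    """SNP-trait correlation after regressing PC covariates out of both."""
    X = np.column_stack([np.ones(len(y)), pcs])
    resid = lambda v: v - X @ np.linalg.lstsq(X, v, rcond=None)[0]
    return float(np.corrcoef(resid(g), resid(y))[0, 1])
```

On synthetic data with two ancestral populations and a trait that differs only between populations, the raw SNP-trait correlation at a population-differentiated SNP is large, while the PC-adjusted correlation collapses toward zero, which is exactly the spurious-association removal described above.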
Subject(s)
Algorithms, Databases, Genetic, Genetic Privacy, Genomics/methods, Principal Component Analysis/methods, Computer Security
ABSTRACT
Transcription factors are proteins that bind to motifs on the DNA and thereby affect the regulation of gene expression. A qualitative description of the corresponding processes is therefore important for a better understanding of essential biological mechanisms. However, wet-lab experiments aimed at discovering the regulatory interplay between transcription factors and binding sites are expensive. We propose a new, purely computational method for finding putative associations between transcription factors and motifs. The method is based on a linear model that combines sequence information with expression data. We present several methods for model parameter estimation and show, via experiments on simulated data, that these methods are reliable. Finally, we examine the performance of the model on biological data and conclude that it can indeed be used to discover meaningful associations. The developed software is available as a web tool and as Scilab source code at http://biit.cs.ut.ee/gmat/.
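The kind of linear model described, expression explained by motif counts weighted by unknown regulatory activities, can be sketched as a least-squares fit. This is a toy illustration under assumed matrix shapes and names, not the paper's estimation procedures:

```python
import numpy as np

def estimate_tf_activities(M, e):
    """Fit e ~ M @ a by least squares, where M[g, t] counts occurrences of
    motif t near gene g, e[g] is the expression of gene g, and a[t] is the
    inferred activity of the transcription factor binding motif t."""
    a, *_ = np.linalg.lstsq(M, e, rcond=None)
    return a
```

With noise-free simulated data the true activities are recovered exactly, which mirrors the simulation experiments used in the paper to check that the estimation methods are reliable.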