Results 1 - 20 of 141
1.
PLoS Genet ; 20(5): e1011273, 2024 May.
Article in English | MEDLINE | ID: mdl-38728357

ABSTRACT

Existing imaging genetics studies have been mostly limited in scope by using imaging-derived phenotypes defined by human experts. Here, leveraging new breakthroughs in self-supervised deep representation learning, we propose a new approach, image-based genome-wide association study (iGWAS), for identifying genetic factors associated with phenotypes discovered from medical images using contrastive learning. Using retinal fundus photos, our model extracts a 128-dimensional vector representing features of the retina as phenotypes. After training the model on 40,000 images from the EyePACS dataset, we generated phenotypes from 130,329 images of 65,629 British White participants in the UK Biobank. We conducted GWAS on these phenotypes and identified 14 loci with genome-wide significance (p < 5×10⁻⁸ and intersection of hits from left and right eyes). We also did GWAS on the retina color, the average color of the center region of the retinal fundus photos. The GWAS of retina colors identified 34 loci, 7 of which overlap with the GWAS of the raw image phenotype. Our results establish the feasibility of this new framework of genomic study based on self-supervised phenotyping of medical images.
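
The significance rule quoted in the abstract (p below 5×10⁻⁸, and a hit in both eyes) can be illustrated with a minimal sketch. The function names, the dictionary layout, and the locus labels are assumptions made for illustration; they do not come from the iGWAS code.

```python
# Hypothetical sketch: keep loci that pass the genome-wide threshold in
# BOTH the left-eye and the right-eye association scans.
GENOME_WIDE_P = 5e-8  # conventional genome-wide significance threshold


def significant_loci(pvalues):
    """Return the set of locus labels whose p-value is below the threshold."""
    return {locus for locus, p in pvalues.items() if p < GENOME_WIDE_P}


def shared_hits(left_eye, right_eye):
    """Intersect the significant loci found in the two per-eye scans."""
    return significant_loci(left_eye) & significant_loci(right_eye)
```

With `left = {"locusA": 1e-9, "locusB": 1e-9}` and `right = {"locusA": 3e-8, "locusB": 1e-3}`, only `locusA` survives the intersection.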


Subject(s)
Fundus Oculi , Genome-Wide Association Study , Phenotype , Retina , Humans , Genome-Wide Association Study/methods , Retina/diagnostic imaging , Male , Polymorphism, Single Nucleotide , Female , Image Processing, Computer-Assisted/methods
2.
Genome Res ; 33(7): 1007-1014, 2023 07.
Article in English | MEDLINE | ID: mdl-37316352

ABSTRACT

The Li and Stephens (LS) hidden Markov model (HMM) models the process of reconstructing a haplotype as a mosaic copy of haplotypes in a reference panel. For small panels, the probabilistic parameterization of LS enables modeling the uncertainties of such mosaics. However, LS becomes inefficient when sample size is large, because of its linear time complexity. Recently the PBWT, an efficient data structure capturing the local haplotype matching among haplotypes, was proposed to offer a fast method for finding an optimal (Viterbi) solution to the LS HMM. Previously, we introduced the minimal positional substring cover (MPSC) problem as an alternative formulation of LS whose objective is to cover a query haplotype by a minimum number of segments from haplotypes in a reference panel. The MPSC formulation allows the generation of a haplotype threading in time independent of sample size (O(N), where N is the number of sites). This allows haplotype threading on very large biobank-scale panels on which the LS model is infeasible. Here, we present new results on the solution space of the MPSC. In addition, we derived a number of optimal algorithms for MPSC, including solution enumerations, the length maximal MPSC, and h-MPSC solutions. In doing so, our algorithms reveal the solution space of LS for large panels. We show that our method is informative in terms of revealing the characteristics of biobank-scale data sets and can improve genotype imputation.
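
The covering objective described above can be illustrated with a naive greedy sketch: from each position, copy from whichever panel haplotype matches the query for the longest stretch, then continue from where that stretch ends. This is only a reference illustration of the problem statement. It scans every haplotype, so its cost grows with panel size, unlike the PBWT-based algorithms in the paper. The function name and the string encoding of haplotypes are assumptions.

```python
def greedy_cover(query, panel):
    """Cover `query` with as few segments as possible, each copied from one
    panel haplotype at the same positions.

    Returns a list of (start, end, haplotype_index) triples with `end`
    exclusive, or None if some site of the query matches no haplotype.
    """
    n = len(query)
    cover = []
    start = 0
    while start < n:
        best_end, best_hap = start, None
        for h, hap in enumerate(panel):
            end = start
            # Extend the match with haplotype h as far as it goes.
            while end < n and hap[end] == query[end]:
                end += 1
            if end > best_end:
                best_end, best_hap = end, h
        if best_hap is None:
            return None  # no haplotype matches the query at this site
        cover.append((start, best_end, best_hap))
        start = best_end
    return cover
```

Taking the farthest-reaching match at each step yields a cover with the minimum number of segments, because any sub-interval of a positional match is itself a match.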


Subject(s)
Algorithms , Software , Humans , Haplotypes , Genotype , Ethnicity
3.
Genome Res ; 33(7): 1015-1022, 2023 07.
Article in English | MEDLINE | ID: mdl-37349109

ABSTRACT

Although rates of recombination events across the genome (genetic maps) are fundamental to genetic research, the majority of current studies only use one standard map. There is evidence suggesting population differences in genetic maps, and thus estimating population-specific maps is of interest. Although the recent availability of biobank-scale data offers such opportunities, current methods are not efficient at leveraging very large sample sizes. The most accurate methods are still linkage disequilibrium (LD)-based methods that are only tractable for a few hundred samples. In this work, we propose a fast and memory-efficient method for estimating genetic maps from population genotyping data. Our method, FastRecomb, leverages the efficient positional Burrows-Wheeler transform (PBWT) data structure for counting IBD segment boundaries as potential recombination events. We used PBWT blocks to avoid redundant counting of pairwise matches. Moreover, we used a panel-smoothing technique to reduce the noise from errors and recent mutations. Using simulation, we found that FastRecomb achieves state-of-the-art performance at 10-kb resolution, in terms of correlation coefficients between the estimated map and the ground truth. This is mainly because FastRecomb can effectively take advantage of large panels comprising more than hundreds of thousands of haplotypes. At the same time, other methods lack the efficiency to handle such data. We believe further refinement of FastRecomb would deliver more accurate genetic maps for the genetics community.


Subject(s)
Biological Specimen Banks , Genome , Haplotypes , Linkage Disequilibrium , Polymorphism, Single Nucleotide , Recombination, Genetic
4.
PLoS Genet ; 19(12): e1011057, 2023 Dec.
Article in English | MEDLINE | ID: mdl-38039339

ABSTRACT

Although genome-wide association studies (GWAS) have identified tens of thousands of genetic loci, the genetic architecture is still not fully understood for many complex traits. Most GWAS and sequencing association studies have focused on single nucleotide polymorphisms or copy number variations, including common and rare genetic variants. However, phased haplotype information is often ignored in GWAS or variant set tests for rare variants. Here we leverage the identity-by-descent (IBD) segments inferred from a random projection-based IBD detection algorithm in the mapping of genetic associations with complex traits, to develop a computationally efficient statistical test for IBD mapping in biobank-scale cohorts. We used sparse linear algebra and random matrix algorithms to speed up the computation, and a genome-wide IBD mapping scan of more than 400,000 samples finished within a few hours. Simulation studies showed that our new method had well-controlled type I error rates under the null hypothesis of no genetic association in large biobank-scale cohorts, and outperformed traditional GWAS single-variant tests when the causal variants were untyped and rare, or in the presence of haplotype effects. We also applied our method to IBD mapping of six anthropometric traits using the UK Biobank data and identified a total of 3,442 associations, 2,131 (62%) of which remained significant after conditioning on suggestive tag variants in the ± 3 centimorgan flanking regions from GWAS.


Subject(s)
Biological Specimen Banks , Genome-Wide Association Study , Humans , Genome-Wide Association Study/methods , DNA Copy Number Variations , Haplotypes/genetics , Phenotype , Polymorphism, Single Nucleotide/genetics
5.
Bioinformatics ; 39(1), 2023 01 01.
Article in English | MEDLINE | ID: mdl-36440908

ABSTRACT

MOTIVATION: The positional Burrows-Wheeler transform (PBWT) has led to tremendous strides in haplotype matching on biobank-scale data. For genetic genealogical search, PBWT-based methods have optimized the asymptotic runtime of finding long matches between a query haplotype and a predefined panel of haplotypes. However, to enable fast query searches, the full-sized panel and PBWT data structures must be kept in memory, preventing existing algorithms from scaling up to modern biobank panels consisting of millions of haplotypes. In this work, we propose a space-efficient variation of PBWT named Syllable-PBWT, which divides every haplotype into syllables, builds the PBWT positional prefix arrays on the compressed syllabic panel, and leverages the polynomial rolling hash function for positional substring comparison. With the Syllable-PBWT data structures, we then present a long match query algorithm named Syllable-Query. RESULTS: Compared to the most time- and space-efficient previously published solution to the long match query problem, Syllable-Query reduced the memory use by a factor of over 100 on both the UK Biobank genotype data and the 1000 Genomes Project sequence data. Surprisingly, the smaller size of our syllabic data structures allows for more efficient iteration and CPU cache usage, granting Syllable-Query even faster runtime than existing solutions. AVAILABILITY AND IMPLEMENTATION: https://github.com/ZhiGroup/Syllable-PBWT. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
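
The two ingredients named in the abstract, packing sites into syllables and comparing positional substrings with a polynomial rolling hash, can be sketched as follows. This is an illustrative sketch only: the syllable width, hash base, modulus, and all function names are assumptions, not values taken from the Syllable-PBWT repository.

```python
MOD = (1 << 61) - 1   # assumed large prime modulus
BASE = 1_000_003      # assumed hash base


def to_syllables(hap, width):
    """Pack a 0/1 string into integers, `width` sites per syllable.
    The last syllable is zero-padded on the right if needed."""
    padded = hap + "0" * (-len(hap) % width)
    return [int(padded[i:i + width], 2) for i in range(0, len(padded), width)]


def prefix_hashes(syllables):
    """hashes[i] is the polynomial hash of the first i syllables."""
    hashes = [0]
    for s in syllables:
        hashes.append((hashes[-1] * BASE + s + 1) % MOD)
    return hashes


def base_powers(n):
    """Powers BASE**0 .. BASE**n modulo MOD."""
    pows = [1]
    for _ in range(n):
        pows.append(pows[-1] * BASE % MOD)
    return pows


def range_hash(hashes, pows, lo, hi):
    """Hash of syllables lo..hi-1, computed in O(1) from the prefix hashes."""
    return (hashes[hi] - hashes[lo] * pows[hi - lo]) % MOD
```

Two haplotypes that agree on syllables `lo..hi-1` always get equal range hashes. Unequal hashes prove a difference, while equal hashes indicate a match only up to a small collision probability.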


Subject(s)
Algorithms , Genome , Haplotypes , Genotype , Software , Sequence Analysis, DNA/methods
6.
Bioinformatics ; 39(6), 2023 06 01.
Article in English | MEDLINE | ID: mdl-37166451

ABSTRACT

MOTIVATION: Due to the rapid growth of the genetic database size, genealogical search, a process of inferring familial relatedness by identifying DNA matches, has become a viable approach to help individuals find missing family members or law enforcement agencies locate suspects. A fast and accurate method is needed to search an out-of-database individual against millions of individuals. Most existing approaches only offer all-versus-all within-panel matching. Some prototype algorithms offer one-versus-all queries for an out-of-panel individual, but they do not tolerate errors. RESULTS: A new method, random projection-based identity-by-descent (IBD) detection (RaPID) query, is introduced to make fast genealogical search possible. RaPID-Query identifies IBD segments between a query haplotype and a panel of haplotypes. By integrating matches over multiple PBWT indexes, RaPID-Query manages to locate IBD segments quickly with a given cutoff length while allowing mismatched sites. A single query against all UK Biobank autosomal chromosomes was completed within 2.76 seconds on average, with the minimum length 7 cM and 700 markers. RaPID-Query achieved a 0.016 false negative rate and a 0.012 false positive rate simultaneously on a chromosome 20 sequencing panel having 86 265 sites. This is comparable to the state-of-the-art IBD detection methods TPBWT (out-of-sample) and Hap-IBD. The high-quality IBD segments yielded by RaPID-Query were able to distinguish up to the fourth degree of familial relatedness for a given individual pair, and the area under the receiver operating characteristic curve values are at least 97.28%. AVAILABILITY AND IMPLEMENTATION: The RaPID-Query program is available at https://github.com/ucfcbb/RaPID-Query.


Subject(s)
Algorithms , Chromosomes , Humans , Haplotypes , Sequence Analysis
7.
PLoS Genet ; 17(1): e1009315, 2021 01.
Article in English | MEDLINE | ID: mdl-33476339

ABSTRACT

Inference of relationships from whole-genome genetic data of a cohort is a crucial prerequisite for genome-wide association studies. Typically, relationships are inferred by computing the kinship coefficients (ϕ) and the genome-wide probability of zero IBD sharing (π0) among all pairs of individuals. Current leading methods are based on pairwise comparisons, which may not scale up to very large cohorts (e.g., sample size >1 million). Here, we propose an efficient relationship inference method, RAFFI. RAFFI leverages the efficient RaPID method to call IBD segments first, then estimates ϕ and π0 from the detected IBD segments. This inference is achieved by a data-driven approach that adjusts the estimation based on phasing quality and genotyping quality. Using simulations, we showed that RAFFI is robust against phasing/genotyping errors, admix events, and varying marker densities, and achieves higher accuracy compared to KING, the current leading method, especially for more distant relatives. When applied to the phased UK Biobank data with ~500K individuals, RAFFI is approximately 18 times faster than KING. We expect RAFFI will offer fast and accurate relatedness inference for even larger cohorts.


Subject(s)
Genome-Wide Association Study/statistics & numerical data , Genotyping Techniques/statistics & numerical data , Models, Genetic , Biological Specimen Banks , Genome, Human/genetics , Haplotypes/genetics , Humans , Pedigree , Polymorphism, Single Nucleotide/genetics
8.
BMC Bioinformatics ; 23(Suppl 6): 281, 2022 Jul 14.
Article in English | MEDLINE | ID: mdl-35836130

ABSTRACT

BACKGROUND: Model card reports aim to provide an informative and transparent description of machine learning models to stakeholders. This report document is of interest to the National Institutes of Health's Bridge2AI initiative to address the FAIR challenges with artificial intelligence-based machine learning models for biomedical research. We present our early undertaking in developing an ontology for capturing the conceptual-level information embedded in model card reports. RESULTS: Sourcing from existing ontologies and developing the core framework, we generated the Model Card Report Ontology. Our development efforts yielded an OWL2-based artifact that represents and formalizes model card report information. The current release of this ontology utilizes standard concepts and properties from OBO Foundry ontologies. Also, the software reasoner indicated no logical inconsistencies with the ontology. With sample model cards of machine learning models for bioinformatics research (HIV social networks and adverse outcome prediction for stent implantation), we showed the coverage and usefulness of our model in transforming static model card reports to a computable format for machine-based processing. CONCLUSIONS: The benefit of our work is that it utilizes expansive and standard terminologies and scientific rigor promoted by biomedical ontologists, as well as generating an avenue to make model cards machine-readable using semantic web technology. Our future goal is to assess the veracity of our model and later expand the model to include additional concepts to address terminological gaps. We discuss tools and software that will utilize our ontology for potential application services.


Subject(s)
Biological Ontologies , Semantics , Artificial Intelligence , Computational Biology , Machine Learning , Software
9.
Bioinformatics ; 37(16): 2390-2397, 2021 Aug 25.
Article in English | MEDLINE | ID: mdl-33624749

ABSTRACT

MOTIVATION: Durbin's positional Burrows-Wheeler transform (PBWT) is a scalable data structure for haplotype matching. It has been successfully applied to identical by descent (IBD) segment identification and genotype imputation. Once the PBWT of a haplotype panel is constructed, it supports efficient retrieval of all shared long segments among all individuals (long matches) and efficient query between an external haplotype and the panel. However, the standard PBWT is an array-based static data structure and does not support dynamic updates of the panel. RESULTS: Here, we generalize the static PBWT to a dynamic data structure, d-PBWT, where the reverse prefix sorting at each position is stored with linked lists. We also developed efficient algorithms for insertion and deletion of individual haplotypes. In addition, we verified that d-PBWT can support all algorithms of PBWT. In doing so, we systematically investigated variations of set maximal match and long match query algorithms: while they all have average case time complexity independent of database size, they have different worst case complexities and dependencies on additional data structures. AVAILABILITY AND IMPLEMENTATION: The benchmarking code is available at genome.ucf.edu/d-PBWT. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
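
The "reverse prefix sorting at each position" mentioned above can be sketched for the static, array-based case. The ordering at site k+1 is obtained from the ordering at site k by a stable partition on the allele at site k. This sketch stores each ordering in a plain Python list; the linked-list representation that makes d-PBWT dynamic is not shown, and the function name is an assumption.

```python
def prefix_orders(panel):
    """Return orders[k] for k = 0..N: the haplotype indices sorted by their
    reversed prefix panel[i][:k], for a bi-allelic panel of 0/1 strings."""
    order = list(range(len(panel)))
    orders = [order]
    n_sites = len(panel[0]) if panel else 0
    for k in range(n_sites):
        # Stable partition: haplotypes carrying allele 0 at site k come
        # first, and the previous relative order is kept inside each group.
        zeros = [i for i in order if panel[i][k] == "0"]
        ones = [i for i in order if panel[i][k] != "0"]
        order = zeros + ones
        orders.append(order)
    return orders
```

Each step costs time linear in the number of haplotypes, so building all orderings takes time proportional to the panel size (haplotypes times sites).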

10.
J Biomed Inform ; 133: 104166, 2022 09.
Article in English | MEDLINE | ID: mdl-35985620

ABSTRACT

Vancomycin is a commonly used antimicrobial in hospitals, and therapeutic drug monitoring (TDM) is required to optimize its efficacy and avoid toxicities. Bayesian models are currently recommended to predict the antibiotic levels. These models, however, although using carefully designed lab observations, were often developed in limited patient populations. The increasing availability of electronic health record (EHR) data offers an opportunity to develop TDM models for real-world patient populations. Here, we present a deep learning-based pharmacokinetic prediction model for vancomycin (PK-RNN-V E) using a large EHR dataset of 5,483 patients with 55,336 vancomycin administrations. PK-RNN-V E takes the patient's real-time sparse and irregular observations and offers dynamic predictions. Our results show that PK-RNN-V E offers a root mean squared error (RMSE) of 5.39 and outperforms the traditional Bayesian model (VTDM model) with an RMSE of 6.29. We believe that PK-RNN-V E can provide a pharmacokinetic model for vancomycin and other antimicrobials that require TDM.
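
For readers unfamiliar with the metric used to compare the two models, a minimal sketch of RMSE follows. The function name is illustrative, and the inputs in the usage note are made-up numbers, not data from the study.

```python
import math


def rmse(predicted, observed):
    """Root mean squared error between paired predictions and observations."""
    if len(predicted) != len(observed) or not predicted:
        raise ValueError("inputs must be non-empty and of equal length")
    squared = [(p - o) ** 2 for p, o in zip(predicted, observed)]
    return math.sqrt(sum(squared) / len(squared))
```

For example, `rmse([3.0, 4.0], [0.0, 0.0])` is the square root of (9 + 16) / 2 = 12.5. A lower value means predictions sit closer to the measured levels on average.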


Subject(s)
Deep Learning , Vancomycin , Bayes Theorem , Drug Monitoring/methods , Electronic Health Records , Humans , Vancomycin/therapeutic use
11.
BMC Biol ; 19(1): 32, 2021 02 16.
Article in English | MEDLINE | ID: mdl-33593342

ABSTRACT

BACKGROUND: The genealogical histories of individuals within populations are of interest to studies aiming both to uncover detailed pedigree information and overall quantitative population demographic histories. However, the analysis of quantitative details of individual genealogical histories has faced challenges from incomplete available pedigree records and an absence of objective and quantitative details in pedigree information. Although complete pedigree information for most individuals is difficult to track beyond a few generations, it is possible to describe a person's genealogical history using their genetic relatives revealed by identity by descent (IBD) segments: long genomic segments shared by two individuals within a population, which are identical due to inheritance from common ancestors. When modern biobanks collect genotype information for a significant fraction of a population, dense genetic connections of a person can be traced using such IBD segments, offering opportunities to characterize individuals in the context of the underlying populations. Here, we conducted an individual-centric analysis of IBD segments among the UK Biobank participants that represent 0.7% of the UK population. RESULTS: We made a high-quality call set of IBD segments over 5 cM among all 500,000 UK Biobank participants. On average, one UK individual shares IBD segments with 14,000 UK Biobank participants, which we refer to as "relatives." Using these segments, approximately 80% of a person's genome can be imputed. We subsequently propose genealogical descriptors based on the genetic connections of relative cohorts of individuals sharing at least one IBD segment and show that such descriptors offer important information about one's genetic makeup, personal genealogical history, and social behavior. Through analysis of relative counts sharing segments at different lengths, we identified a group, potentially British Jews, that has a distinct pattern of familial expansion history. Finally, using the enrichment of relatives in one's neighborhood, we identified regional variations of personal preference favoring living closer to one's extended families. CONCLUSIONS: Our analysis revealed genetic makeup, personal genealogical history, and social behaviors at the population scale, opening possibilities for further studies of individuals' genetic connections in biobank data.


Subject(s)
Biological Specimen Banks/statistics & numerical data , Genealogy and Heraldry , Genetic Variation , Pedigree , Humans , United Kingdom
12.
Bioinformatics ; 35(14): i233-i241, 2019 07 15.
Article in English | MEDLINE | ID: mdl-31510689

ABSTRACT

MOTIVATION: With the wide availability of whole-genome genotype data, there is an increasing need for conducting genetic genealogical searches efficiently. Computationally, this task amounts to identifying shared DNA segments between a query individual and a very large panel containing millions of haplotypes. The celebrated Positional Burrows-Wheeler Transform (PBWT) data structure is a pre-computed index of the panel that enables constant time matching at each position between one haplotype and an arbitrarily large panel. However, the existing algorithm (Durbin's Algorithm 5) can only identify set-maximal matches, the longest matches ending at any location in a panel, while in real genealogical search scenarios, multiple 'good enough' matches are desired. RESULTS: In this work, we developed two algorithmic extensions of Durbin's Algorithm 5 that can find all L-long matches, matches longer than or equal to a given length L, between a query and a panel. In the first algorithm, PBWT-Query, we introduce 'virtual insertion' of the query into the PBWT matrix of the panel, and then scan up and down for the PBWT match blocks with length greater than L. In our second algorithm, L-PBWT-Query, we further speed up PBWT-Query by introducing additional data structures that allow us to avoid iterating through blocks of incomplete matches. The efficiency of PBWT-Query and L-PBWT-Query is demonstrated using the simulated data and the UK Biobank data. Our results show that our proposed algorithms can detect related individuals for a given query efficiently in very large cohorts, which enables a fast on-line query search. AVAILABILITY AND IMPLEMENTATION: genome.ucf.edu/pbwt-query. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
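
To make the notion of an L-long match concrete, here is a brute-force reference sketch that reports every maximal run of agreement of length at least L between a query and each panel haplotype. It scans the whole panel and is therefore not the PBWT-Query or L-PBWT-Query algorithm; it only defines the expected output that such algorithms compute faster. The function name and string encoding are assumptions.

```python
def long_matches(query, panel, min_len):
    """Return (haplotype_index, start, end) for every maximal run of sites,
    with `end` exclusive, on which a panel haplotype equals the query and
    the run spans at least `min_len` sites."""
    matches = []
    n = len(query)
    for h, hap in enumerate(panel):
        start = 0
        for k in range(n + 1):
            # A run ends at a mismatch or at the end of the query.
            if k == n or hap[k] != query[k]:
                if k - start >= min_len:
                    matches.append((h, start, k))
                start = k + 1
    return matches
```

A brute-force checker like this can be handy for validating an indexed implementation on small simulated panels.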


Subject(s)
Haplotypes , Algorithms , Genome , Genotype , Software
13.
J Am Coll Nutr ; 39(1): 47-53, 2020 01.
Article in English | MEDLINE | ID: mdl-31498715

ABSTRACT

Objective: To investigate gut microbial composition in Latino infants in relation to breastfeeding, obesity, and antibiotic exposure. Method: We analyzed the gut microbiome in 6-month-old Latino infants from an ongoing urban mother-child cohort. Alpha and beta diversity were assessed in relation to infants' early dietary exposure and anthropometrics including obesity. Results: Infants exclusively breastfed at 4 to 6 weeks had lower alpha diversity and less bacterial abundance compared with those who were not. Breastfeeding status at 4 to 6 weeks and 6 months of age accounted for differences in alpha and beta diversity. Infants who were obese at 6 months of age had higher levels of alpha diversity compared with non-obese infants. Conclusions: Early exclusive breastfeeding and obesity impact microbial diversity by 6 months of age in Latino infants, a group at high risk for future obesity.


Subject(s)
Feeding Behavior/physiology , Gastrointestinal Microbiome/genetics , Hispanic or Latino/statistics & numerical data , Pediatric Obesity/ethnology , Pediatric Obesity/microbiology , Anthropometry , Anti-Bacterial Agents/adverse effects , Breast Feeding , Dietary Exposure/adverse effects , Feces/microbiology , Female , Humans , Infant , Linear Models , Male , RNA, Ribosomal, 16S/analysis
14.
Neurosurg Focus ; 48(5): E4, 2020 05 01.
Article in English | MEDLINE | ID: mdl-32357322

ABSTRACT

OBJECTIVE: Subarachnoid hemorrhage (SAH) is a devastating cerebrovascular condition, not only due to the effect of initial hemorrhage, but also due to the complication of delayed cerebral ischemia (DCI). While hypertension facilitated by vasopressors is often initiated to prevent DCI, which vasopressor is most effective in improving outcomes is not known. The objective of this study was to determine associations between initial vasopressor choice and mortality in patients with nontraumatic SAH. METHODS: The authors conducted a retrospective cohort study using a large, national electronic medical record data set from 2000-2014 to identify patients with a new diagnosis of nontraumatic SAH (based on ICD-9 codes) who were treated with the vasopressors dopamine, phenylephrine, or norepinephrine. The relationship between the initial choice of vasopressor therapy and the primary outcome, which was defined as in-hospital death or discharge to hospice care, was examined. RESULTS: In total, 2634 patients were identified with nontraumatic SAH who were treated with a vasopressor. In this cohort, the average age was 56.5 years, 63.9% were female, and 36.5% of patients developed the primary outcome. The incidence of the primary outcome was higher in those initially treated with either norepinephrine (47.6%) or dopamine (50.6%) than with phenylephrine (24.5%). After adjusting for possible confounders using propensity score methods, the adjusted OR of the primary outcome was higher with dopamine (OR 2.19, 95% CI 1.70-2.81) and norepinephrine (OR 2.24, 95% CI 1.80-2.80) compared with phenylephrine. Sensitivity analyses using different variable selection procedures, causal inference models, and machine-learning methods confirmed the main findings. CONCLUSIONS: In patients with nontraumatic SAH, phenylephrine was significantly associated with reduced mortality compared to dopamine or norepinephrine. Prospective randomized clinical studies are warranted to confirm this finding.


Subject(s)
Dopamine/therapeutic use , Electronic Health Records , Norepinephrine/therapeutic use , Phenylephrine/therapeutic use , Subarachnoid Hemorrhage/drug therapy , Vasoconstrictor Agents/therapeutic use , Adult , Aged , Female , Glasgow Coma Scale , Hospital Mortality , Humans , Logistic Models , Male , Middle Aged , Patient Discharge/statistics & numerical data , Retrospective Studies , Subarachnoid Hemorrhage/complications , Subarachnoid Hemorrhage/mortality
15.
J Med Internet Res ; 22(7): e16981, 2020 07 31.
Article in English | MEDLINE | ID: mdl-32735224

ABSTRACT

BACKGROUND: Asthma exacerbation is an acute or subacute episode of progressive worsening of asthma symptoms and can have a significant impact on patients' quality of life. However, efficient methods that can help identify personalized risk factors and make early predictions are lacking. OBJECTIVE: This study aims to use advanced deep learning models to better predict the risk of asthma exacerbations and to explore potential risk factors involved in progressive asthma. METHODS: We proposed a novel time-sensitive, attentive neural network to predict asthma exacerbation using clinical variables from large electronic health records. The clinical variables were collected from the Cerner Health Facts database between 1992 and 2015, including 31,433 adult patients with asthma. Interpretations on both patient and cohort levels were investigated based on the model parameters. RESULTS: The proposed model obtained an area under the curve value of 0.7003 through a five-fold cross-validation, which outperformed the baseline methods. The results also demonstrated that the addition of elapsed time embeddings considerably improved the prediction performance. Further analysis observed diverse distributions of contributing factors across patients as well as some possible cohort-level risk factors, such as respiratory diseases and esophageal reflux, for which supporting evidence could be found in peer-reviewed literature. CONCLUSIONS: The proposed neural network model performed better than previous methods for the prediction of asthma exacerbation. We believe that personalized risk scores and analyses of contributing factors can help clinicians better assess the individual's level of disease progression and afford the opportunity to adjust treatment, prevent exacerbation, and improve outcomes.


Subject(s)
Asthma/physiopathology , Deep Learning/standards , Neural Networks, Computer , Quality of Life/psychology , Disease Progression , Female , Humans , Male , Retrospective Studies , Risk Assessment , Risk Factors
16.
BMC Bioinformatics ; 20(Suppl 11): 279, 2019 Jun 06.
Article in English | MEDLINE | ID: mdl-31167638

ABSTRACT

BACKGROUND: Recent advances in whole-genome sequencing and SNP array technology have led to the generation of a large amount of genotype data. Large volumes of genotype data will require faster and more efficient methods for storing and searching the data. Positional Burrows-Wheeler Transform (PBWT) provides an appropriate data structure for bi-allelic data. With the increasing sample sizes, more multi-allelic sites are expected to be observed. Hence, there is a necessity to handle multi-allelic genotype data. RESULTS: In this paper, we introduce a multi-allelic version of the Positional Burrows-Wheeler Transform (mPBWT) based on the bi-allelic version for compression and searching. The time-complexity for constructing the data structure and searching within a panel containing t-allelic sites increases by a factor of t. CONCLUSION: Considering the small value for the possible alleles t, the time increase for the multi-allelic PBWT will be negligible and comparable to the bi-allelic version of PBWT.
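
One way to picture the multi-allelic generalization is to replace the two-way stable partition of the bi-allelic PBWT with a stable partition into t buckets, one per allele. The following is a hypothetical sketch of that single update step. It is not the mPBWT implementation and is not meant to reproduce the factor-of-t cost stated in the abstract; the function name and argument layout are assumptions.

```python
def multiallelic_step(order, alleles_at_site, n_alleles):
    """Advance a positional prefix ordering by one site.

    order            current ordering of haplotype indices
    alleles_at_site  alleles_at_site[i] is the allele (0..n_alleles-1)
                     carried by haplotype i at this site
    """
    buckets = [[] for _ in range(n_alleles)]
    for idx in order:
        # Appending preserves the previous relative order inside a bucket.
        buckets[alleles_at_site[idx]].append(idx)
    # Concatenate buckets in allele order to obtain the new ordering.
    return [idx for bucket in buckets for idx in bucket]
```

With `n_alleles=2` this reduces to the usual zeros-then-ones partition of the bi-allelic case.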


Subject(s)
Algorithms , Alleles , Data Compression , Genes , Haplotypes/genetics , Humans , Time Factors
17.
BMC Genomics ; 20(Suppl 1): 82, 2019 Feb 04.
Article in English | MEDLINE | ID: mdl-30712510

ABSTRACT

BACKGROUND: Existing functional descriptions of genes are categorical, discrete, and mostly produced through a manual process. In this work, we explore the idea of gene embedding, distributed representation of genes, in the spirit of word embedding. RESULTS: In a purely data-driven fashion, we trained a 200-dimension vector representation of all human genes, using gene co-expression patterns in 984 data sets from the GEO databases. These vectors capture functional relatedness of genes in terms of recovering known pathways - the average inner product (similarity) of genes within a pathway is 1.52X greater than that of random genes. Using t-SNE, we produced a gene co-expression map that shows local concentrations of tissue specific genes. We also illustrated the usefulness of the embedded gene vectors, laden with rich information on gene co-expression patterns, in tasks such as gene-gene interaction prediction. CONCLUSIONS: We proposed a machine learning method that utilizes transcriptome-wide gene co-expression to generate a distributed representation of genes. We further demonstrated the utility of our distributed representation by predicting gene-gene interaction based solely on gene names. The distributed representation of genes could be useful for more bioinformatics applications.


Subject(s)
Computational Biology/methods , Software , Algorithms , Computational Biology/standards , Epistasis, Genetic , Gene Expression Profiling/methods , Gene Expression Regulation , Humans , ROC Curve , Transcriptome , User-Computer Interface
18.
BMC Genomics ; 20(Suppl 1): 80, 2019 Feb 04.
Article in English | MEDLINE | ID: mdl-30712512

ABSTRACT

The sixth International Conference on Intelligent Biology and Medicine (ICIBM) took place in Los Angeles, California, USA on June 10-12, 2018. This conference featured eleven regular scientific sessions, four tutorials, one poster session, four keynote talks, and four eminent scholar talks. The scientific program covered a wide range of topics from bench to bedside, including 3D Genome Organization, reconstruction of large scale evolution of genomes and gene functions, artificial intelligence in biological and biomedical fields, and precision medicine. Both method development and application in genomic research continued to be a main component in the conference, including studies on genetic variants, regulation of transcription, genetic-epigenetic interaction at both single cell and tissue level and artificial intelligence. Here, we write a summary of the conference and also briefly introduce the four high quality papers selected to be published in BMC Genomics that cover novel methodology development or innovative data analysis.


Subject(s)
Artificial Intelligence , Biology , Medicine , Biology/methods , Humans , Medicine/methods
19.
Pharmacogenomics J ; 19(1): 97-108, 2019 02.
Article in English | MEDLINE | ID: mdl-29855607

ABSTRACT

We evaluated interactions of SNP-by-ACE-I/ARB and SNP-by-TD on serum potassium (K+) among users of antihypertensive treatments (anti-HTN). Our study included seven European-ancestry (EA) (N = 4835) and four African-ancestry (AA) cohorts (N = 2016). We performed race-stratified, fixed-effect, inverse-variance-weighted meta-analyses of 2.5 million SNP-by-drug interaction estimates; race-combined meta-analysis; and trans-ethnic fine-mapping. Among EAs, we identified 11 significant SNPs (P < 5 × 10⁻⁸) for SNP-ACE-I/ARB interactions on serum K+ that were located between NR2F1-AS1 and ARRDC3-AS1 on chromosome 5 (top SNP rs6878413 P = 1.7 × 10⁻⁸; ratio of serum K+ in ACE-I/ARB exposed compared to unexposed is 1.0476, 1.0280, 1.0088 for the TT, AT, and AA genotypes, respectively). Trans-ethnic fine mapping identified the same group of SNPs on chromosome 5 as genome-wide significant for the ACE-I/ARB analysis. In conclusion, SNP-by-ACE-I/ARB interaction analyses uncovered loci that, if replicated, could have future implications for the prevention of arrhythmias due to anti-HTN treatment-related hyperkalemia. Before these loci can be identified as clinically relevant, future validation studies of equal or greater size in comparison to our discovery effort are needed.
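
The fixed-effect, inverse-variance-weighted pooling named in the abstract follows a standard textbook formula: each cohort's estimate is weighted by the reciprocal of its squared standard error. A minimal sketch is shown below. The function name is an assumption and the numbers in the usage note are made up, not results from the study.

```python
import math


def ivw_meta(estimates, std_errors):
    """Fixed-effect inverse-variance-weighted meta-analysis.

    Returns (pooled_estimate, pooled_standard_error) for per-cohort effect
    estimates and their standard errors.
    """
    weights = [1.0 / se ** 2 for se in std_errors]
    total = sum(weights)
    pooled = sum(w * b for w, b in zip(weights, estimates)) / total
    pooled_se = math.sqrt(1.0 / total)
    return pooled, pooled_se
```

For two cohorts with estimates 0.10 and 0.20 and equal standard errors of 0.1, the pooled estimate is their plain average, 0.15, and the pooled standard error shrinks to 0.1 divided by the square root of 2.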


Subject(s)
Black or African American/genetics , Peptidyl-Dipeptidase A/genetics , Polymorphism, Single Nucleotide/genetics , Potassium/blood , Sodium Chloride Symporter Inhibitors/therapeutic use , White People/genetics , Aged , Antihypertensive Agents/therapeutic use , Chromosomes, Human, Pair 5/genetics , Europe , Female , Genome-Wide Association Study/methods , Genotype , Humans , Male , Middle Aged
20.
BMC Med Inform Decis Mak ; 19(Suppl 2): 58, 2019 04 09.
Article in English | MEDLINE | ID: mdl-30961579

ABSTRACT

BACKGROUND: Learning distributional representation of clinical concepts (e.g., diseases, drugs, and labs) is an important research area of deep learning in the medical domain. However, many existing relevant methods do not consider temporal dependencies along the longitudinal sequence of a patient's records, which may lead to incorrect selection of contexts. METHODS: To address this issue, we extended three popular concept embedding learning methods: word2vec, positive pointwise mutual information (PPMI) and FastText, to consider time-sensitive information. We then trained them on a large electronic health records (EHR) database containing about 50 million patients to generate concept embeddings and evaluated them using both intrinsic evaluations focusing on a concept similarity measure and an extrinsic evaluation assessing the use of the generated concept embeddings in the task of predicting disease onset. RESULTS: Our experiments show that embeddings learned from information within one visit (time window zero) improve performance on the concept similarity measure and the FastText algorithm usually had better performance than the other two algorithms. For the predictive modeling task, the optimal result was achieved by word2vec embeddings with a 30-day sliding window. CONCLUSIONS: Considering time constraints is important in training clinical concept embeddings. We expect they can benefit a series of downstream applications.
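
The idea of restricting context by elapsed time can be illustrated with a small sketch that counts concept pairs from one patient's record only when the two events fall within a given number of days of each other. Counts of this kind are the usual input to a PPMI matrix. The event layout, the function name, and the exact window definition are assumptions; the paper's own windowing may differ.

```python
from collections import Counter
from itertools import combinations


def windowed_pairs(events, window_days):
    """Count unordered concept pairs whose events lie at most `window_days`
    apart. `events` is a list of (day, concept) tuples for one patient;
    a window of 0 keeps only pairs recorded on the same day."""
    counts = Counter()
    for (d1, c1), (d2, c2) in combinations(sorted(events), 2):
        if c1 != c2 and d2 - d1 <= window_days:
            counts[tuple(sorted((c1, c2)))] += 1
    return counts
```

Widening the window from 0 to 30 days admits pairs from nearby visits while still excluding concepts recorded months apart.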


Subject(s)
Deep Learning , Electronic Health Records , Algorithms , Databases, Factual , Humans , Information Storage and Retrieval , Time Factors