1.
Cancers (Basel) ; 15(9), 2023 Apr 25.
Article in English | MEDLINE | ID: mdl-37173930

ABSTRACT

Despite the unprecedented performance of deep neural networks (DNNs) in computer vision, their clinical application in the diagnosis and prognosis of cancer using medical imaging has been limited. One of the critical challenges for integrating diagnostic DNNs into radiological and oncological applications is their lack of interpretability, which prevents clinicians from understanding the model predictions. We therefore studied and propose the integration of expert-derived radiomics and DNN-predicted biomarkers in interpretable classifiers, which we refer to as ConRad, for computerized tomography (CT) scans of lung cancer. Importantly, the tumor biomarkers can be predicted from a concept bottleneck model (CBM) such that, once trained, our ConRad models do not require labor-intensive and time-consuming manual biomarker annotation. In our evaluation and practical application, the only input to ConRad is a segmented CT scan. The proposed model was compared to convolutional neural networks (CNNs), which act as black-box classifiers. We further investigated and evaluated all combinations of radiomics, predicted biomarkers and CNN features in five different classifiers. We found that the ConRad models using a nonlinear SVM and logistic regression with the Lasso outperformed the others in five-fold cross-validation, with the interpretability of ConRad being its primary advantage. The Lasso is used for feature selection, which substantially reduces the number of nonzero weights while increasing the accuracy. Overall, the proposed ConRad model combines CBM-derived biomarkers and radiomics features in an interpretable ML model that demonstrates excellent performance for lung nodule malignancy classification.
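
The abstract above credits the Lasso with reducing the number of nonzero weights. One way to see why an L1 penalty has that effect is the soft-thresholding operator, the proximal map of the L1 penalty, which sets small weights exactly to zero. The Python sketch below is only an illustration under stated assumptions: the feature names, weights, and penalty value are made up and are not taken from ConRad.

```python
def soft_threshold(weight: float, penalty: float) -> float:
    """Proximal operator of penalty * |w|: shrink toward zero, clip at zero."""
    if weight > penalty:
        return weight - penalty
    if weight < -penalty:
        return weight + penalty
    return 0.0


# Hypothetical weights, for illustration only (not ConRad's fitted values).
weights = {
    "radiomic_a": 0.80,
    "radiomic_b": -0.05,
    "biomarker_c": 0.30,
    "cnn_feat_d": 0.02,
}
penalty = 0.10

# Weights whose magnitude is below the penalty become exactly zero.
sparse = {name: soft_threshold(w, penalty) for name, w in weights.items()}
nonzero = [name for name, w in sparse.items() if w != 0.0]
print(nonzero)  # ['radiomic_a', 'biomarker_c']
```

In a fitted model, the surviving nonzero weights are the features a reader would inspect, which is the sense in which sparsity supports interpretability.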

3.
J Mol Cell Cardiol ; 145: 54-58, 2020 08.
Article in English | MEDLINE | ID: mdl-32504647

ABSTRACT

OBJECTIVE: During cardiovascular disease progression, molecular systems of the myocardium (e.g., a proteome) undergo diverse and distinct changes. Dynamic, temporally regulated alterations of individual molecules underlie the collective response of the heart to pathological drivers and the ultimate development of pathogenesis. Advances in high-throughput omics technologies have enabled cost-effective, temporal profiling of targeted systems in animal models of human diseases. However, computational analysis of temporal patterns from omics data remains challenging. In particular, bioinformatic pipelines involving unsupervised statistical approaches to support cardiovascular investigations are lacking, which hinders one's ability to extract biomedical insights from these complex datasets. APPROACH AND RESULTS: We developed a non-parametric data analysis platform to resolve computational challenges unique to temporal omics datasets. Our platform consists of three modules. Module I preprocesses the temporal data using either cubic splines or principal component analysis (PCA), and it simultaneously accomplishes the tasks of missing-data imputation and denoising. Module II performs an unsupervised classification by K-means or hierarchical clustering. Module III evaluates and identifies biological entities (e.g., molecular events) that exhibit strong associations with specific temporal patterns. The jackstraw method for cluster membership was applied to estimate p-values and posterior inclusion probabilities (PIPs), both of which guided feature selection. To demonstrate the utility of the analysis platform, we employed a temporal proteomics dataset that captured the proteome-wide dynamics of oxidative stress-induced post-translational modifications (O-PTMs) in mouse hearts undergoing isoproterenol (ISO)-induced hypertrophy. CONCLUSION: We have created a platform, CV.Signature.TCP, to identify distinct temporal clusters in omics datasets. We presented a cardiovascular use case to demonstrate its utility in unveiling biological insights underlying O-PTM regulations in cardiac remodeling. This platform is implemented in an open source R package (https://github.com/UCLA-BD2K/CV.Signature.TCP).
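
The clustering step described for Module II can be sketched in a few lines of Python. This is a minimal illustration of plain K-means (Lloyd's algorithm) on made-up temporal profiles, not the CV.Signature.TCP implementation, which is an R package. The `kmeans` helper, the farthest-first initialisation, and the toy data are assumptions made for the example, and the spline or PCA preprocessing of Module I is omitted.

```python
import math


def kmeans(profiles, k, iterations=20):
    """Plain Lloyd's algorithm with deterministic farthest-first initialisation."""
    centroids = [list(profiles[0])]
    while len(centroids) < k:
        # Next centroid: the profile farthest from its nearest existing centroid.
        farthest = max(
            profiles, key=lambda p: min(math.dist(p, c) for c in centroids)
        )
        centroids.append(list(farthest))

    labels = [0] * len(profiles)
    for _ in range(iterations):
        # Assignment step: each profile joins its nearest centroid.
        labels = [
            min(range(k), key=lambda j: math.dist(p, centroids[j]))
            for p in profiles
        ]
        # Update step: each centroid becomes the mean of its members.
        for j in range(k):
            members = [p for p, lab in zip(profiles, labels) if lab == j]
            if members:
                centroids[j] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids


# Made-up abundance profiles over four time points (rows = molecules).
profiles = [
    [0.0, 1.0, 2.0, 3.0],  # rising
    [0.1, 1.1, 2.1, 2.9],  # rising
    [3.0, 2.0, 1.0, 0.0],  # falling
    [3.1, 2.0, 0.9, 0.1],  # falling
]
labels, centroids = kmeans(profiles, k=2)
print(labels)  # [0, 0, 1, 1]
```

The rising and falling profiles end up in separate clusters. Assessing how confidently each molecule belongs to its cluster is the role the abstract assigns to the jackstraw in Module III.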


Subject(s)
Cardiovascular Diseases/genetics , Data Science , Gene Expression Profiling , Animals , Cluster Analysis , Cysteine/metabolism , Humans , Protein Processing, Post-Translational , Time Factors
4.
Bioinformatics ; 36(10): 3107-3114, 2020 05 01.
Article in English | MEDLINE | ID: mdl-32142108

ABSTRACT

MOTIVATION: Single-cell RNA-sequencing (scRNA-seq) allows us to dissect transcriptional heterogeneity arising from cellular types, spatio-temporal contexts and environmental stimuli. Transcriptional heterogeneity may reflect phenotypes and molecular signatures that are often unmeasured or unknown a priori. Cell identities of samples derived from heterogeneous subpopulations are then determined by clustering of scRNA-seq data. These cell identities are used in downstream analyses. How can we examine if cell identities are accurately inferred? Unlike external measurements or labels for single cells, using clustering-based cell identities results in spurious signals and false discoveries. RESULTS: We introduce non-parametric methods to evaluate cell identities by testing cluster memberships in an unsupervised manner. Diverse simulation studies demonstrate the accuracy of the jackstraw test for cluster membership. We propose a posterior probability that a cell should be included in that clustering-based subpopulation. Posterior inclusion probabilities (PIPs) for cluster memberships can be used to select and visualize samples relevant to subpopulations. The proposed methods are applied to three scRNA-seq datasets. First, a mixture of Jurkat and 293T cell lines provides two distinct cellular populations. Second, Cell Hashing yields cell identities corresponding to eight donors, which are independently analyzed by the jackstraw. Third, peripheral blood mononuclear cells are used to explore heterogeneous immune populations. The proposed P-values and PIPs lead to probabilistic feature selection of single cells that can be visualized using principal component analysis (PCA), t-distributed stochastic neighbor embedding (t-SNE) and others. By learning uncertainty in clustering high-dimensional data, the proposed methods enable unsupervised evaluation of cluster membership. AVAILABILITY AND IMPLEMENTATION: https://cran.r-project.org/package=jackstraw.
SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.


Subject(s)
Gene Expression Profiling , Leukocytes, Mononuclear , Algorithms , Cluster Analysis , Sequence Analysis, RNA , Single-Cell Analysis
5.
BMC Bioinformatics ; 20(Suppl 15): 644, 2019 Dec 24.
Article in English | MEDLINE | ID: mdl-31874610

ABSTRACT

BACKGROUND: Surveys of the presence and absence of specific species across multiple biogeographic units (or bioregions) are used in a broad range of biological studies, from ecology to microbiology. Using binary presence-absence data, we evaluate species co-occurrences that help elucidate relationships among organisms and environments. To summarize similarity between occurrences of species, we routinely use the Jaccard/Tanimoto coefficient, which is the ratio of their intersection to their union. It is natural, then, to identify statistically significant Jaccard/Tanimoto coefficients, which suggest non-random co-occurrences of species. However, statistical hypothesis testing using this similarity coefficient has seldom been used or studied. RESULTS: We introduce a hypothesis test for similarity for biological presence-absence data, using the Jaccard/Tanimoto coefficient. Several key improvements are presented, including unbiased estimation of expectation and centered Jaccard/Tanimoto coefficients that account for occurrence probabilities. The exact and asymptotic solutions are derived. To overcome the computational burden due to high dimensionality, we propose the bootstrap and measurement concentration algorithms to efficiently estimate the statistical significance of binary similarity. Comprehensive simulation studies demonstrate that our proposed methods produce accurate p-values and false discovery rates. The proposed estimation methods are orders of magnitude faster than the exact solution, particularly with increasing dimensionality. We showcase their applications in evaluating co-occurrences of bird species on 28 islands of Vanuatu and fish species in 3347 freshwater habitats in France. The proposed methods are implemented in an open source R package called jaccard (https://cran.r-project.org/package=jaccard).
CONCLUSION: We introduce a suite of statistical methods for the Jaccard/Tanimoto similarity coefficient for binary data that enable straightforward incorporation of probabilistic measures in analysis for species co-occurrences. Due to their generality, the proposed methods and implementations are applicable to a wide range of binary data arising from genomics, biochemistry, and other areas of science.
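
The coefficient defined in the abstract above (intersection over union of two presence-absence vectors) and the idea of testing it against chance can be sketched as follows. This Python example uses a generic permutation test as a stand-in; it is not the exact, asymptotic, bootstrap, or measurement concentration method of the jaccard R package, and it does not apply the centering described in the abstract. The species vectors and the helper names `jaccard` and `permutation_pvalue` are hypothetical.

```python
import random


def jaccard(x, y):
    """Jaccard/Tanimoto coefficient of two binary vectors: |x AND y| / |x OR y|."""
    both = sum(1 for a, b in zip(x, y) if a and b)
    either = sum(1 for a, b in zip(x, y) if a or b)
    return both / either if either else 0.0


def permutation_pvalue(x, y, n_perm=2000, seed=0):
    """One-sided permutation p-value: shuffle y to break any association with x."""
    rng = random.Random(seed)
    observed = jaccard(x, y)
    shuffled = list(y)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        if jaccard(x, shuffled) >= observed:
            hits += 1
    # Add-one correction keeps the estimate strictly above zero.
    return (hits + 1) / (n_perm + 1)


# Made-up presence/absence of two species across ten sites.
species_a = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
species_b = [1, 1, 1, 0, 0, 0, 0, 0, 0, 1]

similarity = jaccard(species_a, species_b)
p_value = permutation_pvalue(species_a, species_b)
print(similarity)  # 0.6 (3 shared sites out of 5 occupied by either species)
print(p_value)
```

Shuffling one vector keeps each species' number of occupied sites fixed, so the null distribution reflects the occurrence frequencies of the two species rather than assuming every site is equally likely to be occupied.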


Subject(s)
Freshwater Biology/methods , Algorithms , Animals , Biometry , Fishes , Probability
6.
J Comput Biol ; 26(8): 782-793, 2019 08.
Article in English | MEDLINE | ID: mdl-31045436

ABSTRACT

The development of single-cell RNA sequencing (scRNA-seq) has enabled innovative approaches to investigating mRNA abundances. In our study, we are interested in extracting the systematic patterns of scRNA-seq data in an unsupervised manner; thus, we have developed two extensions of robust principal component analysis (RPCA). First, we present a truncated version of RPCA (tRPCA), which is much faster and more memory-efficient. Second, we introduce noise reduction in tRPCA with L2 regularization. Unlike RPCA, which considers only a low-rank matrix L and a sparse matrix S, the proposed method can also extract a noise matrix E inherent in modern genomic data. We demonstrate its usefulness by applying our methods to peripheral blood mononuclear cell scRNA-seq data. In particular, clustering of the low-rank matrix L showcases better classification of unlabeled single cells. Overall, the proposed variants are well suited for the high-dimensional and noisy data that are routinely generated in genomics.


Subject(s)
Algorithms , Databases, Nucleic Acid , Sequence Analysis, RNA , Single-Cell Analysis , Humans
7.
Methods ; 166: 66-73, 2019 08 15.
Article in English | MEDLINE | ID: mdl-30853547

ABSTRACT

Integration of multi-omics in cardiovascular diseases (CVDs) holds high potential for translational discoveries. By analyzing abundance levels of heterogeneous molecules over time, we may uncover biological interactions and networks that were previously unidentifiable. However, to effectively perform integrative analysis of temporal multi-omics, computational methods must account for the heterogeneity and complexity in the data. To this end, we performed unsupervised classification of proteins and metabolites in mice during cardiac remodeling using two innovative deep learning (DL) approaches. First, a long short-term memory (LSTM)-based variational autoencoder (LSTM-VAE) was trained on time-series numeric data. The low-dimensional embeddings extracted from the LSTM-VAE were then used for clustering. Second, deep convolutional embedded clustering (DCEC) was applied to images of temporal trends. Instead of a two-step procedure, DCEC performs a joint optimization for image reconstruction and cluster assignment. Additionally, we performed K-means clustering, partitioning around medoids (PAM), and hierarchical clustering. Pathway enrichment analysis using the Reactome knowledgebase demonstrated that DL methods yielded higher numbers of significant biological pathways than conventional clustering algorithms. In particular, DCEC resulted in the highest number of enriched pathways, suggesting the strength of its unified framework based on visual similarities. Overall, unsupervised DL is shown to be a promising analytical approach for integrative analysis of temporal multi-omics.


Subject(s)
Computational Biology/methods , Deep Learning , Heart Ventricles/diagnostic imaging , Ventricular Remodeling/physiology , Algorithms , Cluster Analysis , Heart Ventricles/ultrastructure , Image Processing, Computer-Assisted
8.
Genes (Basel) ; 10(2), 2019 01 28.
Article in English | MEDLINE | ID: mdl-30696086

ABSTRACT

Recent developments in high-throughput technologies have accelerated the accumulation of massive amounts of omics data from multiple sources: genome, epigenome, transcriptome, proteome, metabolome, etc. Traditionally, data from each source (e.g., genome) are analyzed in isolation using statistical and machine learning (ML) methods. Integrative analysis of multi-omics and clinical data is key to new biomedical discoveries and advancements in precision medicine. However, data integration poses new computational challenges and exacerbates those associated with single-omics studies. Specialized computational approaches are required to effectively and efficiently perform integrative analysis of biomedical data acquired from diverse modalities. In this review, we discuss state-of-the-art ML-based approaches for tackling five specific computational challenges associated with integrative analysis: curse of dimensionality, data heterogeneity, missing data, class imbalance and scalability issues.


Subject(s)
Big Data , Computational Biology/methods , Machine Learning , Animals , Computational Biology/standards , Humans
9.
Nat Commun ; 9(1): 2656, 2018 07 09.
Article in English | MEDLINE | ID: mdl-29985403

ABSTRACT

Genome-wide analysis of transcription in the malaria parasite Plasmodium falciparum has revealed robust variation in steady-state mRNA abundance throughout the 48-h intraerythrocytic developmental cycle (IDC), suggesting that this process is highly dynamic and tightly regulated. Here, we utilize rapid 4-thiouracil (4-TU) incorporation via pyrimidine salvage to specifically label, capture, and quantify newly-synthesized RNA transcripts at every hour throughout the IDC. This high-resolution global analysis of the transcriptome captures the timing and rate of transcription for each newly synthesized mRNA in vivo, revealing active transcription throughout all IDC stages. Using a statistical model to predict the mRNA dynamics contributing to the total mRNA abundance at each timepoint, we find varying degrees of transcription and stabilization for each mRNA corresponding to developmental transitions. Finally, our results provide new insight into co-regulation of mRNAs throughout the IDC through regulatory DNA sequence motifs, thereby expanding our understanding of P. falciparum mRNA dynamics.


Subject(s)
Genes, Protozoan/genetics , Genome, Protozoan/genetics , Plasmodium falciparum/genetics , Transcription, Genetic , Erythrocytes/parasitology , Gene Expression Profiling , Gene Ontology , Humans , Malaria, Falciparum/parasitology , Plasmodium falciparum/physiology , RNA, Messenger/genetics , RNA, Messenger/metabolism , RNA, Protozoan/genetics , RNA, Protozoan/metabolism
10.
Trop Med Int Health ; 22(3): 332-339, 2017 03.
Article in English | MEDLINE | ID: mdl-28102027

ABSTRACT

OBJECTIVE: To describe engagement along the HIV continuum of care using a large network of clinics in Zambia. METHODS: We employed a practical framework to describe retention along the HIV treatment cascade, using routinely collected clinical data available in resource-constrained settings. We included health facilities in four Zambian provinces with more than 300 enrolled patients over the age of 5 years. We described attrition at each step, from HIV enrolment to 720 days after ART initiation. The population was further stratified by year of enrolment to describe temporal trends in patient engagement. RESULTS: From January 2004 to December 2014, 444 439 individuals over the age of 5 years sought HIV care at 75 eligible health facilities. Among those enrolled into HIV care, 82.1% (95% confidence interval [CI]: 79.4-84.5%) were fully assessed for ART eligibility within 180 days of enrolment and 63.6% (95% CI: 61.7-65.3%) were found to be eligible for ART based on the HIV treatment guidelines at the time. Of those patients eligible for ART, 81.1% (95% CI: 79.5-82.7%) initiated ART within 180 days. Patient retention in the ART programme was 81.2% (95% CI: 80.4-81.9%) at 90 days, 70.0% (95% CI: 68.7-71.2%) at 360 days and 61.6% (95% CI: 60.0-63.2%) at 720 days. We noted a steady decline in the proportions assessed for ART eligibility and deemed eligible for ART over the time frame. The proportions that started ART and remained in care stayed relatively consistent. CONCLUSION: We describe a simple approach for assessing patient engagement after enrolment into HIV care. Using limited types of data routinely available, we demonstrate an important and replicable approach to monitoring programmes in resource-constrained settings.


Subject(s)
Anti-HIV Agents/therapeutic use , Continuity of Patient Care , HIV Infections/drug therapy , Health Facilities , Patient Dropouts/statistics & numerical data , Adult , Female , Health Resources , Humans , Male , Patient Acceptance of Health Care , Program Evaluation/methods , Zambia
11.
Sci Rep ; 7: 40688, 2017 01 13.
Article in English | MEDLINE | ID: mdl-28084449

ABSTRACT

Since domestication, population bottlenecks, breed formation, and selective breeding have radically shaped the genealogy and genetics of Bos taurus. In turn, characterization of population structure among diverse bull (males of Bos taurus) genomes enables detailed assessment of genetic resources and origins. By analyzing 432 unrelated bull genomes from 13 breeds and 16 countries, we demonstrate genetic diversity and structural complexity among the European/Western cattle population. Importantly, we relaxed the strong assumption of discrete or admixed populations by adapting latent variable models for individual-specific allele frequencies that directly capture a wide range of complex structure from genome-wide genotypes. As measured by magnitude of differentiation, selection pressure on SNPs within genes is substantially greater than that on intergenic regions. Additionally, broad regions of chromosome 6 harboring the largest genetic differentiation suggest positive selection underlying population structure. We carried out gene set analysis using SNP annotations to identify enriched functional categories such as energy-related processes and multiple developmental stages. Our population structure analysis of bull genomes can support genetic management strategies that capture structural complexity and promote sustainable genetic breadth.


Subject(s)
Genetics, Population , Genome , Genomics , Animals , Breeding , Cattle , Cluster Analysis , Computational Biology/methods , Genetic Variation , Genomics/methods , Molecular Sequence Annotation , Polymorphism, Single Nucleotide , Selection, Genetic , Sequence Analysis, DNA
12.
Bioinformatics ; 31(4): 545-54, 2015 Feb 15.
Article in English | MEDLINE | ID: mdl-25336500

ABSTRACT

MOTIVATION: There are a number of well-established methods such as principal component analysis (PCA) for automatically capturing systematic variation due to latent variables in large-scale genomic data. PCA and related methods may directly provide a quantitative characterization of a complex biological variable that is otherwise difficult to precisely define or model. An unsolved problem in this context is how to systematically identify the genomic variables that are drivers of systematic variation captured by PCA. Principal components (PCs) (and other estimates of systematic variation) are directly constructed from the genomic variables themselves, making measures of statistical significance artificially inflated when using conventional methods due to over-fitting. RESULTS: We introduce a new approach called the jackstraw that allows one to accurately identify genomic variables that are statistically significantly associated with any subset or linear combination of PCs. The proposed method can greatly simplify complex significance testing problems encountered in genomics and can be used to identify the genomic variables significantly associated with latent variables. Using simulation, we demonstrate that our method attains accurate measures of statistical significance over a range of relevant scenarios. We consider yeast cell-cycle gene expression data, and show that the proposed method can be used to straightforwardly identify genes that are cell-cycle regulated with an accurate measure of statistical significance. We also analyze gene expression data from post-trauma patients, allowing the gene expression data to provide a molecularly driven phenotype. Using our method, we find a greater enrichment for inflammatory-related gene sets compared to the original analysis that uses a clinically defined, although likely imprecise, phenotype. The proposed method provides a useful bridge between large-scale quantifications of systematic variation and gene-level significance analyses. AVAILABILITY AND IMPLEMENTATION: An R software package, called jackstraw, is available in CRAN. CONTACT: jstorey@princeton.edu.
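
The over-fitting problem raised in the abstract above (PCs are built from the very variables being tested) suggests a resampling scheme along the following lines: permute one variable to make it null, re-estimate the PC with that synthetic null variable included, and record its association statistic. The Python sketch below is a simplified reading of that idea and not the jackstraw R package. It assumes a single PC, uses squared correlation as a stand-in for an F-statistic (the two are monotonically related for one PC), permutes one variable per iteration, and runs on simulated data; all function names are hypothetical.

```python
import random


def center(row):
    mean = sum(row) / len(row)
    return [v - mean for v in row]


def top_pc(rows, n_iter=50):
    """Leading sample-space principal component via power iteration on Y^T Y."""
    n = len(rows[0])
    vec = [float(i + 1) for i in range(n)]  # arbitrary deterministic start
    for _ in range(n_iter):
        # Multiply by Y^T Y without forming it: first Y v, then Y^T (Y v).
        scores = [sum(r[j] * vec[j] for j in range(n)) for r in rows]
        vec = [sum(s * r[j] for s, r in zip(scores, rows)) for j in range(n)]
        norm = sum(v * v for v in vec) ** 0.5
        vec = [v / norm for v in vec]
    return vec


def r_squared(row, pc):
    """Squared correlation between a centered variable and the PC."""
    num = sum(a * b for a, b in zip(row, pc)) ** 2
    den = sum(a * a for a in row) * sum(b * b for b in pc)
    return num / den if den else 0.0


def jackstraw_sketch(data, n_null=200, seed=0):
    rng = random.Random(seed)
    rows = [center(r) for r in data]
    pc = top_pc(rows)
    observed = [r_squared(r, pc) for r in rows]
    null_stats = []
    for _ in range(n_null):
        i = rng.randrange(len(rows))
        permuted = rows[i][:]
        rng.shuffle(permuted)  # break this variable's link to the samples
        synthetic = rows[:i] + [permuted] + rows[i + 1:]
        null_pc = top_pc(synthetic)  # PC re-estimated with the null variable in
        null_stats.append(r_squared(permuted, null_pc))
    # Empirical p-value of each observed statistic against the null statistics.
    return [
        (1 + sum(s >= obs for s in null_stats)) / (1 + n_null)
        for obs in observed
    ]


# Simulated data: 4 variables driven by a latent two-group pattern, 8 noise variables.
rng = random.Random(1)
pattern = [-1.0, -1.0, -1.0, -1.0, 1.0, 1.0, 1.0, 1.0]
data = [[3.0 * p + rng.gauss(0, 1) for p in pattern] for _ in range(4)]
data += [[rng.gauss(0, 1) for _ in pattern] for _ in range(8)]

pvalues = jackstraw_sketch(data)
print([round(p, 3) for p in pvalues])
```

Because each null statistic is computed against a PC that was itself fitted with the permuted variable present, the null distribution carries the same over-fitting inflation as the observed statistics, which is the point of the construction.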


Subject(s)
Algorithms , Data Interpretation, Statistical , Genomics/methods , Inflammation/genetics , Models, Statistical , Principal Component Analysis , Software , Computer Simulation , Gene Expression Profiling , Genes, cdc , Genetic Variation , Humans , Microarray Analysis , Phenotype , Saccharomyces cerevisiae/genetics , Saccharomyces cerevisiae Proteins/genetics , Stress Disorders, Post-Traumatic/genetics
13.
Genome Med ; 6(5): 40, 2014.
Article in English | MEDLINE | ID: mdl-24971157

ABSTRACT

BACKGROUND: Genetic risk scores have been developed for coronary artery disease and atherosclerosis, but are not predictive of adverse cardiovascular events. We asked whether peripheral blood expression profiles may be predictive of acute myocardial infarction (AMI) and/or cardiovascular death. METHODS: Peripheral blood samples from 338 subjects aged 62 ± 11 years with coronary artery disease (CAD) were analyzed in two phases (discovery N = 175, and replication N = 163), and followed for a mean of 2.4 years for cardiovascular death. Gene expression was measured on Illumina HT-12 microarrays with two different normalization procedures to control technical and biological covariates. Whole genome genotyping was used to support comparative genome-wide association studies of gene expression. Analysis of variance was combined with receiver operating characteristic curve and survival analysis to define a transcriptional signature of cardiovascular death. RESULTS: In both phases, there was significant differential expression between healthy and AMI groups with overall down-regulation of genes involved in T-lymphocyte signaling and up-regulation of inflammatory genes. Expression quantitative trait loci analysis provided evidence for altered local genetic regulation of transcript abundance in AMI samples. On follow-up there were 31 cardiovascular deaths. A principal component (PC1) score capturing covariance of 238 genes that were differentially expressed between deceased and survivors in the discovery phase significantly predicted risk of cardiovascular death in the replication and combined samples (hazard ratio = 8.5, P < 0.0001) and improved the C-statistic (area under the curve 0.82 to 0.91, P = 0.03) after adjustment for traditional covariates. CONCLUSIONS: A specific blood gene expression profile is associated with a significant risk of death in Caucasian subjects with CAD. This comprises a subset of transcripts that are also altered in expression during acute myocardial infarction.
