1.
BMC Bioinformatics ; 24(1): 256, 2023 Jun 17.
Article in English | MEDLINE | ID: mdl-37330471

ABSTRACT

BACKGROUND: Modeling of single cell RNA-sequencing (scRNA-seq) data remains challenging due to a high percentage of zeros and data heterogeneity, so improved modeling has strong potential to benefit many downstream data analyses. The existing zero-inflated or over-dispersed models are based on aggregations at either the gene or the cell level. However, they typically lose accuracy due to a too crude aggregation at those two levels. RESULTS: We avoid the crude approximations entailed by such aggregation through proposing an independent Poisson distribution (IPD) particularly at each individual entry in the scRNA-seq data matrix. This approach naturally and intuitively models the large number of zeros as matrix entries with a very small Poisson parameter. The critical challenge of cell clustering is approached via a novel data representation as Departures from a simple homogeneous IPD (DIPD) to capture the per-gene-per-cell intrinsic heterogeneity generated by cell clusters. Our experiments using real data and crafted experiments show that using DIPD as a data representation for scRNA-seq data can uncover novel cell subtypes that are missed or can only be found by careful parameter tuning using conventional methods. CONCLUSIONS: This new method has multiple advantages, including (1) no need for prior feature selection or manual optimization of hyperparameters; (2) flexibility to combine with and improve upon other methods, such as Seurat. Another novel contribution is the use of crafted experiments as part of the validation of our newly developed DIPD-based clustering pipeline. This new clustering pipeline is implemented in the R (CRAN) package scpoisson.


Subject(s)
RNA , Single-Cell Analysis , Sequence Analysis, RNA/methods , Poisson Distribution , Single-Cell Analysis/methods , Cluster Analysis , RNA/genetics , Gene Expression Profiling/methods
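The per-entry Poisson idea in entry 1 can be illustrated with a small sketch. The abstract does not give the formulas, so everything here is an assumption made for illustration: the homogeneous model is taken to be a rank-one rate (row total times column total over grand total), and the "departure" is taken to be a Pearson-type residual. The function name `poisson_departures` is hypothetical and is not part of the scpoisson package.

```python
import numpy as np

def poisson_departures(counts):
    """Departure of each matrix entry from a rank-one independent Poisson fit.

    Assumption (not from the abstract): the homogeneous rate for entry (i, j)
    is row_total_i * col_total_j / grand_total, and the departure is the
    Pearson-type residual (x - rate) / sqrt(rate).
    """
    counts = np.asarray(counts, dtype=float)
    row_tot = counts.sum(axis=1, keepdims=True)   # shape (n_rows, 1)
    col_tot = counts.sum(axis=0, keepdims=True)   # shape (1, n_cols)
    rate = row_tot @ col_tot / counts.sum()       # one Poisson rate per entry
    rate = np.maximum(rate, 1e-12)                # guard against division by zero
    return (counts - rate) / np.sqrt(rate)
```

Under this reading, a zero count in a cell with a very small fitted rate yields a departure near zero, so the many zeros in the matrix need no separate zero-inflation term; large departures mark entries that a single homogeneous population does not explain.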
2.
Osteoarthritis Cartilage ; 27(7): 994-1001, 2019 07.
Article in English | MEDLINE | ID: mdl-31002938

ABSTRACT

OBJECTIVE: Knee osteoarthritis (KOA) is a heterogeneous condition representing a variety of potentially distinct phenotypes. The purpose of this study was to apply innovative machine learning approaches to KOA phenotyping in order to define progression phenotypes that are potentially more responsive to interventions. DESIGN: We used publicly available data from the Foundation for the National Institutes of Health (FNIH) osteoarthritis (OA) Biomarkers Consortium, where radiographic (medial joint space narrowing of ≥0.7 mm), and pain progression (increase of ≥9 Western Ontario and McMaster Universities Osteoarthritis Index [WOMAC] points) were defined at 48 months, as four mutually exclusive outcome groups (none, both, pain only, radiographic only), along with an extensive set of covariates. We applied distance weighted discrimination (DWD), direction-projection-permutation (DiProPerm) testing, and clustering methods to focus on the contrast (z-scores) between those progressing by both criteria ("progressors") and those progressing by neither ("non-progressors"). RESULTS: Using all observations (597 individuals, 59% women, mean age 62 years and BMI 31 kg/m2) and all 73 baseline variables available in the dataset, there was a clear separation among progressors and non-progressors (z = 10.1). Higher z-scores were seen for the magnetic resonance imaging (MRI)-based variables than for demographic/clinical variables or biochemical markers. Baseline variables with the greatest contribution to non-progression at 48 months included WOMAC pain, lateral meniscal extrusion, and serum N-terminal pro-peptide of collagen IIA (PIIANP), while those contributing to progression included bone marrow lesions, osteophytes, medial meniscal extrusion, and urine C-terminal crosslinked telopeptide type II collagen (CTX-II). 
CONCLUSIONS: Using methods that provide a way to assess numerous variables of different types and scalings simultaneously in relation to an outcome of interest enabled a data-driven approach that identified key variables associated with a progression phenotype.


Subject(s)
Biological Variation, Population/genetics , Cartilage, Articular/pathology , Machine Learning , Osteoarthritis, Knee/genetics , Osteoarthritis, Knee/pathology , Aged , Biomarkers/blood , Cartilage, Articular/diagnostic imaging , Cartilage, Articular/physiopathology , Collagen Type II/blood , Congresses as Topic , Databases, Factual , Disease Progression , Female , Humans , Male , Menisci, Tibial/pathology , Middle Aged , National Institutes of Health (U.S.) , Osteoarthritis, Knee/diagnostic imaging , Pain Measurement , Severity of Illness Index , United States
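The direction-projection-permutation (DiProPerm) test named in entry 2 can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the mean-difference direction is swapped in for DWD (which needs a cone-programming solver), the test statistic is assumed to be the difference of projected class means, and the function name `diproperm_z` is hypothetical.

```python
import numpy as np

def diproperm_z(X, y, n_perm=200, seed=0):
    """Z-score of class separation along a data-driven direction.

    Stand-in assumption: the direction is the normalized difference of class
    means, not DWD. The direction is refit for every permutation of the labels.
    """
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=bool)

    def separation(labels):
        direction = X[labels].mean(axis=0) - X[~labels].mean(axis=0)
        norm = np.linalg.norm(direction)
        if norm == 0.0:
            return 0.0
        scores = X @ (direction / norm)           # project every sample
        return scores[labels].mean() - scores[~labels].mean()

    observed = separation(y)
    null = np.array([separation(rng.permutation(y)) for _ in range(n_perm)])
    # Compare the observed statistic with the permutation null distribution.
    return (observed - null.mean()) / null.std(ddof=1)
```

Refitting the direction inside each permutation matters: a direction trained on the real labels would separate almost any high-dimensional sample, so the null must include the same fitting step.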
3.
Biometrics ; 74(2): 439-447, 2018 06.
Article in English | MEDLINE | ID: mdl-28853138

ABSTRACT

Genotype eigenvectors are widely used as covariates for control of spurious stratification in genetic association. Significance testing for the accompanying eigenvalues has typically been based on a standard Tracy-Widom limiting distribution for the largest eigenvalue, derived under white-noise assumptions. It is known that even modest local correlation among markers inflates the largest eigenvalues, even in the absence of true stratification. In addition, a few sample eigenvalues may be extreme, creating further complications in accurate testing. We explore several methods to identify appropriate null eigenvalue thresholds, while remaining sensitive to eigenvalues corresponding to population stratification. We introduce a novel block permutation approach, designed to produce an appropriate null eigenvalue distribution by eliminating long-range genomic correlation while preserving local correlation. We also propose a fast approach based on eigenvalue distribution modeling, using a simple fit criterion and the general Marcenko-Pastur equation under a simple discrete eigenvalue model. Block permutation and the model-based approach work well for pure simulations and for data resampled from the 1000 Genomes project. In contrast, we find that the standard approach of computing an "effective" number of markers does not perform well. The performance of the methods is also demonstrated for a motivating example from the International Cystic Fibrosis Consortium.


Subject(s)
Genetic Association Studies/methods , Models, Statistical , Computer Simulation , Cystic Fibrosis/genetics , Data Interpretation, Statistical , Genomics/methods , Genotype , Humans , Models, Genetic
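The block permutation idea in entry 3 (remove long-range correlation, keep local correlation) can be sketched as follows. This is an illustrative reading of the abstract only: fixed-size contiguous blocks, column standardization, and the function name `block_permutation_null` are assumptions, and the paper's actual block definition may differ.

```python
import numpy as np

def block_permutation_null(G, block_size, n_perm=100, seed=0):
    """Null distribution of the largest eigenvalue under block permutation.

    G is a samples-by-markers matrix. Within each block of adjacent markers the
    sample order is shuffled as a unit, so correlation inside a block survives
    while correlation between blocks is destroyed.
    """
    rng = np.random.default_rng(seed)
    G = np.asarray(G, dtype=float)
    n, m = G.shape
    top = np.empty(n_perm)
    for b in range(n_perm):
        P = np.empty_like(G)
        for start in range(0, m, block_size):
            rows = rng.permutation(n)             # one shuffle per block
            P[:, start:start + block_size] = G[rows, start:start + block_size]
        Z = P - P.mean(axis=0)
        sd = Z.std(axis=0)
        sd[sd == 0] = 1.0                         # leave constant markers at zero
        Z /= sd
        # Largest eigenvalue of the n-by-n sample relationship matrix.
        top[b] = np.linalg.eigvalsh(Z @ Z.T / m)[-1]
    return top
```

An observed eigenvalue would then be compared with an upper quantile of the returned values, for example `np.quantile(null, 0.95)`.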
4.
Neuroimage ; 152: 38-49, 2017 05 15.
Article in English | MEDLINE | ID: mdl-28246033

ABSTRACT

A major goal in neuroscience is to understand the neural pathways underlying human behavior. We introduce the recently developed Joint and Individual Variation Explained (JIVE) method to the neuroscience community to simultaneously analyze imaging and behavioral data from the Human Connectome Project. Motivated by recent computational and theoretical improvements in the JIVE approach, we simultaneously explore the joint and individual variation between and within imaging and behavioral data. In particular, we demonstrate that JIVE is an effective and efficient approach for integrating task fMRI and behavioral variables using three examples: one example where task variation is strong, one where task variation is weak and a reference case where the behavior is not directly related to the image. These examples are provided to visualize the different levels of signal found in the joint variation including working memory regions in the image data and accuracy and response time from the in-task behavioral variables. Joint analysis provides insights not available from conventional single block decomposition methods such as Singular Value Decomposition. Additionally, the joint variation estimated by JIVE appears to more clearly identify the working memory regions than Partial Least Squares (PLS), while Canonical Correlation Analysis (CCA) gives grossly overfit results. The individual variation in JIVE captures the behavior unrelated signals such as a background activation that is spatially homogeneous and activation in the default mode network. The information revealed by this individual variation is not examined in traditional methods such as CCA and PLS. We suggest that JIVE can be used as an alternative to PLS and CCA to improve estimation of the signal common to two or more datasets and reveal novel insights into the signal unique to each dataset.


Subject(s)
Brain/anatomy & histology , Brain/physiology , Connectome/methods , Adult , Humans , Image Processing, Computer-Assisted , Magnetic Resonance Imaging , Signal Processing, Computer-Assisted , Software , Young Adult
5.
Osteoarthritis Cartilage ; 24(4): 640-6, 2016 Apr.
Article in English | MEDLINE | ID: mdl-26620089

ABSTRACT

INTRODUCTION: Hip shape is a risk factor for the development of hip osteoarthritis (OA), and current methods to assess hip shape from radiographs are limited; therefore this study explored current and novel methods to assess hip shape. METHODS: Data from a prior case-control study nested in the Johnston County OA Project were used, including 382 hips (from 342 individuals). Hips were classified by radiographic hip OA (RHOA) status as RHOA cases (baseline Kellgren Lawrence grade [KLG] 0 or 1, follow-up [mean 6 years] KLG ≥ 2) or controls (KLG = 0 or 1 at both baseline and follow-up). Proximal femur shape was assessed using a 60-point model as previously described. The current analysis explored commonly used principal component analysis (PCA), as well as novel statistical methodologies suited to high dimension low sample size settings (Distance Weighted Discrimination [DWD] and Direction-Projection-Permutation [DiProPerm] hypothesis testing) to assess differences between cases and controls. RESULTS: Using these novel methodologies, we were able to better characterize morphologic differences by sex and race. In particular, the proximal femurs of African American women demonstrated significantly different shapes between cases and controls, implying an important role for sex and race in the development of RHOA. Notably, discrimination was improved with the use of DWD and DiProPerm compared to PCA. CONCLUSIONS: DWD with DiProPerm significance testing provides improved discrimination of variation in hip morphology between groups, and enables subgroup analyses even under small sample sizes.


Subject(s)
Black or African American/statistics & numerical data , Hip Joint/pathology , Osteoarthritis, Hip/ethnology , Osteoarthritis, Hip/pathology , Aged , Case-Control Studies , Data Interpretation, Statistical , Female , Femur/diagnostic imaging , Femur/pathology , Hip Joint/diagnostic imaging , Humans , Male , Middle Aged , North Carolina/epidemiology , Osteoarthritis, Hip/diagnostic imaging , Principal Component Analysis , Radiographic Image Interpretation, Computer-Assisted/methods , Radiography/methods , Risk Factors , Sex Factors
6.
Nucleic Acids Res ; 42(14): e113, 2014 Aug.
Article in English | MEDLINE | ID: mdl-25030904

ABSTRACT

High-throughput sequencing technologies, including RNA-seq, have made it possible to move beyond gene expression analysis to study transcriptional events including alternative splicing and gene fusions. Furthermore, recent studies in cancer have suggested the importance of identifying transcriptionally altered loci as biomarkers for improved prognosis and therapy. While many statistical methods have been proposed for identifying novel transcriptional events with RNA-seq, nearly all rely on contrasting known classes of samples, such as tumor and normal. Few tools exist for the unsupervised discovery of such events without class labels. In this paper, we present SigFuge for identifying genomic loci exhibiting differential transcription patterns across many RNA-seq samples. SigFuge combines clustering with hypothesis testing to identify genes exhibiting alternative splicing, or differences in isoform expression. We apply SigFuge to RNA-seq cohorts of 177 lung and 279 head and neck squamous cell carcinoma samples from the Cancer Genome Atlas, and identify several cases of differential isoform usage including CDKN2A, a tumor suppressor gene known to be inactivated in a majority of lung squamous cell tumors. By not restricting attention to known sample stratifications, SigFuge offers a novel approach to unsupervised screening of genetic loci across RNA-seq cohorts. SigFuge is available as an R package through Bioconductor.


Subject(s)
Gene Expression Profiling/methods , High-Throughput Nucleotide Sequencing/methods , Neoplasms/genetics , RNA Isoforms/metabolism , Sequence Analysis, RNA/methods , Software , Alternative Splicing , Carcinoma, Squamous Cell/genetics , Carrier Proteins/genetics , Cluster Analysis , Exons , Genes, p16 , Genetic Loci , Head and Neck Neoplasms/genetics , Intracellular Signaling Peptides and Proteins , Kallikreins/genetics , Lung Neoplasms/genetics , Nuclear Proteins , Squamous Cell Carcinoma of Head and Neck
7.
Stat Sin ; 26(4): 1747-1770, 2016 Oct.
Article in English | MEDLINE | ID: mdl-28018116

ABSTRACT

The aim of this paper is to establish several deep theoretical properties of principal component analysis for multiple-component spike covariance models. Our new results reveal an asymptotic conical structure in critical sample eigendirections under the spike models with distinguishable (or indistinguishable) eigenvalues, when the sample size and/or the number of variables (or dimension) tend to infinity. The consistency of the sample eigenvectors relative to their population counterparts is determined by the ratio between the dimension and the product of the sample size with the spike size. When this ratio converges to a nonzero constant, the sample eigenvector converges to a cone, with a certain angle to its corresponding population eigenvector. In the High Dimension, Low Sample Size case, the angle between the sample eigenvector and its population counterpart converges to a limiting distribution. Several generalizations of the multi-spike covariance models are also explored, and additional theoretical results are presented.
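The role of the ratio between the dimension and the product of sample size and spike size, described in entry 7, can be explored with a small simulation. This is a sketch under stated assumptions, not a reproduction of the paper's results: a single-spike Gaussian model with covariance diag(spike, 1, ..., 1), an uncentered sample covariance, and the hypothetical function name `eigenvector_angle_deg`.

```python
import numpy as np

def eigenvector_angle_deg(d, n, spike, seed=0):
    """Angle (degrees) between the first sample and population eigenvectors.

    Assumed model: n Gaussian samples in dimension d with covariance
    diag(spike, 1, ..., 1), so the population eigenvector is the first axis.
    """
    rng = np.random.default_rng(seed)
    X = rng.standard_normal((n, d))
    X[:, 0] *= np.sqrt(spike)                     # inflate variance on axis 0
    # The first right singular vector of X is the leading eigenvector of X^T X.
    _, _, Vt = np.linalg.svd(X, full_matrices=False)
    cosine = min(abs(Vt[0, 0]), 1.0)
    return float(np.degrees(np.arccos(cosine)))
```

Calling it with a fixed `d` and `n` while varying `spike` shows the angle shrinking as `d / (n * spike)` gets small and approaching 90 degrees when the spike is weak relative to the dimension.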

8.
Nucleic Acids Res ; 41(19): e178, 2013 Oct.
Article in English | MEDLINE | ID: mdl-23935067

ABSTRACT

Identifying variants using high-throughput sequencing data is currently a challenge because true biological variants can be indistinguishable from technical artifacts. One source of technical artifact results from incorrectly aligning experimentally observed sequences to their true genomic origin ('mismapping') and inferring differences in mismapped sequences to be true variants. We developed BlackOPs, an open-source tool that simulates experimental RNA-seq and DNA whole exome sequences derived from the reference genome, aligns these sequences by custom parameters, detects variants and outputs a blacklist of positions and alleles caused by mismapping. Blacklists contain thousands of artifact variants that are indistinguishable from true variants and, for a given sample, are expected to be almost completely false positives. We show that these blacklist positions are specific to the alignment algorithm and read length used, and BlackOPs allows users to generate a blacklist specific to their experimental setup. We queried the dbSNP and COSMIC variant databases and found numerous variants indistinguishable from mapping errors. We demonstrate how filtering against blacklist positions reduces the number of potential false variants using an RNA-seq glioblastoma cell line data set. In summary, accounting for mapping-caused variants tuned to experimental setups reduces false positives and, therefore, improves genome characterization by high-throughput sequencing.


Subject(s)
Genetic Variation , High-Throughput Nucleotide Sequencing/methods , Sequence Alignment/methods , Software , Artifacts , Cell Line, Tumor , Chromosome Mapping , Databases, Nucleic Acid , Exome , Humans , Polymorphism, Single Nucleotide , Sequence Analysis, DNA/methods , Sequence Analysis, RNA/methods
9.
Osteoarthritis Cartilage ; 22(10): 1657-67, 2014 Oct.
Article in English | MEDLINE | ID: mdl-25278075

ABSTRACT

OBJECTIVE: To assess 3D morphological variations and local and systemic biomarker profiles in subjects with a diagnosis of temporomandibular joint osteoarthritis (TMJ OA). DESIGN: Twenty-eight patients with long-term TMJ OA (39.9 ± 16 years), 12 patients at initial diagnosis of OA (47.4 ± 16.1 years), and 12 healthy controls (41.8 ± 12.2 years) were recruited. All patients were female and had cone beam CT scans taken. TMJ arthrocentesis and venipuncture were performed on 12 OA and 12 age-matched healthy controls. Serum and synovial fluid levels of 50 biomarkers of arthritic inflammation were quantified by protein microarrays. Shape Analysis MANCOVA tested statistical correlations between biomarker levels and variations in condylar morphology. RESULTS: Compared with healthy controls, the OA average condyle was significantly smaller in all dimensions except its anterior surface, with areas indicative of bone resorption along the articular surface, particularly in the lateral pole. Synovial fluid levels of ANG, GDF15, TIMP-1, CXCL16, MMP-3 and MMP-7 were significantly correlated with bone apposition of the condylar anterior surface. Serum levels of ENA-78, MMP-3, PAI-1, VE-Cadherin, VEGF, GM-CSF, TGFβ1, IFNγ, TNFα, IL-1α, and IL-6 were significantly correlated with flattening of the lateral pole. Expression levels of ANG were significantly correlated with the articular morphology in healthy controls. CONCLUSIONS: Bone resorption at the articular surface, particularly at the lateral pole was statistically significant at initial diagnosis of TMJ OA. Synovial fluid levels of ANG, GDF15, TIMP-1, CXCL16, MMP-3 and MMP-7 were correlated with bone apposition. Serum levels of ENA-78, MMP-3, PAI-1, VE-Cadherin, VEGF, GM-CSF, TGFβ1, IFNγ, TNFα, IL-1α, and IL-6 were correlated with bone resorption.


Subject(s)
Inflammation Mediators/metabolism , Osteoarthritis/diagnostic imaging , Synovial Fluid/metabolism , Temporomandibular Joint Disorders/diagnostic imaging , Temporomandibular Joint/diagnostic imaging , Adult , Biomarkers/metabolism , Bone Resorption/diagnostic imaging , Bone Resorption/etiology , Case-Control Studies , Cone-Beam Computed Tomography , Female , Humans , Imaging, Three-Dimensional , Middle Aged , Osteoarthritis/complications , Temporomandibular Joint Disorders/complications , Young Adult
10.
J Comput Graph Stat ; 33(2): 736-748, 2024.
Article in English | MEDLINE | ID: mdl-39170642

ABSTRACT

The Population Difference Criterion is proposed to assess the statistical significance of visually observed subpopulation differences. It addresses the following challenges: in high-dimensional contexts, distributional models can be dubious; in high-signal contexts, conventional permutation tests give poor pairwise comparisons. We also make two other contributions. First, a careful analysis shows that a balanced permutation approach is more powerful in high-signal contexts than conventional permutations. Second, we quantify the uncertainty due to permutation variation via a bootstrap confidence interval. The practical usefulness of these ideas is illustrated in the comparison of subpopulations of modern cancer data.

11.
J Med Imaging (Bellingham) ; 11(4): 044006, 2024 Jul.
Article in English | MEDLINE | ID: mdl-39185474

ABSTRACT

Purpose: We address the need for effective stain domain adaptation methods in histopathology to enhance the performance of downstream computational tasks, particularly classification. Existing methods exhibit varying strengths and weaknesses, prompting the exploration of a different approach. The focus is on improving stain color consistency, expanding the stain domain scope, and minimizing the domain gap between image batches. Approach: We introduce a new domain adaptation method, Stain simultaneous augmentation and normalization (SAN), designed to adjust the distribution of stain colors to align with a target distribution. Stain SAN combines the merits of established methods, such as stain normalization, stain augmentation, and stain mix-up, while mitigating their inherent limitations. Stain SAN adapts stain domains by resampling stain color matrices from a well-structured target distribution. Results: Experimental evaluations of cross-dataset clinical estrogen receptor status classification demonstrate the efficacy of Stain SAN and its superior performance compared with existing stain adaptation methods. In one case, the area under the curve (AUC) increased by 11.4%. Overall, our results clearly show the improvements made over the history of the development of these methods culminating with substantial enhancement provided by Stain SAN. Furthermore, we show that Stain SAN achieves results comparable with the state-of-the-art generative adversarial network-based approach without requiring separate training for stain adaptation or access to the target domain during training. Stain SAN's performance is on par with HistAuGAN, proving its effectiveness and computational efficiency. Conclusions: Stain SAN emerges as a promising solution, addressing the potential shortcomings of contemporary stain adaptation methods. 
Its effectiveness is underscored by notable improvements in the context of clinical estrogen receptor status classification, where it achieves the best AUC performance. The findings endorse Stain SAN as a robust approach for stain domain adaptation in histopathology images, with implications for advancing computational tasks in the field.

12.
ArXiv ; 2024 May 16.
Article in English | MEDLINE | ID: mdl-38800658

ABSTRACT

Automated region of interest detection in histopathological image analysis is a challenging and important topic with tremendous potential impact on clinical practice. The deep-learning methods used in computational pathology may help us to reduce costs and increase the speed and accuracy of cancer diagnosis. We started with the UNC Melanocytic Tumor Dataset cohort that contains 160 hematoxylin and eosin whole-slide images of primary melanomas (86) and nevi (74). We randomly assigned 80% (134) as a training set and built an in-house deep-learning method to allow for classification, at the slide level, of nevi and melanomas. The proposed method performed well on the other 20% (26) test dataset; the accuracy of the slide classification task was 92.3% and our model also performed well in terms of predicting the region of interest annotated by the pathologists, showing excellent performance of our model on melanocytic skin tumors. Even though we tested the experiments on the skin tumor dataset, our work could also be extended to other medical image detection problems to benefit the clinical evaluation and diagnosis of different tumors.

13.
Cancers (Basel) ; 16(13)2024 Jun 21.
Article in English | MEDLINE | ID: mdl-39001357

ABSTRACT

High intratumoral heterogeneity is thought to be a poor prognostic indicator. However, the source of heterogeneity may also be important, as genomic heterogeneity is not always reflected in histologic or 'visual' heterogeneity. We aimed to develop a predictor of histologic heterogeneity and evaluate its association with outcomes and molecular heterogeneity. We used VGG16 to train an image classifier to identify unique, patient-specific visual features in 1655 breast tumors (5907 core images) from the Carolina Breast Cancer Study (CBCS). Extracted features for images, as well as the epithelial and stromal image components, were hierarchically clustered, and visual heterogeneity was defined as a greater distance between images from the same patient. We assessed the association between visual heterogeneity, clinical features, and DNA-based molecular heterogeneity using generalized linear models, and we used Cox models to estimate the association between visual heterogeneity and tumor recurrence. Basal-like and ER-negative tumors were more likely to have low visual heterogeneity, as were the tumors from younger and Black women. Less heterogeneous tumors had a higher risk of recurrence (hazard ratio = 1.62, 95% confidence interval = 1.22-2.16), and were more likely to come from patients whose tumors were comprised of only one subclone or had a TP53 mutation. Associations were similar regardless of whether the image was based on stroma, epithelium, or both. Histologic heterogeneity adds complementary information to commonly used molecular indicators, with low heterogeneity predicting worse outcomes. Future work integrating multiple sources of heterogeneity may provide a more comprehensive understanding of tumor progression.

14.
Bioinformatics ; 28(8): 1182-3, 2012 Apr 15.
Article in English | MEDLINE | ID: mdl-22368246

ABSTRACT

UNLABELLED: R/DWD is an extensible package for classification. It is built based on a recently developed powerful classification method called distance weighted discrimination (DWD). DWD is related to, and has been shown to be superior to, the support vector machine in situations that are fundamental to bioinformatics, such as very high dimensional data. DWD has proven to be very useful for several fundamental bioinformatics tasks, including classification, data visualization and removal of biases, such as batch effects. Earlier DWD implementations, however, relied on Matlab, which is not free and requires a license. The major contribution of the R/DWD package is an implementation that is completely in R and thus can be used without any requirements for licensing or software purchase. In addition, R/DWD also provides efficient solvers for second-order-cone-programming and quadratic programming. AVAILABILITY AND IMPLEMENTATION: The package is freely available from cran.r-project.org.


Subject(s)
Computational Biology/methods , Oligonucleotide Array Sequence Analysis/methods , Software , Computer Simulation , Support Vector Machine
15.
Commun Biol ; 6(1): 179, 2023 02 16.
Article in English | MEDLINE | ID: mdl-36797360

ABSTRACT

Model systems are an essential resource in cancer research. They simulate effects that we can extrapolate to humans, but come at a risk of inaccurately representing human biology. This inaccuracy can lead to inconclusive experiments or misleading results, creating the need for an improved process for translating model system findings into human-relevant data. We present a process for applying joint dimension reduction (jDR) to horizontally integrate gene expression data across model systems and human tumor cohorts. We then use this approach to combine human TCGA gene expression data with data from human cancer cell lines and mouse model tumors. By identifying the aspects of genomic variation that act jointly across cohorts, we demonstrate how predictive modeling and clinical biomarkers from model systems can be improved.


Subject(s)
Neoplasms , Transcriptome , Animals , Mice , Humans , Neoplasms/genetics , Neoplasms/pathology , Gene Expression Profiling , Biomarkers
16.
Res Sq ; 2023 Feb 06.
Article in English | MEDLINE | ID: mdl-36798423

ABSTRACT

Background: Modeling of single cell RNA-sequencing (scRNA-seq) data remains challenging due to a high percentage of zeros and data heterogeneity, so improved modeling has strong potential to benefit many downstream data analyses. The existing zero-inflated or over-dispersed models are based on aggregations at either the gene or the cell level. However, they typically lose accuracy due to a too crude aggregation at those two levels. Results: We avoid the crude approximations entailed by such aggregation through proposing an Independent Poisson Distribution (IPD) particularly at each individual entry in the scRNA-seq data matrix. This approach naturally and intuitively models the large number of zeros as matrix entries with a very small Poisson parameter. The critical challenge of cell clustering is approached via a novel data representation as Departures from a simple homogeneous IPD (DIPD) to capture the per-gene-per-cell intrinsic heterogeneity generated by cell clusters. Our experiments using real data and crafted experiments show that using DIPD as a data representation for scRNA-seq data can uncover novel cell subtypes that are missed or can only be found by careful parameter tuning using conventional methods. Conclusions: This new method has multiple advantages, including (1) no need for prior feature selection or manual optimization of hyperparameters; (2) flexibility to combine with and improve upon other methods, such as Seurat. Another novel contribution is the use of crafted experiments as part of the validation of our newly developed DIPD-based clustering pipeline. This new clustering pipeline is implemented in the R (CRAN) package scpoisson.

17.
Front Genet ; 14: 1093326, 2023.
Article in English | MEDLINE | ID: mdl-37007972

ABSTRACT

Advanced genomic and molecular profiling technologies have accelerated the elucidation of the regulatory mechanisms behind cancer development and progression, and of targeted therapies in patients. Along this line, intense studies with immense amounts of biological information have boosted the discovery of molecular biomarkers. Cancer has been one of the leading causes of death around the world in recent years. Elucidation of genomic and epigenetic factors in Breast Cancer (BRCA) can provide a roadmap to uncover the disease mechanisms. Accordingly, unraveling the possible systematic connections between omics data types and their contribution to BRCA tumor progression is crucial. In this study, we have developed a novel machine learning (ML) based integrative approach for multi-omics data analysis. This integrative approach combines information from gene expression (mRNA), microRNA (miRNA) and methylation data. Due to the complexity of cancer, this integrated data is expected to improve the prediction, diagnosis and treatment of disease through patterns only available from the 3-way interactions between these three omics datasets. In addition, the proposed method bridges the interpretation gap between the disease mechanisms that drive onset and progression. Our fundamental contribution is the 3 Multi-omics integrative tool (3Mint). This tool aims to perform grouping and scoring of groups using biological knowledge. Another major goal is improved gene selection via detection of novel groups of cross-omics biomarkers. Performance of 3Mint is assessed using different metrics. Our computational performance evaluations showed that 3Mint classifies the BRCA molecular subtypes using fewer genes than the miRcorrNet tool, which uses miRNA and mRNA gene expression profiles, at similar performance (95% accuracy). The incorporation of methylation data in 3Mint yields a much more focused analysis. 
The 3Mint tool and all other supplementary files are available at https://github.com/malikyousef/3Mint/.

18.
NPJ Breast Cancer ; 9(1): 92, 2023 Nov 11.
Article in English | MEDLINE | ID: mdl-37952058

ABSTRACT

Approaches for rapidly identifying patients at high risk of early breast cancer recurrence are needed. Image-based methods for prescreening hematoxylin and eosin (H&E) stained tumor slides could offer temporal and financial efficiency. We evaluated a data set of 704 1-mm tumor core H&E images (2-4 cores per case), corresponding to 202 participants (101 who recurred; 101 non-recurrent matched on age and follow-up time) from breast cancers diagnosed between 2008-2012 in the Carolina Breast Cancer Study. We leveraged deep learning to extract image information and trained a model to identify recurrence. Cross-validation accuracy for predicting recurrence was 62.4% [95% CI: 55.7, 69.1], similar to grade (65.8% [95% CI: 59.3, 72.3]) and ER status (66.3% [95% CI: 59.8, 72.8]). Interestingly, 70% (19/27) of early-recurrent low-intermediate grade tumors were identified by our image model. Relative to existing markers, image-based analyses provide complementary information for predicting early recurrence.
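As a worked check on the intervals in entry 18, the reported 95% confidence intervals are consistent with a normal-approximation (Wald) interval for a proportion. The abstract does not state how the intervals were computed, so both the Wald method and the use of n = 202 participants as the denominator are assumptions; the function name `wald_interval` is hypothetical.

```python
import math

def wald_interval(p_hat, n, z=1.96):
    """Normal-approximation confidence interval for a proportion."""
    se = math.sqrt(p_hat * (1.0 - p_hat) / n)     # standard error of p_hat
    return p_hat - z * se, p_hat + z * se

# Accuracies from the abstract; n = 202 participants is an assumption.
for label, acc in [("image model", 0.624), ("grade", 0.658), ("ER status", 0.663)]:
    lo, hi = wald_interval(acc, 202)
    print(f"{label}: {100 * lo:.1f} to {100 * hi:.1f}")
```

With these inputs the three intervals round to 55.7 to 69.1, 59.3 to 72.3, and 59.8 to 72.8, matching the values quoted in the abstract. The heavy overlap between them is why the image model is described as similar to grade and ER status.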

19.
Ann Appl Stat ; 17(4): 2924-2943, 2023 Dec.
Article in English | MEDLINE | ID: mdl-38046186

ABSTRACT

In The Cancer Genome Atlas (TCGA) data set, there are many interesting nonlinear dependencies between pairs of genes that reveal important relationships and subtypes of cancer. Such genomic data analysis requires a rapid, powerful and interpretable detection process, especially in a high-dimensional environment. We study the nonlinear patterns among the expression of pairs of genes from TCGA using a powerful tool called Binary Expansion Testing. We find many nonlinear patterns, some of which are driven by known cancer subtypes, some of which are novel.

20.
Osteoarthr Cartil Open ; 5(1): 100334, 2023 Mar.
Article in English | MEDLINE | ID: mdl-36817090

ABSTRACT

Objective: To employ novel methodologies to identify phenotypes in knee OA based on variation among three baseline data blocks: 1) femoral cartilage thickness, 2) tibial cartilage thickness, and 3) participant characteristics and clinical features. Methods: Baseline data were from 3321 Osteoarthritis Initiative (OAI) participants with available cartilage thickness maps (6265 knees) and 77 clinical features. Cartilage maps were obtained from 3D DESS MR images using a deep-learning based segmentation approach and an atlas-based analysis developed by our group. Angle-based Joint and Individual Variation Explained (AJIVE) was used to capture and quantify variation, both shared among multiple data blocks and individual to each block, and to determine statistical significance. Results: Three major modes of variation were shared across the three data blocks. Mode 1 reflected overall thicker cartilage among men, those with higher education, and greater knee forces; Mode 2 showed associations between worsening Kellgren-Lawrence Grade, medial cartilage thinning, and worsening symptoms; and Mode 3 contrasted lateral and medial-predominant cartilage loss associated with BMI and malalignment. Each data block also demonstrated individual, independent modes of variation consistent with the known discordance between symptoms and structure in knee OA and reflecting the importance of features such as physical function, symptoms, and comorbid conditions independent of structural damage. Conclusions: This exploratory analysis, combining the rich OAI dataset with novel methods for determining and visualizing cartilage thickness, reinforces known associations in knee OA while providing insights into the potential for data integration in knee OA phenotyping.
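The angle-based step behind AJIVE in entry 20 can be sketched for data blocks that share the same samples. This is a simplified illustration under assumptions: the per-block signal ranks are supplied by the user, and a fixed `slack` replaces the data-driven threshold a full method would use to decide which directions count as joint. The function name `joint_rank` is hypothetical.

```python
import numpy as np

def joint_rank(blocks, ranks, slack=0.1):
    """Count directions shared by the score subspaces of several blocks.

    Each block is samples-by-features with the same samples in the same order.
    The squared singular values of the stacked score bases lie between 0 and
    the number of blocks; values near the number of blocks indicate a
    direction present in every block.
    """
    bases = []
    for X, r in zip(blocks, ranks):
        U, _, _ = np.linalg.svd(np.asarray(X, dtype=float), full_matrices=False)
        bases.append(U[:, :r])                    # orthonormal score basis
    stacked = np.hstack(bases)
    sq = np.linalg.svd(stacked, compute_uv=False) ** 2
    # `slack` is an illustrative cutoff, not a calibrated threshold.
    return int(np.sum(sq > len(blocks) - slack)), sq
```

For two blocks, each squared singular value equals one plus or minus the cosine of a principal angle between the two score subspaces, so a value near 2 corresponds to an angle near zero, that is, a shared mode of variation.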
