Results 1 - 20 of 2,190
1.
Cell ; 185(1): 184-203.e19, 2022 01 06.
Article in English | MEDLINE | ID: mdl-34963056

ABSTRACT

Cancers display significant heterogeneity with respect to tissue of origin, driver mutations, and other features of the surrounding tissue. It is likely that individual tumors engage common patterns of the immune system (here, "archetypes"), creating prototypical non-destructive tumor immune microenvironments (TMEs) and modulating tumor-targeting. To discover the dominant immune system archetypes, the University of California, San Francisco (UCSF) Immunoprofiler Initiative (IPI) processed 364 individual tumors across 12 cancer types using standardized protocols. Computational clustering of flow cytometry and transcriptomic data obtained from cell sub-compartments uncovered dominant patterns of immune composition across cancers. These archetypes were profound insofar as they also differentiated tumors based upon unique immune and tumor gene-expression patterns. They also partitioned well-established classifications of tumor biology. The IPI resource provides a template for understanding cancer immunity as a collection of dominant patterns of immune organization and provides a rational path forward to learn how to modulate these to improve therapy.


Subjects
Censuses , Neoplasms/genetics , Neoplasms/immunology , Transcriptome/genetics , Tumor Microenvironment/immunology , Tumor Biomarkers , Cluster Analysis , Cohort Studies , Computational Biology/methods , Flow Cytometry/methods , Neoplastic Gene Expression Regulation , Humans , Neoplasms/classification , Neoplasms/pathology , RNA-Seq/methods , San Francisco , Universities
2.
Mol Cell ; 78(1): 96-111.e6, 2020 04 02.
Article in English | MEDLINE | ID: mdl-32105612

ABSTRACT

Current models suggest that chromosome domains segregate into either an active (A) or inactive (B) compartment. B-compartment chromatin is physically separated from the A compartment and compacted by the nuclear lamina. To examine these models in the developmental context of C. elegans embryogenesis, we undertook chromosome tracing to map the trajectories of entire autosomes. Early embryonic chromosomes organized into an unconventional barbell-like configuration, with two densely folded B compartments separated by a central A compartment. Upon gastrulation, this conformation matured into conventional A/B compartments. We used unsupervised clustering to uncover subpopulations with differing folding properties and variable positioning of compartment boundaries. These conformations relied on tethering to the lamina to stretch the chromosome; detachment from the lamina compacted, and allowed intermingling between, A/B compartments. These findings reveal the diverse conformations of early embryonic chromosomes and uncover a previously unappreciated role for the lamina in systemic chromosome stretching.


Subjects
Caenorhabditis elegans/genetics , Chromosomes/chemistry , Nuclear Lamina/physiology , Animals , Caenorhabditis elegans/embryology , Chromosomes/ultrastructure , Nonmammalian Embryo/ultrastructure , Gastrulation/genetics , Fluorescence In Situ Hybridization , Molecular Conformation
3.
Proc Natl Acad Sci U S A ; 121(37): e2319804121, 2024 Sep 10.
Article in English | MEDLINE | ID: mdl-39226356

ABSTRACT

The rapid growth of large-scale spatial gene expression data demands efficient and reliable computational tools to extract major trends of gene expression in their native spatial context. Here, we used stability-driven unsupervised learning (i.e., staNMF) to identify principal patterns (PPs) of 3D gene expression profiles and understand spatial gene distribution and anatomical localization at the whole mouse brain level. Our subsequent spatial correlation analysis systematically compared the PPs to known anatomical regions and ontology from the Allen Mouse Brain Atlas using spatial neighborhoods. We demonstrate that our stable and spatially coherent PPs, whose linear combinations accurately approximate the spatial gene data, are highly correlated with combinations of expert-annotated brain regions. These PPs yield a brain ontology based purely on spatial gene expression. Our PP identification approach outperforms principal component analysis and typical clustering algorithms on the same task. Moreover, we show that the stable PPs reveal marked regional imbalance of brainwide genetic architecture, leading to region-specific marker genes and gene coexpression networks. Our findings highlight the advantages of stability-driven machine learning for plausible biological discovery from dense spatial gene expression data, streamlining tasks that are infeasible by conventional manual approaches.


Subjects
Brain , Animals , Mice , Brain/metabolism , Gene Expression Profiling/methods , Transcriptome , Algorithms , Unsupervised Machine Learning , Gene Ontology , Atlases as Topic , Gene Regulatory Networks , Principal Component Analysis
4.
Proc Natl Acad Sci U S A ; 121(33): e2403771121, 2024 Aug 13.
Article in English | MEDLINE | ID: mdl-39110730

ABSTRACT

Complex systems are typically characterized by intricate internal dynamics that are often hard to elucidate. Ideally, this requires methods that can detect and classify, in an unsupervised way, the microscopic dynamical events occurring in the system. However, decoupling statistically relevant fluctuations from the internal noise most often remains nontrivial. Here, we describe "Onion Clustering": a simple, iterative unsupervised clustering method that efficiently detects and classifies statistically relevant fluctuations in noisy time-series data. We demonstrate its efficiency by analyzing simulation and experimental trajectories of various systems with complex internal dynamics, ranging from the atomic- to the microscopic-scale, in- and out-of-equilibrium. The method is based on an iterative detect-classify-archive approach. In a similar way as peeling the external (evident) layer of an onion reveals the internal hidden ones, the method performs a first detection/classification of the most populated dynamical environment in the system and of its characteristic noise. The signal of this dynamical cluster is then removed from the time-series data and the remaining part, cleared of its noise, is analyzed again. At every iteration, the detection of hidden dynamical subdomains is facilitated by an increasing (and adaptive) relevance-to-noise ratio. The process iterates until no new dynamical domains can be uncovered, revealing, as an output, the number of clusters that can be effectively distinguished/classified in a statistically robust way as a function of the time-resolution of the analysis. Onion Clustering is general and benefits from clear-cut physical interpretability. We expect that it will help in analyzing a variety of complex dynamical systems and time-series data.
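
Editorial illustration: the detect-classify-archive loop described in this abstract can be sketched in a few lines of Python. This is a simplified, value-only toy and not the authors' implementation: it ignores the time-resolution windows the abstract mentions, and the function name `peel_clusters`, its parameters, the half-maximum window rule and the synthetic signal are all assumptions made for the example.

```python
import random


def peel_clusters(values, bins=50, n_sigma=3.0, min_size=20, max_iter=10):
    """Iteratively peel the most populated value range out of `values`.

    Returns one (mean, noise, members) tuple per detected cluster.
    """
    remaining = list(values)
    lo, hi = min(remaining), max(remaining)
    if hi == lo:
        return [(lo, 0.0, remaining)]
    width = (hi - lo) / bins  # bin edges are fixed once, on the full signal
    clusters = []
    for _ in range(max_iter):
        if len(remaining) < min_size:
            break
        # Detect: histogram what is left and locate the highest peak.
        counts = [0] * bins
        for v in remaining:
            counts[min(int((v - lo) / width), bins - 1)] += 1
        peak = counts.index(max(counts))
        # Widen the window while neighbouring bins stay above half the peak height.
        left = right = peak
        while left > 0 and counts[left - 1] >= counts[peak] / 2:
            left -= 1
        while right < bins - 1 and counts[right + 1] >= counts[peak] / 2:
            right += 1
        start, stop = lo + left * width, lo + (right + 1) * width
        core = [v for v in remaining if start <= v <= stop]
        if not core:
            break
        # Classify: centre from the core, noise from the full width at half maximum
        # (FWHM = 2.355 * sigma for a Gaussian).
        mu = sum(core) / len(core)
        sigma = (stop - start) / 2.355
        members = [v for v in remaining if abs(v - mu) <= n_sigma * sigma]
        if len(members) < min_size:
            break
        # Archive: store the cluster and strip its signal before the next pass.
        clusters.append((mu, sigma, members))
        remaining = [v for v in remaining if abs(v - mu) > n_sigma * sigma]
    return clusters


# Synthetic signal (an assumption for the demo): two noisy environments.
rng = random.Random(0)
signal = [rng.gauss(0.0, 0.1) for _ in range(600)]
signal += [rng.gauss(5.0, 0.1) for _ in range(300)]
clusters = peel_clusters(signal)
for mu, sigma, members in clusters:
    print(f"mean={mu:.2f} noise={sigma:.2f} size={len(members)}")
```

The most populated environment is archived first, and the less populated one only becomes the dominant peak once the first has been removed, which is the "peeling" idea in miniature.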

5.
Proc Natl Acad Sci U S A ; 121(37): e2400002121, 2024 Sep 10.
Article in English | MEDLINE | ID: mdl-39226348

ABSTRACT

Single-cell RNA sequencing (scRNA-seq) data, susceptible to noise arising from biological variability and technical errors, can distort gene expression analysis and impact cell similarity assessments, particularly in heterogeneous populations. Current methods, including deep learning approaches, often struggle to accurately characterize cell relationships due to this inherent noise. To address these challenges, we introduce scAMF (Single-cell Analysis via Manifold Fitting), a framework designed to enhance clustering accuracy and data visualization in scRNA-seq studies. At the heart of scAMF lies the manifold fitting module, which effectively denoises scRNA-seq data by unfolding their distribution in the ambient space. This unfolding aligns the gene expression vector of each cell more closely with its underlying structure, bringing it spatially closer to other cells of the same cell type. To comprehensively assess the impact of scAMF, we compile a collection of 25 publicly available scRNA-seq datasets spanning various sequencing platforms, species, and organ types, forming an extensive RNA data bank. In our comparative studies, benchmarking scAMF against existing scRNA-seq analysis algorithms in this data bank, we consistently observe that scAMF outperforms in terms of clustering efficiency and data visualization clarity. Further experimental analysis reveals that this enhanced performance stems from scAMF's ability to improve the spatial distribution of the data and capture class-consistent neighborhoods. These findings underscore the promising application potential of manifold fitting as a tool in scRNA-seq analysis, signaling a significant enhancement in the precision and reliability of data interpretation in this critical field of study.


Subjects
Single-Cell Analysis , Single-Cell Analysis/methods , Cluster Analysis , Humans , RNA Sequence Analysis/methods , Animals , Algorithms , RNA/genetics , Gene Expression Profiling/methods , RNA-Seq/methods
6.
J Cell Sci ; 137(20)2024 10 15.
Article in English | MEDLINE | ID: mdl-38738282

ABSTRACT

Advances in imaging, segmentation and tracking have led to the routine generation of large and complex microscopy datasets. New tools are required to process this 'phenomics' type data. Here, we present 'Cell PLasticity Analysis Tool' (cellPLATO), a Python-based analysis software designed for measurement and classification of cell behaviours based on clustering features of cell morphology and motility. Used after segmentation and tracking, the tool extracts features from each cell per timepoint, using them to segregate cells into dimensionally reduced behavioural subtypes. Resultant cell tracks describe a 'behavioural ID' at each timepoint, and similarity analysis allows the grouping of behavioural sequences into discrete trajectories with assigned IDs. Here, we use cellPLATO to investigate the role of IL-15 in modulating human natural killer (NK) cell migration on ICAM-1 or VCAM-1. We find eight behavioural subsets of NK cells based on their shape and migration dynamics between single timepoints, and four trajectories based on sequences of these behaviours over time. Therefore, by using cellPLATO, we show that IL-15 increases plasticity between cell migration behaviours and that different integrin ligands induce different forms of NK cell migration.


Subjects
Cell Movement , Interleukin-15 , Natural Killer Cells , Humans , Natural Killer Cells/cytology , Natural Killer Cells/metabolism , Natural Killer Cells/immunology , Interleukin-15/metabolism , Software , Intercellular Adhesion Molecule-1/metabolism , Vascular Cell Adhesion Molecule-1/metabolism
7.
Brief Bioinform ; 25(4)2024 May 23.
Article in English | MEDLINE | ID: mdl-38975893

ABSTRACT

The process of drug discovery is widely known to be lengthy and resource-intensive. Artificial Intelligence approaches bring hope for accelerating the identification of molecules with the necessary properties for drug development. Drug-likeness assessment is crucial for the virtual screening of candidate drugs. However, traditional methods like Quantitative Estimation of Drug-likeness (QED) struggle to distinguish between drug and non-drug molecules accurately. Additionally, some deep learning-based binary classification models heavily rely on selecting training negative sets. To address these challenges, we introduce a novel unsupervised learning framework called DrugMetric, an innovative framework for quantitatively assessing drug-likeness based on the chemical space distance. DrugMetric blends the powerful learning ability of variational autoencoders with the discriminative ability of the Gaussian Mixture Model. This synergy enables DrugMetric to identify significant differences in drug-likeness across different datasets effectively. Moreover, DrugMetric incorporates principles of ensemble learning to enhance its predictive capabilities. Upon testing over a variety of tasks and datasets, DrugMetric consistently showcases superior scoring and classification performance. It excels in quantifying drug-likeness and accurately distinguishing candidate drugs from non-drugs, surpassing traditional methods including QED. This work highlights DrugMetric as a practical tool for drug-likeness scoring, facilitating the acceleration of virtual drug screening, and has potential applications in other biochemical fields.


Subjects
Drug Discovery , Drug Discovery/methods , Pharmaceutical Preparations/chemistry , Pharmaceutical Preparations/classification , Algorithms , Deep Learning , Artificial Intelligence
8.
Brief Bioinform ; 25(2)2024 Jan 22.
Article in English | MEDLINE | ID: mdl-38483256

ABSTRACT

Numerous imaging techniques are available for observing and interrogating biological samples, and several of them can be used consecutively to enable correlative analysis of different image modalities with varying resolutions and the inclusion of structural or molecular information. Achieving accurate registration of multimodal images is essential for the correlative analysis process, but it remains a challenging computer vision task with no widely accepted solution. Moreover, supervised registration methods require annotated data produced by experts, which is limited. To address this challenge, we propose a general unsupervised pipeline for multimodal image registration using deep learning. We provide a comprehensive evaluation of the proposed pipeline versus the current state-of-the-art image registration and style transfer methods on four types of biological problems utilizing different microscopy modalities. We found that style transfer of modality domains paired with fully unsupervised training leads to comparable image registration accuracy to supervised methods and, most importantly, does not require human intervention.


Subjects
Deep Learning , Humans , Microscopy
9.
Brief Bioinform ; 25(4)2024 May 23.
Article in English | MEDLINE | ID: mdl-38819253

ABSTRACT

Spatially resolved transcriptomics (SRT) has emerged as a powerful tool for investigating gene expression in spatial contexts, providing insights into the molecular mechanisms underlying organ development and disease pathology. However, the expression sparsity poses a computational challenge to integrate other modalities (e.g. histological images and spatial locations) that are simultaneously captured in SRT datasets for spatial clustering and variation analyses. In this study, to meet such a challenge, we propose multi-modal domain adaption for spatial transcriptomics (stMDA), a novel multi-modal unsupervised domain adaptation method, which integrates gene expression and other modalities to reveal the spatial functional landscape. Specifically, stMDA first learns the modality-specific representations from spatial multi-modal data using multiple neural network architectures and then aligns the spatial distributions across modal representations to integrate these multi-modal representations, thus facilitating the integration of global and spatially local information and improving the consistency of clustering assignments. Our results demonstrate that stMDA outperforms existing methods in identifying spatial domains across diverse platforms and species. Furthermore, stMDA excels in identifying spatially variable genes with high prognostic potential in cancer tissues. In conclusion, stMDA as a new tool of multi-modal data integration provides a powerful and flexible framework for analyzing SRT datasets, thereby advancing our understanding of intricate biological systems.


Subjects
Gene Expression Profiling , Transcriptome , Humans , Gene Expression Profiling/methods , Cluster Analysis , Computational Biology/methods , Computer Neural Networks , Neoplasms/genetics , Algorithms
10.
Brief Bioinform ; 25(2)2024 Jan 22.
Article in English | MEDLINE | ID: mdl-38349057

ABSTRACT

Efficient and accurate recognition of protein-DNA interactions is vital for understanding the molecular mechanisms of related biological processes and further guiding drug discovery. Although the current experimental protocols are the most precise way to determine protein-DNA binding sites, they tend to be labor-intensive and time-consuming. There is an immediate need to design efficient computational approaches for predicting DNA-binding sites. Here, we proposed ULDNA, a new deep-learning model, to deduce DNA-binding sites from protein sequences. This model leverages an LSTM-attention architecture, embedded with three unsupervised language models that are pre-trained on large-scale sequences from multiple database sources. To prove its effectiveness, ULDNA was tested on 229 protein chains with experimental annotation of DNA-binding sites. Results from computational experiments revealed that ULDNA significantly improves the accuracy of DNA-binding site prediction in comparison with 17 state-of-the-art methods. In-depth data analyses showed that the major strength of ULDNA stems from employing three transformer language models. Specifically, these language models capture complementary feature embeddings with evolution diversity, in which the complex DNA-binding patterns are buried. Meanwhile, the specially crafted LSTM-attention network effectively decodes evolution diversity-based embeddings as DNA-binding results at the residue level. Our findings demonstrated a new pipeline for predicting DNA-binding sites on a large scale with high accuracy from protein sequence alone.


Subjects
Data Analysis , Language , Binding Sites , Amino Acid Sequence , Factual Databases
11.
Brief Bioinform ; 25(2)2024 Jan 22.
Article in English | MEDLINE | ID: mdl-38487848

ABSTRACT

The major histocompatibility complex (MHC) encodes a range of immune response genes, including the human leukocyte antigens (HLAs) in humans. These molecules bind peptide antigens and present them on the cell surface for T cell recognition. The repertoires of peptides presented by HLA molecules are termed immunopeptidomes. The highly polymorphic nature of the genes that encode the HLA molecules confers allotype-specific differences in the sequences of bound ligands. Allotype-specific ligand preferences are often defined by peptide-binding motifs. Individuals express up to six classical class I HLA allotypes, which likely present peptides displaying different binding motifs. Such complex datasets make the deconvolution of immunopeptidomic data into allotype-specific contributions and further dissection of binding-specificities challenging. Herein, we developed MHCpLogics as an interactive machine learning-based tool for mining peptide-binding sequence motifs and visualization of immunopeptidome data across complex datasets. We showcase the functionalities of MHCpLogics by analyzing both in-house and published mono- and multi-allelic immunopeptidomics data. The visualization modalities of MHCpLogics allow users to inspect clustered sequences down to individual peptide components and to examine broader sequence patterns within multiple immunopeptidome datasets. MHCpLogics can deconvolute large immunopeptidome datasets enabling the interrogation of clusters for the segregation of allotype-specific peptide sequence motifs, identification of sub-peptidome motifs, and the exportation of clustered peptide sequence lists. The tool facilitates rapid inspection of immunopeptidomes as a resource for the immunology and vaccine communities. MHCpLogics is a standalone application available via an executable installation at: https://github.com/PurcellLab/MHCpLogics.


Subjects
Data Visualization , Peptides , Humans , Peptides/chemistry , HLA Antigens/genetics , Histocompatibility Antigens , Machine Learning , Cluster Analysis
12.
Proc Natl Acad Sci U S A ; 120(15): e2213149120, 2023 04 11.
Article in English | MEDLINE | ID: mdl-37027429

ABSTRACT

Cryoelectron tomography directly visualizes heterogeneous macromolecular structures in their native and complex cellular environments. However, existing computer-assisted structure sorting approaches are low throughput or inherently limited due to their dependency on available templates and manual labels. Here, we introduce a high-throughput template-and-label-free deep learning approach, Deep Iterative Subtomogram Clustering Approach (DISCA), that automatically detects subsets of homogeneous structures by learning and modeling 3D structural features and their distributions. Evaluation on five experimental cryo-ET datasets shows that an unsupervised deep learning based method can detect diverse structures with a wide range of molecular sizes. This unsupervised detection paves the way for systematic unbiased recognition of macromolecular complexes in situ.


Subjects
Electron Microscope Tomography , Computer-Assisted Image Processing , Computer-Assisted Image Processing/methods , Cluster Analysis , Molecular Structure , Electron Microscope Tomography/methods , Macromolecular Substances/chemistry , Cryoelectron Microscopy/methods
13.
J Neurosci ; 44(5)2024 Jan 31.
Article in English | MEDLINE | ID: mdl-37989593

ABSTRACT

Scientists have long conjectured that the neocortex learns patterns in sensory data to generate top-down predictions of upcoming stimuli. In line with this conjecture, different responses to pattern-matching vs pattern-violating visual stimuli have been observed in both spiking and somatic calcium imaging data. However, it remains unknown whether these pattern-violation signals are different between the distal apical dendrites, which are heavily targeted by top-down signals, and the somata, where bottom-up information is primarily integrated. Furthermore, it is unknown how responses to pattern-violating stimuli evolve over time as an animal gains more experience with them. Here, we address these unanswered questions by analyzing responses of individual somata and dendritic branches of layer 2/3 and layer 5 pyramidal neurons tracked over multiple days in primary visual cortex of awake, behaving female and male mice. We use sequences of Gabor patches with patterns in their orientations to create pattern-matching and pattern-violating stimuli, and two-photon calcium imaging to record neuronal responses. Many neurons in both layers show large differences between their responses to pattern-matching and pattern-violating stimuli. Interestingly, these responses evolve in opposite directions in the somata and distal apical dendrites, with somata becoming less sensitive to pattern-violating stimuli and distal apical dendrites more sensitive. These differences between the somata and distal apical dendrites may be important for hierarchical computation of sensory predictions and learning, since these two compartments tend to receive bottom-up and top-down information, respectively.


Subjects
Calcium , Neocortex , Male , Female , Mice , Animals , Calcium/physiology , Neurons/physiology , Dendrites/physiology , Pyramidal Cells/physiology , Neocortex/physiology
14.
Brief Bioinform ; 24(1)2023 01 19.
Article in English | MEDLINE | ID: mdl-36458445

ABSTRACT

Deciphering 3D genome conformation is important for understanding gene regulation and cellular function at a spatial level. Recent advances in single cell Hi-C technologies have enabled the profiling of the 3D architecture of DNA within individual cells, which allows us to study the cell-to-cell variability of 3D chromatin organization. Computational approaches are urgently needed to comprehensively analyze the sparse and heterogeneous single cell Hi-C data. Here, we proposed scDEC-Hi-C, a new framework for single cell Hi-C analysis with deep generative neural networks. scDEC-Hi-C outperforms existing methods in terms of single cell Hi-C data clustering and imputation. Moreover, the generative power of scDEC-Hi-C could help unveil the differences of chromatin architecture across cell types. We expect that scDEC-Hi-C could shed light on deepening our understanding of the complex mechanism underlying the formation of chromatin contacts.


Subjects
Chromatin , Chromosomes , Chromatin/genetics , Genome , DNA , Cluster Analysis
15.
Brief Bioinform ; 24(2)2023 03 19.
Article in English | MEDLINE | ID: mdl-36723605

ABSTRACT

Identifying gene regulatory networks (GRNs) at the resolution of single cells has long been a great challenge, and the advent of single-cell multi-omics data provides unprecedented opportunities to construct GRNs. Here, we propose a novel strategy that integrates omics datasets from single-cell ribonucleic acid sequencing and single-cell Assay for Transposase-Accessible Chromatin using sequencing, and uses an unsupervised learning neural network to divide the samples with high copy number variation scores, which are used to infer the GRN in each gene block. Accuracy validation of the proposed strategy shows that, according to TRRUST, approximately 80% of transcription factors are directly associated with cancer, colorectal cancer, malignancy and disease, and that, according to the RegNetwork database, most transcription factors are prone to produce multiple transcript variants and lead to tumorigenesis. The source code is available at: https://github.com/Cuily-v/Colorectal_cancer.


Subjects
Colorectal Neoplasms , Gene Regulatory Networks , Humans , Multiomics , DNA Copy Number Variations , Algorithms , Transcription Factors/genetics , Colorectal Neoplasms/genetics
16.
Brief Bioinform ; 24(6)2023 09 22.
Article in English | MEDLINE | ID: mdl-37974508

ABSTRACT

Current methods of molecular image-based drug discovery face two major challenges: (1) working effectively in the absence of labels, and (2) capturing chemical structure from implicitly encoded images. Given that chemical structures are explicitly encoded by molecular graphs (such as nitrogen, benzene rings and double bonds), we leverage self-supervised contrastive learning to transfer chemical knowledge from graphs to images. Specifically, we propose a novel Contrastive Graph-Image Pre-training (CGIP) framework for molecular representation learning, which learns explicit information in graphs and implicit information in images from large-scale unlabeled molecules via carefully designed intra- and inter-modal contrastive learning. We evaluate the performance of CGIP on multiple experimental settings (molecular property prediction, cross-modal retrieval and distribution similarity), and the results show that CGIP can achieve state-of-the-art performance on all 12 benchmark datasets and demonstrate that CGIP transfers chemical knowledge in graphs to molecular images, enabling the image encoder to perceive chemical structures in images. We hope this simple and effective framework will inspire people to think about the value of images for molecular representation learning.


Subjects
Benchmarking , Learning , Humans , Drug Discovery
17.
Proc Natl Acad Sci U S A ; 119(8)2022 02 22.
Article in English | MEDLINE | ID: mdl-35181603

ABSTRACT

High-frequency (HF) signals are ubiquitous in the industrial world and are of great use for monitoring of industrial assets. Most deep-learning tools are designed for inputs of fixed and/or very limited size and many successful applications of deep learning to the industrial context use as inputs extracted features, which are a manually and often arduously obtained compact representation of the original signal. In this paper, we propose a fully unsupervised deep-learning framework that is able to extract a meaningful and sparse representation of raw HF signals. We embed in our architecture important properties of the fast discrete wavelet transform (FDWT) such as 1) the cascade algorithm; 2) the conjugate quadrature filter property that links together the wavelet, the scaling, and transposed filter functions; and 3) the coefficient denoising. Using deep learning, we make this architecture fully learnable: Both the wavelet bases and the wavelet coefficient denoising become learnable. To achieve this objective, we propose an activation function that performs a learnable hard thresholding of the wavelet coefficients. With our framework, the denoising FDWT becomes a fully learnable unsupervised tool that does not require any type of pre- or postprocessing or any prior knowledge on wavelet transform. We demonstrate the benefits of embedding all these properties on three machine-learning tasks performed on open-source sound datasets. We perform an ablation study of the impact of each property on the performance of the architecture, achieve results well above baseline, and outperform other state-of-the-art methods.
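
Editorial illustration: two of the ingredients named in this abstract, the cascade algorithm and hard thresholding of wavelet coefficients, can be shown with a fixed Haar wavelet. This is a minimal sketch under stated assumptions, not the paper's architecture: there the filters and thresholds are learned, whereas here the Haar filters and the threshold `tau` are fixed by hand, and all function names are hypothetical.

```python
import math

SQRT2 = math.sqrt(2.0)


def haar_step(x):
    """One cascade level: split x into approximation and detail coefficients."""
    approx = [(x[i] + x[i + 1]) / SQRT2 for i in range(0, len(x), 2)]
    detail = [(x[i] - x[i + 1]) / SQRT2 for i in range(0, len(x), 2)]
    return approx, detail


def haar_inverse_step(approx, detail):
    """Undo haar_step by rebuilding each pair of samples."""
    x = []
    for a, d in zip(approx, detail):
        x.extend(((a + d) / SQRT2, (a - d) / SQRT2))
    return x


def hard_threshold(coeffs, tau):
    """Zero every coefficient with magnitude below tau; keep the others unchanged."""
    return [c if abs(c) >= tau else 0.0 for c in coeffs]


def denoise(signal, levels, tau):
    """Haar cascade, hard thresholding of details, then reconstruction.

    len(signal) must be divisible by 2 ** levels.
    """
    approx, details = list(signal), []
    for _ in range(levels):
        # The cascade re-applies the same filter pair to the approximation.
        approx, detail = haar_step(approx)
        details.append(hard_threshold(detail, tau))
    for detail in reversed(details):
        approx = haar_inverse_step(approx, detail)
    return approx


# Toy input (an assumption for the demo): a step with small perturbations.
noisy = [1.0, 1.1, 0.9, 1.0, 5.0, 5.1, 4.9, 5.0]
clean = denoise(noisy, levels=2, tau=0.5)
same = denoise(noisy, levels=2, tau=0.0)  # no thresholding: perfect reconstruction
print([round(v, 3) for v in clean])
```

With `tau=0.0` the transform is inverted exactly, which is a quick way to check that thresholding is the only lossy step.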

18.
BMC Bioinformatics ; 25(1): 42, 2024 Jan 25.
Article in English | MEDLINE | ID: mdl-38273275

ABSTRACT

BACKGROUND: The clustering of immune repertoire data is challenging due to the computational cost associated with a very large number of pairwise sequence comparisons. To overcome this limitation, we developed Anchor Clustering, an unsupervised clustering method designed to identify similar sequences from millions of antigen receptor gene sequences. First, a Point Packing algorithm is used to identify a set of maximally spaced anchor sequences. Then, the genetic distance of the remaining sequences to all anchor sequences is calculated and transformed into distance vectors. Finally, distance vectors are clustered using unsupervised clustering. This process is repeated iteratively until the resulting clusters are small enough so that pairwise distance comparisons can be performed. RESULTS: Our results demonstrate that Anchor Clustering is faster than existing pairwise comparison clustering methods while providing similar clustering quality. With its flexible, memory-saving strategy, Anchor Clustering is capable of clustering millions of antigen receptor gene sequences in just a few minutes. CONCLUSIONS: This method enables the meta-analysis of immune-repertoire data from different studies and could contribute to a more comprehensive understanding of the immune repertoire data space.
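
Editorial illustration: the three steps in this abstract (pick well-spaced anchors, turn each sequence into a vector of distances to the anchors, cluster those vectors) can be sketched as a single pass. Several choices below are assumptions made for the example and are not taken from the paper: greedy farthest-point selection stands in for the Point Packing algorithm, Hamming distance on equal-length strings stands in for the genetic distance, a small k-means stands in for the unspecified clustering step, and the toy sequences are invented. The iterative repetition on large clusters is omitted.

```python
def hamming(a, b):
    """Number of mismatching positions between two equal-length sequences."""
    return sum(x != y for x, y in zip(a, b))


def pick_anchors(seqs, k):
    """Greedy farthest-point selection: each new anchor is the sequence
    farthest from all anchors chosen so far."""
    anchors = [0]
    nearest = [hamming(s, seqs[0]) for s in seqs]
    while len(anchors) < k:
        idx = max(range(len(seqs)), key=nearest.__getitem__)
        anchors.append(idx)
        nearest = [min(d, hamming(s, seqs[idx])) for d, s in zip(nearest, seqs)]
    return anchors


def distance_vectors(seqs, anchors):
    """Represent each sequence by its distances to the anchor sequences."""
    return [[hamming(s, seqs[a]) for a in anchors] for s in seqs]


def kmeans(vectors, centers, iterations=10):
    """Plain k-means on the distance vectors, starting from the given centers."""
    centers = [list(c) for c in centers]
    labels = [0] * len(vectors)
    for _ in range(iterations):
        labels = [
            min(range(len(centers)),
                key=lambda c: sum((x - y) ** 2 for x, y in zip(v, centers[c])))
            for v in vectors
        ]
        for c in range(len(centers)):
            members = [v for v, lab in zip(vectors, labels) if lab == c]
            if members:
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels


# Invented toy sequences: three families of near-identical strings.
seqs = ["AAAAAAAA", "AAAAAAAT", "AAAAAATT",
        "CCCCCCCC", "CCCCCCCG",
        "GGGGTTTT", "GGGGTTTA"]
anchors = pick_anchors(seqs, k=3)
vectors = distance_vectors(seqs, anchors)
labels = kmeans(vectors, [vectors[a] for a in anchors])
print(anchors, labels)
```

The point of the design is cost: each sequence is compared with `k` anchors rather than with every other sequence, so the number of distance computations grows with `n * k` instead of `n * n`.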


Subjects
Algorithms , Antigen Receptors , Cluster Analysis
19.
BMC Bioinformatics ; 25(1): 58, 2024 Feb 05.
Article in English | MEDLINE | ID: mdl-38317062

ABSTRACT

BACKGROUND: Data from microbiomes from multiple niches is often collected, but methods to analyse these often ignore associations between niches. One interesting case is that of the oral microbiome. Its composition is receiving increasing attention due to reports on its associations with general health. While the oral cavity includes different niches, multi-niche microbiome data analysis is conducted using a single niche at a time and, therefore, ignores other niches that could act as confounding variables. Understanding the interaction between niches would assist interpretation of the results, and help improve our understanding of multi-niche microbiomes. METHODS: In this study, we used a machine learning technique called latent Dirichlet allocation (LDA) on two microbiome datasets consisting of several niches. LDA was used on both individual niches and all niches simultaneously. On individual niches, LDA was used to decompose each niche into bacterial sub-communities unveiling their taxonomic structure. These sub-communities were then used to assess the relationship between microbial niches using the global test. On all niches simultaneously, LDA allowed us to extract meaningful microbial patterns. Sets of co-occurring operational taxonomic units (OTUs) comprising those patterns were then used to predict the original location of each sample. RESULTS: Our approach showed that the per-niche sub-communities displayed a strong association between supragingival plaque and saliva, as well as between the anterior and posterior tongue. In addition, the LDA-derived microbial signatures were able to predict the original sample niche illustrating the meaningfulness of our sub-communities. For the multi-niche oral microbiome dataset we had an overall accuracy of 76%, and per-niche sensitivity of up to 83%. 
Finally, for a second multi-niche microbiome dataset from the entire body, microbial niches from the oral cavity displayed stronger associations to each other than with those from other parts of the body, such as niches within the vagina and the skin. CONCLUSION: Our LDA-based approach produces sets of co-occurring taxa that can describe niche composition. LDA-derived microbial signatures can also be instrumental in summarizing microbiome data, for both descriptions as well as prediction.


Subjects
Microbiota , Female , Humans , Mouth/microbiology , Bacteria/genetics , Saliva , Skin/microbiology
20.
Diabetologia ; 67(8): 1552-1566, 2024 Aug.
Article in English | MEDLINE | ID: mdl-38801521

ABSTRACT

AIMS/HYPOTHESIS: Gestational diabetes mellitus (GDM) is a heterogeneous condition. Given such variability among patients, the ability to recognise distinct GDM subgroups using routine clinical variables may guide more personalised treatments. Our main aim was to identify distinct GDM subtypes through cluster analysis using routine clinical variables, and analyse treatment needs and pregnancy outcomes across these subgroups. METHODS: In this cohort study, we analysed datasets from a total of 2682 women with GDM treated at two central European hospitals (1865 participants from Charité University Hospital in Berlin and 817 participants from the Medical University of Vienna), collected between 2015 and 2022. We evaluated various clustering models, including k-means, k-medoids and agglomerative hierarchical clustering. Internal validation techniques were used to guide best model selection, while external validation on independent test sets was used to assess model generalisability. Clinical outcomes such as specific treatment needs and maternal and fetal complications were analysed across the identified clusters. RESULTS: Our optimal model identified three clusters from routinely available variables, i.e. maternal age, pre-pregnancy BMI (BMIPG) and glucose levels at fasting and 60 and 120 min after the diagnostic OGTT (OGTT0, OGTT60 and OGTT120, respectively). Cluster 1 was characterised by the highest OGTT values and obesity prevalence. Cluster 2 displayed intermediate BMIPG and elevated OGTT0, while cluster 3 consisted mainly of participants with normal BMIPG and high values for OGTT60 and OGTT120. Treatment modalities and clinical outcomes varied among clusters. In particular, cluster 1 participants showed a much higher need for glucose-lowering medications (39.6% of participants, compared with 12.9% and 10.0% in clusters 2 and 3, respectively, p<0.0001). Cluster 1 participants were also at higher risk of delivering large-for-gestational-age infants. 
Differences in the type of insulin-based treatment between cluster 2 and cluster 3 were observed in the external validation cohort. CONCLUSIONS/INTERPRETATION: Our findings confirm the heterogeneity of GDM. The identification of subgroups (clusters) has the potential to help clinicians define more tailored treatment approaches for improved maternal and neonatal outcomes.
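
Editorial illustration: the model-selection step in the METHODS of this abstract (fit candidate clusterings, keep the one favoured by internal validation) can be sketched with k-means and the mean silhouette width. This is only a toy under stated assumptions: the abstract does not say which internal validation index was used, the silhouette is one common choice, the twelve two-dimensional points are synthetic and are not study data, and the function names are hypothetical.

```python
import math


def kmeans(points, k, iterations=20):
    """k-means with deterministic farthest-point initialisation."""
    centers = [points[0]]
    while len(centers) < k:
        centers.append(max(points, key=lambda p: min(math.dist(p, c) for c in centers)))
    labels = [0] * len(points)
    for _ in range(iterations):
        labels = [min(range(k), key=lambda c: math.dist(p, centers[c])) for p in points]
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centers[c] = tuple(sum(col) / len(members) for col in zip(*members))
    return labels


def silhouette(points, labels):
    """Mean silhouette width; singleton clusters score 0 by convention."""
    scores = []
    for i, p in enumerate(points):
        by_cluster = {}
        for j, q in enumerate(points):
            if i != j:
                by_cluster.setdefault(labels[j], []).append(math.dist(p, q))
        own = by_cluster.pop(labels[i], None)
        if not own or not by_cluster:
            scores.append(0.0)
            continue
        a = sum(own) / len(own)                                 # cohesion
        b = min(sum(d) / len(d) for d in by_cluster.values())  # separation
        scores.append((b - a) / (max(a, b) or 1.0))
    return sum(scores) / len(scores)


# Synthetic, already-scaled toy points forming three tight groups.
points = [(0.0, 0.0), (0.2, 0.1), (0.1, 0.3), (-0.1, 0.1),
          (5.0, 5.0), (5.2, 5.1), (4.9, 5.3), (5.1, 4.8),
          (0.0, 5.0), (0.3, 5.2), (-0.2, 4.9), (0.1, 5.1)]
scores = {k: silhouette(points, kmeans(points, k)) for k in range(2, 6)}
best_k = max(scores, key=scores.get)
print(best_k, {k: round(s, 2) for k, s in scores.items()})
```

In practice the clinical variables would be standardised first so that no single unit dominates the Euclidean distance, and the chosen model would then be checked on an independent cohort, as the abstract describes.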


Subjects
Gestational Diabetes , Humans , Gestational Diabetes/epidemiology , Gestational Diabetes/diagnosis , Female , Pregnancy , Adult , Cluster Analysis , Body Mass Index , Pregnancy Outcome/epidemiology , Glucose Tolerance Test , Blood Glucose/metabolism , Cohort Studies , Maternal Age