Results 1 - 20 of 5,173
1.
Cell ; 187(10): 2343-2358, 2024 May 09.
Article in English | MEDLINE | ID: mdl-38729109

ABSTRACT

As the number of single-cell datasets continues to grow rapidly, workflows that map new data to well-curated reference atlases offer enormous promise for the biological community. In this perspective, we discuss key computational challenges and opportunities for single-cell reference-mapping algorithms. We discuss how mapping algorithms will enable the integration of diverse datasets across disease states, molecular modalities, genetic perturbations, and diverse species and will eventually replace manual and laborious unsupervised clustering pipelines.


Subject(s)
Algorithms , Single-Cell Analysis , Single-Cell Analysis/methods , Humans , Computational Biology/methods , Data Analysis , Animals , Cluster Analysis
2.
Cell ; 186(26): 5677-5689, 2023 12 21.
Article in English | MEDLINE | ID: mdl-38065099

ABSTRACT

RNA sequencing in situ allows for whole-transcriptome characterization at high resolution, while retaining spatial information. These data present an analytical challenge for bioinformatics: how can spatial information be leveraged effectively? Properties of data with a spatial dimension require special handling, which necessitates a different set of statistical and inferential considerations when compared to non-spatial data. The geographical sciences primarily use spatial data and have developed methods to analyze them. Here we discuss the challenges associated with spatial analysis and examine how we can take advantage of practice from the geographical sciences to realize the full potential of spatial information in transcriptomic datasets.


Subject(s)
Data Analysis , Spatial Analysis , Transcriptome , Computational Biology , Gene Expression Profiling , Transcriptome/genetics
3.
Nat Rev Genet ; 24(4): 235-250, 2023 04.
Article in English | MEDLINE | ID: mdl-36476810

ABSTRACT

Genome sequencing and analysis allow researchers to decode the functional information hidden in DNA sequences as well as to study cell-to-cell variation within a cell population. Traditionally, the primary bottleneck in genomic analysis pipelines has been the sequencing itself, which has been much more expensive than the computational analyses that follow. However, an important consequence of the continued drive to expand the throughput of sequencing platforms at lower cost is that the analytical pipelines often struggle to keep up with the sheer amount of raw data produced. Computational cost and efficiency have thus become of ever-increasing importance. Recent methodological advances, such as data sketching, accelerators and domain-specific libraries/languages, promise to address these modern computational challenges. However, despite being more efficient, these innovations come with a new set of trade-offs, both expected, such as accuracy versus memory and expense versus time, and more subtle, including the human expertise needed to use non-standard programming interfaces and set up complex infrastructure. In this Review, we discuss how to navigate these new methodological advances and their trade-offs.


Subject(s)
Genome , Genomics , Humans , Chromosome Mapping , Data Analysis
4.
Nat Rev Mol Cell Biol ; 23(5): 303-304, 2022 05.
Article in English | MEDLINE | ID: mdl-35197610
5.
Nature ; 621(7977): 206-214, 2023 Sep.
Article in English | MEDLINE | ID: mdl-37648856

ABSTRACT

Transient receptor potential (TRP) channels are a large, eukaryotic ion channel superfamily that controls diverse physiological functions, and therefore are attractive drug targets [1-5]. More than 210 structures from more than 20 different TRP channels have been determined, and all are tetramers [4]. Despite this wealth of structures, many aspects concerning TRPV channels remain poorly understood, including the pore-dilation phenomenon, whereby prolonged activation leads to increased conductance, permeability to large ions and loss of rectification [6,7]. Here, we used high-speed atomic force microscopy (HS-AFM) to analyse membrane-embedded TRPV3 at the single-molecule level and discovered a pentameric state. HS-AFM dynamic imaging revealed transience and reversibility of the pentamer in dynamic equilibrium with the canonical tetramer through membrane diffusive protomer exchange. The pentamer population increased upon diphenylboronic anhydride (DPBA) addition, an agonist that has been shown to induce TRPV3 pore dilation. On the basis of these findings, we designed a protein production and data analysis pipeline that resulted in a cryogenic-electron microscopy structure of the TRPV3 pentamer, showing an enlarged pore compared to the tetramer. The slow kinetics to enter and exit the pentameric state, the increased pentamer formation upon DPBA addition and the enlarged pore indicate that the pentamer represents the structural correlate of pore dilation. We thus show membrane diffusive protomer exchange as an additional mechanism for structural changes and conformational variability. Overall, we provide structural evidence for a non-canonical pentameric TRP-channel assembly, laying the foundation for new directions in TRP channel research.


Subject(s)
Protein Multimerization , TRPV Cation Channels , Anhydrides/chemistry , Anhydrides/pharmacology , Data Analysis , Diffusion , Protein Subunits/chemistry , Protein Subunits/drug effects , Protein Subunits/metabolism , TRPV Cation Channels/chemistry , TRPV Cation Channels/drug effects , TRPV Cation Channels/metabolism , TRPV Cation Channels/ultrastructure , Microscopy, Atomic Force , Molecular Targeted Therapy , Cryoelectron Microscopy , Protein Structure, Quaternary/drug effects , Protein Multimerization/drug effects
7.
Nature ; 603(7903): 864-870, 2022 03.
Article in English | MEDLINE | ID: mdl-35296856

ABSTRACT

The COVID-19 pandemic has devastated many low- and middle-income countries, causing widespread food insecurity and a sharp decline in living standards [1]. In response to this crisis, governments and humanitarian organizations worldwide have distributed social assistance to more than 1.5 billion people [2]. Targeting is a central challenge in administering these programmes: it remains a difficult task to rapidly identify those with the greatest need given available data [3,4]. Here we show that data from mobile phone networks can improve the targeting of humanitarian assistance. Our approach uses traditional survey data to train machine-learning algorithms to recognize patterns of poverty in mobile phone data; the trained algorithms can then prioritize aid to the poorest mobile subscribers. We evaluate this approach by studying a flagship emergency cash transfer program in Togo, which used these algorithms to disburse millions of US dollars' worth of COVID-19 relief aid. Our analysis compares outcomes (including exclusion errors, total social welfare and measures of fairness) under different targeting regimes. Relative to the geographic targeting options considered by the Government of Togo, the machine-learning approach reduces errors of exclusion by 4-21%. Relative to methods requiring a comprehensive social registry (a hypothetical exercise; no such registry exists in Togo), the machine-learning approach increases exclusion errors by 9-35%. These results highlight the potential for new data sources to complement traditional methods for targeting humanitarian assistance, particularly in crisis settings in which traditional data are missing or out of date.
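The exclusion-error comparison mentioned in this abstract can be illustrated with a minimal sketch. It assumes a simplified definition (the share of truly poor units that a targeting rule fails to select) and a fixed budget of k recipients; the function names, the toy identifiers and the scores are hypothetical and are not taken from the study.

```python
def poorest(scores, k):
    """Return the IDs of the k units with the lowest scores (lower = poorer)."""
    return sorted(scores, key=scores.get)[:k]


def exclusion_error(true_poor, targeted):
    """Share of truly poor units that the targeting rule leaves out."""
    true_poor = set(true_poor)
    if not true_poor:
        raise ValueError("true_poor must contain at least one unit")
    missed = true_poor - set(targeted)
    return len(missed) / len(true_poor)


# Toy data (hypothetical): surveyed consumption versus a model's prediction.
true_consumption = {"a": 1.0, "b": 2.0, "c": 3.0, "d": 4.0}
predicted_consumption = {"a": 1.5, "b": 3.5, "c": 2.0, "d": 4.0}

k = 2  # number of transfers the budget allows
truly_poor = poorest(true_consumption, k)      # ground truth from the survey
targeted = poorest(predicted_consumption, k)   # who the model would select
error = exclusion_error(truly_poor, targeted)
```

Comparing `error` across rules (for example, a geographic rule versus a model-based rule) at the same k is one simple way to read percentage differences like those quoted in the abstract.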


Subject(s)
COVID-19 , Cell Phone , Machine Learning , Relief Work , COVID-19/epidemiology , Data Analysis , Humans , Pandemics , Poverty
8.
Nat Methods ; 21(7): 1166-1170, 2024 Jul.
Article in English | MEDLINE | ID: mdl-38877315

ABSTRACT

The growth of omic data presents evolving challenges in data manipulation, analysis and integration. Addressing these challenges, Bioconductor provides an extensive community-driven biological data analysis platform. Meanwhile, tidy R programming offers a revolutionary data organization and manipulation standard. Here we present the tidyomics software ecosystem, bridging Bioconductor to the tidy R paradigm. This ecosystem aims to streamline omic analysis, ease learning and encourage cross-disciplinary collaborations. We demonstrate the effectiveness of tidyomics by analyzing 7.5 million peripheral blood mononuclear cells from the Human Cell Atlas, spanning six data frameworks and ten analysis tools.


Subject(s)
Software , Humans , Computational Biology/methods , Leukocytes, Mononuclear/metabolism , Leukocytes, Mononuclear/cytology , Genomics/methods , Data Analysis
9.
Nature ; 596(7871): 211-220, 2021 08.
Article in English | MEDLINE | ID: mdl-34381231

ABSTRACT

Deciphering the principles and mechanisms by which gene activity orchestrates complex cellular arrangements in multicellular organisms has far-reaching implications for research in the life sciences. Recent technological advances in next-generation sequencing- and imaging-based approaches have established the power of spatial transcriptomics to measure expression levels of all or most genes systematically throughout tissue space, and have been adopted to generate biological insights in neuroscience, development and plant biology as well as to investigate a range of disease contexts, including cancer. Similar to datasets made possible by genomic sequencing and population health surveys, the large-scale atlases generated by this technology lend themselves to exploratory data analysis for hypothesis generation. Here we review spatial transcriptomic technologies and describe the repertoire of operations available for paths of analysis of the resulting data. Spatial transcriptomics can also be deployed for hypothesis testing using experimental designs that compare time points or conditions, including genetic or environmental perturbations. Finally, spatial transcriptomic data are naturally amenable to integration with other data modalities, providing an expandable framework for insight into tissue organization.


Subject(s)
Gene Expression Profiling/methods , Organ Specificity/genetics , Transcriptome , Animals , Data Analysis , Disease/genetics , Humans , Transcription, Genetic/genetics
10.
Nature ; 597(7874): 119-125, 2021 09.
Article in English | MEDLINE | ID: mdl-34433969

ABSTRACT

Meningiomas are the most common primary intracranial tumour in adults [1]. Patients with symptoms are generally treated with surgery as there are no effective medical therapies. The World Health Organization histopathological grade of the tumour and the extent of resection at surgery (Simpson grade) are associated with the recurrence of disease; however, they do not accurately reflect the clinical behaviour of all meningiomas [2]. Molecular classifications of meningioma that reliably reflect tumour behaviour and inform on therapies are required. Here we introduce four consensus molecular groups of meningioma by combining DNA somatic copy-number aberrations, DNA somatic point mutations, DNA methylation and messenger RNA abundance in a unified analysis. These molecular groups more accurately predicted clinical outcomes compared with existing classification schemes. Each molecular group showed distinctive and prototypical biology (immunogenic, benign NF2 wild-type, hypermetabolic and proliferative) that informed therapeutic options. Proteogenomic characterization reinforced the robustness of the newly defined molecular groups and uncovered highly abundant and group-specific protein targets that we validated using immunohistochemistry. Single-cell RNA sequencing revealed inter-individual variations in meningioma as well as variations in intrinsic expression programs in neoplastic cells that mirrored the biology of the molecular groups identified.


Subject(s)
Biomarkers, Tumor/metabolism , Meningioma/classification , Meningioma/metabolism , Proteogenomics , DNA Methylation , Data Analysis , Drug Discovery , Female , Gene Expression Regulation, Neoplastic , Humans , Immunohistochemistry , Male , Meningioma/drug therapy , Meningioma/genetics , Mutation , RNA-Seq , Reproducibility of Results , Single-Cell Analysis
11.
Nature ; 589(7840): 82-87, 2021 01.
Article in English | MEDLINE | ID: mdl-33171481

ABSTRACT

The coronavirus disease 2019 (COVID-19) pandemic markedly changed human mobility patterns, necessitating epidemiological models that can capture the effects of these changes in mobility on the spread of severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) [1]. Here we introduce a metapopulation susceptible-exposed-infectious-removed (SEIR) model that integrates fine-grained, dynamic mobility networks to simulate the spread of SARS-CoV-2 in ten of the largest US metropolitan areas. Our mobility networks are derived from mobile phone data and map the hourly movements of 98 million people from neighbourhoods (or census block groups) to points of interest such as restaurants and religious establishments, connecting 56,945 census block groups to 552,758 points of interest with 5.4 billion hourly edges. We show that by integrating these networks, a relatively simple SEIR model can accurately fit the real case trajectory, despite substantial changes in the behaviour of the population over time. Our model predicts that a small minority of 'superspreader' points of interest account for a large majority of the infections, and that restricting the maximum occupancy at each point of interest is more effective than uniformly reducing mobility. Our model also correctly predicts higher infection rates among disadvantaged racial and socioeconomic groups [2-8] solely as the result of differences in mobility: we find that disadvantaged groups have not been able to reduce their mobility as sharply, and that the points of interest that they visit are more crowded and are therefore associated with higher risk. By capturing who is infected at which locations, our model supports detailed analyses that can inform more-effective and equitable policy responses to COVID-19.
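For readers unfamiliar with the SEIR compartments named in this abstract, the following is a minimal sketch of a textbook single-population SEIR model integrated with forward Euler steps. It is not the metapopulation, mobility-network model the abstract describes; the function name and all parameter values (transmission rate beta, incubation rate sigma, removal rate gamma) are illustrative assumptions.

```python
def simulate_seir(population, initial_infected, beta, sigma, gamma, days, dt=0.1):
    """Forward-Euler integration of a single-population SEIR model.

    Returns one (S, E, I, R) tuple per day, starting with day 0.
    """
    s = float(population - initial_infected)
    e = 0.0
    i = float(initial_infected)
    r = 0.0
    history = [(s, e, i, r)]
    steps_per_day = round(1 / dt)
    for _ in range(days):
        for _ in range(steps_per_day):
            new_exposed = beta * s * i / population * dt   # S -> E
            new_infectious = sigma * e * dt                # E -> I
            new_removed = gamma * i * dt                   # I -> R
            s -= new_exposed
            e += new_exposed - new_infectious
            i += new_infectious - new_removed
            r += new_removed
        history.append((s, e, i, r))
    return history


# Illustrative parameters only; they are not fitted to any data.
history = simulate_seir(
    population=1_000_000, initial_infected=10,
    beta=0.5, sigma=0.25, gamma=0.2, days=120,
)
```

A metapopulation version would keep one such set of compartments per neighbourhood and let the S -> E flow depend on who visits which location in each hour, which is where mobility data would enter.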


Subject(s)
COVID-19/epidemiology , COVID-19/prevention & control , Computer Simulation , Locomotion , Physical Distancing , Racial Groups/statistics & numerical data , Socioeconomic Factors , COVID-19/transmission , Cell Phone/statistics & numerical data , Data Analysis , Humans , Mobile Applications/statistics & numerical data , Religion , Restaurants/organization & administration , Risk Assessment , Time Factors
12.
PLoS Genet ; 20(3): e1011189, 2024 Mar.
Article in English | MEDLINE | ID: mdl-38484017

ABSTRACT

RNA sequencing (RNA-Seq) is widely used to capture transcriptome dynamics across tissues, biological entities, and conditions. Currently, few or no methods can handle multiple biological variables (e.g., tissues/phenotypes) and their interactions simultaneously, while also achieving dimension reduction (DR). We propose INSIDER, a general and flexible statistical framework based on matrix factorization, which is freely available at https://github.com/kai0511/insider. INSIDER decomposes variation from different biological variables and their interactions into a shared low-rank latent space. In particular, it introduces the elastic net penalty to induce sparsity while considering the grouping effects of genes. It can achieve DR of high-dimensional data (of ≥ 3 dimensions), as opposed to conventional methods (e.g., PCA/NMF) which generally only handle 2D data (e.g., sample × expression). In addition, it enables computing 'adjusted' expression profiles for specific biological variables while controlling variation from other variables. INSIDER is computationally efficient and accommodates missing data. INSIDER also performed similarly to or outperformed a close competing method, SDA, as shown in simulations and can handle complex missing data in RNA-Seq data. Moreover, unlike SDA, it can be used when the data cannot be structured into a tensor. Lastly, we demonstrate its usefulness via real data analysis, including clustering donors for disease subtyping, revealing neuro-development trajectory using the BrainSpan data, and uncovering biological processes contributing to variables of interest (e.g., disease status and tissue) and their interactions.
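The elastic net penalty mentioned in this abstract combines an L1 term (which drives small coefficients to exactly zero) with an L2 term (which shrinks correlated coefficients together). The sketch below assumes the common parameterization lam * (alpha * |w|_1 + (1 - alpha) / 2 * |w|_2^2); whether INSIDER uses this exact form is not stated in the abstract, and the function names here are hypothetical.

```python
def elastic_net_penalty(weights, lam, alpha):
    """lam * (alpha * L1 norm + (1 - alpha) / 2 * squared L2 norm)."""
    l1 = sum(abs(w) for w in weights)
    l2_squared = sum(w * w for w in weights)
    return lam * (alpha * l1 + (1.0 - alpha) / 2.0 * l2_squared)


def elastic_net_prox(values, lam, alpha, step=1.0):
    """Proximal operator of the penalty above, applied coordinate-wise.

    Soft-thresholding handles the L1 part (exact zeros); the division
    handles the L2 part (uniform shrinkage).
    """
    threshold = step * lam * alpha
    scale = 1.0 + step * lam * (1.0 - alpha)
    result = []
    for v in values:
        magnitude = max(abs(v) - threshold, 0.0)
        sign = 1.0 if v >= 0 else -1.0
        result.append(sign * magnitude / scale)
    return result


# A small coefficient is zeroed out, a large one is only shrunk.
shrunk = elastic_net_prox([0.03, 1.0, -0.5], lam=0.1, alpha=0.5)
```

In a matrix factorization setting, a step like `elastic_net_prox` would typically be applied to the factor entries after each gradient update, which is one standard way such a penalty yields sparse factors.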


Subject(s)
Algorithms , Transcriptome , Transcriptome/genetics , Sequence Analysis, RNA , Data Analysis , RNA/genetics , Gene Expression Profiling/methods , Single-Cell Analysis/methods , Cluster Analysis
13.
Am J Hum Genet ; 110(5): 762-773, 2023 05 04.
Article in English | MEDLINE | ID: mdl-37019109

ABSTRACT

The ongoing release of large-scale sequencing data in the UK Biobank allows for the identification of associations between rare variants and complex traits. SAIGE-GENE+ is a valid approach to conducting set-based association tests for quantitative and binary traits. However, for ordinal categorical phenotypes, applying SAIGE-GENE+ by treating the trait as quantitative or by binarizing it can cause inflated type I error rates or power loss. In this study, we propose a scalable and accurate method for rare-variant association tests, POLMM-GENE, in which we use a proportional odds logistic mixed model to characterize ordinal categorical phenotypes while adjusting for sample relatedness. POLMM-GENE fully utilizes the categorical nature of phenotypes and thus can control type I error rates well while remaining powerful. In the analyses of UK Biobank 450k whole-exome-sequencing data for five ordinal categorical traits, POLMM-GENE identified 54 gene-phenotype associations.
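The proportional odds model named in this abstract can be illustrated with a minimal sketch of how category probabilities arise from cumulative logits. It assumes the convention P(Y <= j) = logistic(cutpoint_j - eta), where eta is the linear predictor, and it omits the random effect that the mixed model uses to adjust for sample relatedness; the function names and numbers are hypothetical.

```python
import math


def logistic(x):
    """Standard logistic function."""
    return 1.0 / (1.0 + math.exp(-x))


def category_probabilities(cutpoints, eta):
    """P(Y = j) for j = 0..len(cutpoints) under a proportional odds model.

    The same eta shifts every cumulative logit, which is the
    'proportional odds' assumption.
    """
    if any(b <= a for a, b in zip(cutpoints, cutpoints[1:])):
        raise ValueError("cutpoints must be strictly increasing")
    cumulative = [logistic(c - eta) for c in cutpoints] + [1.0]
    probabilities = []
    previous = 0.0
    for value in cumulative:
        probabilities.append(value - previous)
        previous = value
    return probabilities


# Four ordered categories need three cutpoints (illustrative values).
low = category_probabilities([-1.0, 0.0, 1.5], eta=-1.0)
high = category_probabilities([-1.0, 0.0, 1.5], eta=1.0)
```

Under this convention a larger eta moves probability mass toward higher categories, so a genotype effect enters the model by changing eta while the cutpoints stay fixed.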


Subject(s)
Exome , Genome-Wide Association Study , Genome-Wide Association Study/methods , Exome/genetics , Biological Specimen Banks , Phenotype , Data Analysis , United Kingdom
14.
Genome Res ; 33(2): 261-268, 2023 02.
Article in English | MEDLINE | ID: mdl-36828587

ABSTRACT

There are thousands of well-maintained high-quality open-source software utilities for all aspects of scientific data analysis. For more than a decade, the Galaxy Project has been providing computational infrastructure and a unified user interface for these tools to make them accessible to a wide range of researchers. To streamline the process of integrating tools and constructing workflows as much as possible, we have developed Planemo, a software development kit for tool and workflow developers and Galaxy power users. Here we outline Planemo's implementation and describe its broad range of functionality for designing, testing, and executing Galaxy tools, workflows, and training material. In addition, we discuss the philosophy underlying Galaxy tool and workflow development, and how Planemo encourages the use of development best practices, such as test-driven development, by its users, including those who are not professional software developers.


Subject(s)
Computational Biology , Software , Workflow , Data Analysis
15.
Nat Methods ; 20(11): 1822-1829, 2023 Nov.
Article in English | MEDLINE | ID: mdl-37783883

ABSTRACT

Volumetric brain atlases are increasingly used to integrate and analyze diverse experimental neuroscience data acquired from animal models, but until recently a publicly available digital atlas with complete coverage of the rat brain has been missing. Here we present an update of the Waxholm Space rat brain atlas, a comprehensive open-access volumetric atlas resource. This brain atlas features annotations of 222 structures, of which 112 are new and 57 revised compared to previous versions. It provides a detailed map of the cerebral cortex, hippocampal region, striatopallidal areas, midbrain dopaminergic system, thalamic cell groups, the auditory system and main fiber tracts. We document the criteria underlying the annotations and demonstrate how the atlas with related tools and workflows can be used to support interpretation, integration, analysis and dissemination of experimental rat brain data.


Subject(s)
Brain Mapping , Brain , Rats , Animals , Cerebral Cortex , Dopamine , Data Analysis , Magnetic Resonance Imaging
16.
Brief Bioinform ; 25(3), 2024 Mar 27.
Article in English | MEDLINE | ID: mdl-38701410

ABSTRACT

Potentially pathogenic or probiotic microbes can be identified by comparing their abundance levels between healthy and diseased populations, or more broadly, by linking microbiome composition with clinical phenotypes or environmental factors. However, in microbiome studies, feature tables provide relative rather than absolute abundance of each feature in each sample, as the microbial loads of the samples and the ratios of sequencing depth to microbial load are both unknown and subject to considerable variation. Moreover, microbiome abundance data are count-valued, often over-dispersed and contain a substantial proportion of zeros. To carry out differential abundance analysis while addressing these challenges, we introduce mbDecoda, a model-based approach for debiased analysis of sparse compositions of microbiomes. mbDecoda employs a zero-inflated negative binomial model, linking mean abundance to the variable of interest through a log link function, and it accommodates the adjustment for confounding factors. To efficiently obtain maximum likelihood estimates of model parameters, an Expectation Maximization algorithm is developed. A minimum coverage interval approach is then proposed to rectify compositional bias, enabling accurate and reliable absolute abundance analysis. Through extensive simulation studies and analysis of real-world microbiome datasets, we demonstrate that mbDecoda compares favorably with state-of-the-art methods in terms of effectiveness, robustness and reproducibility.
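The zero-inflated negative binomial model with a log link mentioned in this abstract can be sketched as follows. The sketch assumes the mean/size parameterization of the negative binomial and a single covariate; it shows only the probability mass function, not the Expectation Maximization fitting or the bias-correction step, and all names and values are hypothetical.

```python
import math


def nb_pmf(y, mu, size):
    """Negative binomial pmf with mean mu and dispersion parameter size."""
    log_p = (
        math.lgamma(y + size) - math.lgamma(size) - math.lgamma(y + 1)
        + size * math.log(size / (size + mu))
        + y * math.log(mu / (size + mu))
    )
    return math.exp(log_p)


def zinb_pmf(y, mu, size, zero_prob):
    """Mixture of a point mass at zero (weight zero_prob) and a negative binomial."""
    base = (1.0 - zero_prob) * nb_pmf(y, mu, size)
    return zero_prob + base if y == 0 else base


def mean_from_log_link(intercept, slope, x):
    """Log link: log(mu) = intercept + slope * x."""
    return math.exp(intercept + slope * x)


# Illustrative values: covariate x = 1 (e.g. a case/control indicator).
mu = mean_from_log_link(intercept=1.0, slope=0.5, x=1.0)
p_zero = zinb_pmf(0, mu, size=2.0, zero_prob=0.3)
```

The extra mass at zero accounts for the many zeros in count tables, while the size parameter lets the variance exceed the mean, matching the over-dispersion described in the abstract.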


Subject(s)
Algorithms , Microbiota , Humans , Data Analysis
17.
Brief Bioinform ; 25(2), 2024 Jan 22.
Article in English | MEDLINE | ID: mdl-38349057

ABSTRACT

Efficient and accurate recognition of protein-DNA interactions is vital for understanding the molecular mechanisms of related biological processes and further guiding drug discovery. Although the current experimental protocols are the most precise way to determine protein-DNA binding sites, they tend to be labor-intensive and time-consuming. There is an immediate need to design efficient computational approaches for predicting DNA-binding sites. Here, we proposed ULDNA, a new deep-learning model, to deduce DNA-binding sites from protein sequences. This model leverages an LSTM-attention architecture, embedded with three unsupervised language models that are pre-trained on large-scale sequences from multiple database sources. To prove its effectiveness, ULDNA was tested on 229 protein chains with experimental annotation of DNA-binding sites. Results from computational experiments revealed that ULDNA significantly improves the accuracy of DNA-binding site prediction in comparison with 17 state-of-the-art methods. In-depth data analyses showed that the major strength of ULDNA stems from employing three transformer language models. Specifically, these language models capture complementary feature embeddings with evolution diversity, in which the complex DNA-binding patterns are buried. Meanwhile, the specially crafted LSTM-attention network effectively decodes evolution diversity-based embeddings as DNA-binding results at the residue level. Our findings demonstrated a new pipeline for predicting DNA-binding sites on a large scale with high accuracy from protein sequence alone.


Subject(s)
Data Analysis , Language , Binding Sites , Amino Acid Sequence , Databases, Factual
18.
Brief Bioinform ; 25(2), 2024 Jan 22.
Article in English | MEDLINE | ID: mdl-38426327

ABSTRACT

Cluster assignment is vital to analyzing single-cell RNA sequencing (scRNA-seq) data to understand high-level biological processes. Deep learning-based clustering methods have recently been widely used in scRNA-seq data analysis. However, existing deep models often overlook the interconnections and interactions among network layers, leading to the loss of structural information within the network layers. Herein, we develop a new self-supervised clustering method based on an adaptive multi-scale autoencoder, called scAMAC. The self-supervised clustering network utilizes the Multi-Scale Attention mechanism to fuse the feature information from the encoder, hidden and decoder layers of the multi-scale autoencoder, which enables the exploration of cellular correlations within the same scale and captures deep features across different scales. The self-supervised clustering network calculates the membership matrix using the fused latent features and optimizes the clustering network based on the membership matrix. scAMAC employs an adaptive feedback mechanism to supervise the parameter updates of the multi-scale autoencoder, obtaining a more effective representation of cell features. scAMAC not only enables cell clustering but also performs data reconstruction through the decoding layer. Through extensive experiments, we demonstrate that scAMAC is superior to several advanced clustering and imputation methods in both data clustering and reconstruction. In addition, scAMAC is beneficial for downstream analysis, such as cell trajectory inference. Our scAMAC model codes are freely available at https://github.com/yancy2024/scAMAC.


Subject(s)
Data Analysis , Single-Cell Gene Expression Analysis , Cluster Analysis , Sequence Analysis, RNA , Gene Expression Profiling , Algorithms
19.
Brief Bioinform ; 25(2), 2024 Jan 22.
Article in English | MEDLINE | ID: mdl-38349062

ABSTRACT

Single-cell RNA sequencing (scRNA-seq) has emerged as a powerful tool to gain biological insights at the cellular level. However, due to technical limitations of the existing sequencing technologies, low gene expression values are often omitted, leading to inaccurate gene counts. Existing methods, including advanced deep learning techniques, struggle to reliably impute gene expression because they lack mechanisms that explicitly consider the underlying biological knowledge of the system. It has long been recognized that gene-gene interactions may serve as reflective indicators of underlying biological processes, presenting discriminative signatures of the cells. A genomic data analysis framework that is capable of leveraging the underlying gene-gene interactions is thus highly desirable and could allow for more reliable identification of distinctive patterns in the genomic data through extraction and integration of their intricate biological characteristics. Here we tackle the problem in two steps to exploit the gene-gene interactions of the system. We first reposition the genes into a 2D grid such that their spatial configuration reflects their interactive relationships. To alleviate the need for labeled ground truth gene expression datasets, a self-supervised 2D convolutional neural network is then employed to extract the contextual features of the interactions from the spatially configured genes and impute the omitted values. Extensive experiments with both simulated and experimental scRNA-seq datasets are carried out to demonstrate the superior performance of the proposed strategy against the existing imputation methods.


Subject(s)
Deep Learning , Epistasis, Genetic , Data Analysis , Genomics , Gene Expression , Gene Expression Profiling , Sequence Analysis, RNA
20.
Brief Bioinform ; 25(2), 2024 Jan 22.
Article in English | MEDLINE | ID: mdl-38300514

ABSTRACT

Somatic copy number alterations (SCNAs) are a predominant type of oncogenomic alterations that affect a large proportion of the genome in the majority of cancer samples. Current technologies allow high-throughput measurement of such copy number aberrations, generating results consisting of frequently large sets of SCNA segments. However, the automated annotation and integration of such data are particularly challenging because the measured signals reflect biased, relative copy number ratios. In this study, we introduce labelSeg, an algorithm designed for rapid and accurate annotation of CNA segments, with the aim of enhancing the interpretation of tumor SCNA profiles. Leveraging density-based clustering and exploiting the length-amplitude relationships of SCNA, our algorithm proficiently identifies distinct relative copy number states from individual segment profiles. Its compatibility with most CNA measurement platforms makes it suitable for large-scale integrative data analysis. We confirmed its performance on both simulated and sample-derived data from The Cancer Genome Atlas reference dataset, and we demonstrated its utility in integrating heterogeneous segment profiles from different data sources and measurement platforms. Our comparative and integrative analysis revealed common SCNA patterns in cancer and protein-coding genes with a strong correlation between SCNA and messenger RNA expression, promoting the investigation into the role of SCNA in cancer development.


Subject(s)
DNA Copy Number Variations , Neoplasms , Humans , Neoplasms/genetics , Algorithms , Cluster Analysis , Data Analysis