ABSTRACT
Animal survival requires that a functioning nervous system develop during embryogenesis. Newborn neurons must assemble into circuits producing activity patterns capable of instructing behaviors. Elucidating how this process is coordinated requires new methods that follow the maturation and activity of all cells across a developing circuit. We present an imaging method for comprehensively tracking neuron lineages, movements, molecular identities, and activity in the entire developing zebrafish spinal cord, from neurogenesis until the emergence of patterned activity instructing the earliest spontaneous motor behavior. We found that motoneurons are active first and form local patterned ensembles with neighboring neurons. These ensembles merge, synchronize globally after reaching a threshold size, and finally recruit commissural interneurons to orchestrate the left-right alternating patterns important for locomotion in vertebrates. Individual neurons undergo functional maturation stereotypically based on their birth time and anatomical origin. Our study provides a general strategy for reconstructing how functioning circuits emerge during embryogenesis.
ABSTRACT
Investigating large datasets of biological information with automated procedures may offer opportunities for progress in knowledge. Recently, tremendous improvements in structural biology have rapidly increased the number of structures in the Protein Data Bank (PDB) archive, in particular those of severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2)-associated proteins. However, automatic analysis can be hampered by the nonuniform descriptors that authors use in some records of the PDB and PDBx/mmCIF files. In this opinion article we highlight the difficulties encountered in automating the analysis of hundreds of structures, and we suggest that further standardization of the description of these molecular entities and their attributes, generalized to all macromolecular structures contained in the PDB, would generate files more suitable for automated analyses of large numbers of structures.
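To illustrate the kind of automated pass that nonuniform descriptors frustrate, here is a minimal hedged sketch using Biopython's MMCIF2Dict to tally the free-text `_entity.pdbx_description` field across a batch of PDBx/mmCIF files; the file names are placeholders and the tallying heuristic is ours, not a method from the article.

```python
# A minimal sketch, assuming Biopython is installed; file names are placeholders.
from collections import Counter
from Bio.PDB.MMCIF2Dict import MMCIF2Dict

descriptions = Counter()
for path in ["entry1.cif", "entry2.cif"]:        # placeholder mmCIF files
    record = MMCIF2Dict(path)                     # maps each mmCIF tag to its values
    # _entity.pdbx_description is free text, so the same molecule can appear
    # under many author-chosen spellings across entries
    for desc in record.get("_entity.pdbx_description", []):
        descriptions[str(desc).strip().lower()] += 1

print(descriptions.most_common(10))               # surface the descriptor variants
```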
Subjects
COVID-19, Humans, SARS-CoV-2, Proteins/chemistry, Molecular Structure, Protein Databases, Protein Conformation
ABSTRACT
In the past decade, topological data analysis has emerged as a powerful algebraic topology approach in data science. Although knot theory and related subjects are a focus of study in mathematics, their success in practical applications has been quite limited by the lack of localization and quantization. We address these challenges by introducing knot data analysis (KDA), a paradigm that incorporates curve segmentation and multiscale analysis into the Gauss link integral. The resulting multiscale Gauss link integral (mGLI) recovers the global topological properties of knots and links at an appropriate scale and offers a multiscale geometric topology approach to capturing the local structures and connectivities in data. Integrated with machine learning or deep learning, the proposed mGLI significantly outperforms other state-of-the-art methods on benchmark problems across 13 complex biological datasets, including protein flexibility analysis, protein-ligand interactions, human Ether-à-go-go-Related Gene (hERG) potassium channel blockade screening, and quantitative toxicity assessment. Our KDA opens a new research area in data science: knot deep learning.
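As a numerical illustration of the core quantity, the sketch below evaluates a discretized Gauss link integral between two curves; the midpoint discretization and the test rings are our illustrative choices, not the authors' mGLI implementation, which further adds curve segmentation and multiscale analysis.

```python
import numpy as np

def gauss_link_integral(curve_a, curve_b):
    """Discretized Gauss link integral between two polylines (n x 3 arrays)."""
    total = 0.0
    for i in range(len(curve_a) - 1):
        for j in range(len(curve_b) - 1):
            r1 = 0.5 * (curve_a[i] + curve_a[i + 1])   # segment midpoints
            r2 = 0.5 * (curve_b[j] + curve_b[j + 1])
            dr1 = curve_a[i + 1] - curve_a[i]          # segment tangent vectors
            dr2 = curve_b[j + 1] - curve_b[j]
            diff = r1 - r2
            dist = np.linalg.norm(diff)
            if dist > 1e-12:
                total += diff.dot(np.cross(dr1, dr2)) / dist**3
    return total / (4.0 * np.pi)

# two linked rings (a Hopf link): the integral approaches the linking number +/-1
t = np.linspace(0, 2 * np.pi, 200)
ring_a = np.c_[np.cos(t), np.sin(t), np.zeros_like(t)]
ring_b = np.c_[1 + np.cos(t), np.zeros_like(t), np.sin(t)]
print(gauss_link_integral(ring_a, ring_b))
```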
ABSTRACT
Understanding the historical perception and value of teacher personalities reveals key educational priorities and societal expectations. This study analyzes the evolution of teachers' ascribed Big Five personality traits from 1800 to 2019, drawing on millions of English-language books. Word frequency analysis reveals that conscientiousness is the most frequently discussed trait, followed by agreeableness, openness, extraversion, and neuroticism. This pattern underscores society's focus on whether teachers are responsible. Polarity analysis further indicates a higher prevalence of low neuroticism descriptors (e.g., patient and tolerant) in descriptions of teachers compared to the general population, reinforcing the perception of teachers as stable and dependable. The frequent use of terms like "moral", "enthusiastic", and "practical" in describing teachers highlights the positive portrayal of their personalities. However, since the mid-20th century, there has been a notable rise in negative descriptors related to openness (e.g., traditional and conventional), coupled with a decline in positive openness terms. This shift suggests an evolving view of teachers as less receptive to new ideas. These findings offer valuable insights into the historical portrayal and societal values attributed to teacher personalities.
Subjects
Language, Personality, Humans, History of the 20th Century, History of the 21st Century, History of the 19th Century, School Teachers/psychology
ABSTRACT
Most problems within and beyond the scientific domain can be framed into one of the following three levels of complexity of function approximation. Type 1: Approximate an unknown function given input/output data. Type 2: Consider a collection of variables and functions, some of which are unknown, indexed by the nodes and hyperedges of a hypergraph (a generalized graph where edges can connect more than two vertices). Given partial observations of the variables of the hypergraph (satisfying the functional dependencies imposed by its structure), approximate all the unobserved variables and unknown functions. Type 3: Expanding on Type 2, if the hypergraph structure itself is unknown, use partial observations of the variables of the hypergraph to discover its structure and approximate its unknown functions. These hypergraphs offer a natural platform for organizing, communicating, and processing computational knowledge. While most scientific problems can be framed as the data-driven discovery of unknown functions in a computational hypergraph whose structure is known (Type 2), many require the data-driven discovery of the structure (connectivity) of the hypergraph itself (Type 3). We introduce an interpretable Gaussian Process (GP) framework for such (Type 3) problems that does not require randomization of the data, access to or control over its sampling, or sparsity of the unknown functions in a known or learned basis. Its polynomial complexity, which contrasts sharply with the super-exponential complexity of causal inference methods, is enabled by the nonlinear ANOVA capabilities of GPs used as a sensing mechanism.
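To make the "GPs as a sensing mechanism" idea concrete, here is a hedged sketch that fits a GP with per-input (ARD) length scales and reads off which candidate inputs a node actually depends on; this relevance heuristic is an illustrative stand-in for the paper's ANOVA-based criterion, not its actual algorithm.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))                 # candidate parent variables x0, x1, x2
y = np.sin(X[:, 0]) + 0.5 * X[:, 1] ** 2 + 0.05 * rng.normal(size=200)

gp = GaussianProcessRegressor(
    kernel=RBF(length_scale=[1.0, 1.0, 1.0]),  # one length scale per input (ARD)
    alpha=1e-2,
    normalize_y=True,
).fit(X, y)

# a large learned length scale means the GP is nearly constant in that input,
# so the corresponding hyperedge can be pruned from the candidate hypergraph
print(gp.kernel_.length_scale)                 # expect a much larger value for x2
```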
ABSTRACT
To accelerate the impact of African genomics on human health, data science skills and awareness of Africa's rich genetic diversity must be strengthened globally. We describe the first African genomics data science workshop, implemented by the African Society of Human Genetics (AfSHG) and international partners, providing a framework for future workshops.
Subjects
Data Science, Genomics, Humans, Human Genetics
ABSTRACT
Ten years ago, a detailed analysis showed that only 33% of genome-wide association study (GWAS) results included the X chromosome. Multiple recommendations were made to combat such exclusion. Here, we re-surveyed the research landscape to determine whether these earlier recommendations had been translated into practice. Unfortunately, among the genome-wide summary statistics reported in 2021 in the NHGRI-EBI GWAS Catalog, only 25% provided results for the X chromosome and 3% for the Y chromosome, suggesting that the exclusion phenomenon not only persists but has also expanded into an exclusionary problem. Normalizing by the physical length of the chromosome, the average number of studies published through November 2022 with genome-wide-significant findings on the X chromosome is ~1 study/Mb. By contrast, it ranges from ~6 to ~16 studies/Mb for chromosomes 4 and 19, respectively. Compared with the autosomal growth rate of ~0.086 studies/Mb/year over the last decade, studies of the X chromosome grew at less than one-seventh that rate, only ~0.012 studies/Mb/year. Among the studies that reported significant associations on the X chromosome, we noted extreme heterogeneity in data analysis and reporting of results, suggesting the need for clear guidelines. Unsurprisingly, among the 430 scores sampled from the Polygenic Score Catalog, 0% contained weights for sex-chromosomal SNPs. To overcome the dearth of sex chromosome analyses, we provide five sets of recommendations and future directions. Finally, until the sex chromosomes are included in a whole-genome study, we propose that such studies would more properly be referred to as "AWASs," meaning "autosome-wide scans," instead of GWASs.
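The rate comparison can be verified directly from the quoted figures:

```python
# Sanity check of the growth rates quoted above
autosome_rate = 0.086            # studies/Mb/year, autosomal average
x_rate = 0.012                   # studies/Mb/year, X chromosome
print(x_rate / autosome_rate)    # ~0.14, i.e. less than one-seventh (~0.143)
```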
Subjects
Genome-Wide Association Study, Sex Chromosomes, Humans, Genome-Wide Association Study/methods, Y Chromosome, Genome
ABSTRACT
The ongoing release of large-scale sequencing data in the UK Biobank allows for the identification of associations between rare variants and complex traits. SAIGE-GENE+ is a valid approach to conducting set-based association tests for quantitative and binary traits. However, for ordinal categorical phenotypes, applying SAIGE-GENE+ while treating the trait as quantitative, or after binarizing it, can inflate type I error rates or lose power. In this study, we propose POLMM-GENE, a scalable and accurate method for rare-variant association tests that uses a proportional odds logistic mixed model to characterize ordinal categorical phenotypes while adjusting for sample relatedness. POLMM-GENE fully utilizes the categorical nature of the phenotypes and thus controls type I error rates well while remaining powerful. In analyses of UK Biobank 450k whole-exome-sequencing data for five ordinal categorical traits, POLMM-GENE identified 54 gene-phenotype associations.
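For orientation, a proportional odds logistic mixed model of the kind POLMM-GENE builds on can be written in a standard textbook form; the notation below is ours, not necessarily the authors' exact parameterization.

```latex
\log \frac{\Pr(y_i \le k \mid X_i, b_i)}{\Pr(y_i > k \mid X_i, b_i)}
  = \alpha_k - X_i^{\top}\beta - b_i,
\qquad k = 1, \dots, K-1,
\qquad b \sim \mathcal{N}(0, \tau \Psi),
```

where the cutpoints \(\alpha_k\) order the \(K\) phenotype categories, \(\beta\) collects covariate and genotype-set effects, and the random effects \(b\) with relatedness matrix \(\Psi\) adjust for sample relatedness.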
Subjects
Exome, Genome-Wide Association Study, Genome-Wide Association Study/methods, Exome/genetics, Biological Specimen Banks, Phenotype, Data Analysis, United Kingdom
ABSTRACT
With rapidly evolving high-throughput technologies and consistently decreasing costs, collecting multimodal omics data in large-scale studies has become feasible. Although studying multiomics provides a new, comprehensive approach to understanding the complex biological mechanisms of human diseases, the high dimensionality of omics data and the complexity of the interactions among various omics levels in contributing to disease phenotypes present tremendous analytical challenges. There is a great need for novel analytical methods to address these challenges and to facilitate multiomics analyses. In this paper, we propose a multimodal functional deep learning (MFDL) method for the analysis of high-dimensional multiomics data. The MFDL method models the complex relationships between multiomics variants and disease phenotypes through the hierarchical structure of deep neural networks and handles high-dimensional omics data using functional data analysis techniques. Furthermore, MFDL leverages the structure of the multimodal model to capture interactions between different types of omics data. Through simulation studies and real-data applications, we demonstrate the advantages of MFDL in terms of prediction accuracy and its robustness to the high dimensionality of and noise within the data.
Subjects
Deep Learning, Genomics, Humans, Genomics/methods, Computational Biology/methods, Neural Networks (Computer), Algorithms, Multiomics
ABSTRACT
Accurate and efficient prediction of polymer properties is crucial for polymer design. Recently, data-driven artificial intelligence (AI) models have demonstrated great promise in polymer property analysis. Despite this progress, a pivotal challenge for all AI-driven models remains the effective representation of molecules. Here we introduce Multi-Cover Persistence (MCP)-based molecular representation and featurization for the first time. Our MCP-based polymer descriptors are combined with machine learning models, in particular Gradient Boosting Tree (GBT) models, for polymer property prediction. Unlike all previous molecular representations, polymer molecular structure and interactions are represented through MCP, which uses Delaunay slices at different dimensions and rhomboid tiling to characterize the complicated geometric and topological information within the data. Statistical features from the generated persistence barcodes are used as polymer descriptors and further combined with the GBT model. Our model has been extensively validated on polymer benchmark datasets. We find that our models outperform traditional fingerprint-based models and achieve accuracy similar to that of geometric deep learning models. In particular, our model tends to be more effective on large monomer structures, demonstrating the great potential of MCP for characterizing more complicated polymer data. This work underscores the potential of MCP in polymer informatics, presenting a novel perspective on molecular representation and its application in polymer science.
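To illustrate the featurization step in spirit, the sketch below turns persistence barcodes (random toy bars standing in for MCP output) into fixed-length statistical features for a gradient boosting model; the particular summary statistics are our assumptions, not the paper's exact descriptor set.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

def barcode_features(bars):
    """Summary statistics of bar lengths: one fixed-length vector per barcode."""
    lengths = bars[:, 1] - bars[:, 0]
    return np.array([lengths.sum(), lengths.mean(), lengths.max(),
                     lengths.std(), len(lengths)])

rng = np.random.default_rng(1)
# toy barcodes: 20 (birth, death) pairs each, sorted so birth <= death
barcodes = [np.sort(rng.uniform(0, 5, size=(20, 2)), axis=1) for _ in range(100)]
X = np.vstack([barcode_features(b) for b in barcodes])
y = 0.3 * X[:, 0] + rng.normal(scale=0.1, size=100)   # toy property values

model = GradientBoostingRegressor().fit(X, y)
print(model.predict(X[:3]))
```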
Subjects
Machine Learning, Polymers, Polymers/chemistry, Algorithms
ABSTRACT
Single-cell RNA sequencing (scRNA-seq) has emerged as a powerful tool for gaining biological insights at the cellular level. However, due to technical limitations of existing sequencing technologies, low gene expression values are often omitted, leading to inaccurate gene counts. Existing methods, including advanced deep learning techniques, struggle to reliably impute gene expression because they lack mechanisms that explicitly consider the underlying biological knowledge of the system. In reality, it has long been recognized that gene-gene interactions may serve as reflective indicators of underlying biological processes, presenting discriminative signatures of the cells. A genomic data analysis framework capable of leveraging the underlying gene-gene interactions is thus highly desirable, as it could allow more reliable identification of distinctive patterns through extraction and integration of the data's intricate biological characteristics. Here we tackle the problem in two steps to exploit the gene-gene interactions of the system. We first reposition the genes onto a 2D grid such that their spatial configuration reflects their interactive relationships. To alleviate the need for labeled ground-truth gene expression datasets, a self-supervised 2D convolutional neural network is then employed to extract the contextual features of the interactions from the spatially configured genes and impute the omitted values. Extensive experiments with both simulated and experimental scRNA-seq datasets demonstrate the superior performance of the proposed strategy over existing imputation methods.
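A compact sketch of the two-step idea under strong simplifications: a naive correlation-based gene layout stands in for the paper's repositioning method, and a small convolutional network trained with random masking stands in for its self-supervised architecture.

```python
import numpy as np
import torch
import torch.nn as nn

rng = np.random.default_rng(0)
cells, genes, side = 256, 64, 8                 # 64 genes -> an 8x8 grid
expr = rng.gamma(2.0, 1.0, size=(cells, genes)).astype(np.float32)

# (1) naive layout: order genes by their correlation with a reference gene
order = np.argsort(np.corrcoef(expr.T)[0])
grid = torch.from_numpy(expr[:, order]).reshape(cells, 1, side, side)

# (2) self-supervised CNN: hide random entries, train to reconstruct them
net = nn.Sequential(nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(),
                    nn.Conv2d(16, 1, 3, padding=1))
opt = torch.optim.Adam(net.parameters(), lr=1e-3)
for step in range(200):
    mask = (torch.rand_like(grid) > 0.2).float()       # ~20% of entries hidden
    pred = net(grid * mask)
    loss = ((pred - grid) ** 2 * (1 - mask)).mean()    # score hidden entries only
    opt.zero_grad()
    loss.backward()
    opt.step()
print(float(loss))
```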
Subjects
Deep Learning, Genetic Epistasis, Data Analysis, Genomics, Gene Expression, Gene Expression Profiling, RNA Sequence Analysis
ABSTRACT
Biomedical research now commonly integrates diverse data types or views from the same individuals to better understand the pathobiology of complex diseases, but the challenge lies in meaningfully integrating these diverse views. Existing methods often require the same type of data from all views (cross-sectional data only or longitudinal data only) or do not consider any class outcome in the integration method, which presents limitations. To overcome these limitations, we have developed a pipeline that harnesses the power of statistical and deep learning methods to integrate cross-sectional and longitudinal data from multiple sources. In addition, it identifies key variables that contribute to the association between views and the separation between classes, providing deeper biological insights. This pipeline includes variable selection/ranking using linear and nonlinear methods, feature extraction using functional principal component analysis and Euler characteristics, and joint integration and classification using dense feed-forward networks for cross-sectional data and recurrent neural networks for longitudinal data. We applied this pipeline to cross-sectional and longitudinal multiomics data (metagenomics, transcriptomics and metabolomics) from an inflammatory bowel disease (IBD) study and identified microbial pathways, metabolites and genes that discriminate by IBD status, providing information on the etiology of IBD. We conducted simulations to compare the two feature extraction methods.
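As a concrete example of one feature extraction step, the sketch below computes an Euler characteristic curve for a one-dimensional signal: for each threshold, the Euler characteristic of the sublevel set is simply the number of connected components (1D sets have no holes). The thresholds and toy signal are illustrative assumptions, not the pipeline's exact settings.

```python
import numpy as np

def euler_curve(signal, thresholds):
    """Euler characteristic of sublevel sets {x : signal(x) <= t} at each t."""
    ecs = []
    for t in thresholds:
        below = (signal <= t).astype(int)
        # each 0 -> 1 transition starts a new run, i.e. a connected component
        ecs.append(int(np.sum(np.diff(np.concatenate(([0], below))) == 1)))
    return np.array(ecs)

x = np.linspace(0, 4 * np.pi, 200)
signal = np.sin(x) + 0.1 * np.random.default_rng(0).normal(size=200)
print(euler_curve(signal, thresholds=np.linspace(-1.2, 1.2, 8)))
```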
Subjects
Deep Learning, Inflammatory Bowel Diseases, Humans, Cross-Sectional Studies, Inflammatory Bowel Diseases/classification, Inflammatory Bowel Diseases/genetics, Longitudinal Studies, Discriminant Analysis, Metabolomics/methods, Computational Biology/methods
ABSTRACT
High-throughput genomic and proteomic scanning approaches allow investigators to quantify genome-wide genes (or gene products) under given disease conditions, which plays an essential role in promoting the discovery of disease mechanisms. These high-throughput approaches often generate large lists of genes of interest (GOIs), such as differentially expressed genes/proteins. However, researchers have to perform manual triage and validation to explore the most promising, biologically plausible linkages between known disease genes and GOIs (disease signals) for further study. Here, to address this challenge, we propose a network-based strategy, DDK-Linker, to facilitate the exploration of disease signals hidden in omics data by linking GOIs to known disease genes. Specifically, it reconstructs gene distances in the protein-protein interaction (PPI) network through six network methods (random walk with restart, DeepWalk, Node2Vec, LINE, HOPE, Laplacian) to discover disease signals in omics data that lie at shorter distances to disease genes. Furthermore, benefiting from the knowledge base we established, abundant bioinformatics annotations are provided for each candidate disease signal. To assist in omics data interpretation and ease of use, we have developed this strategy into an application that users can access through a website or download as an R package. We believe DDK-Linker will accelerate the exploration of disease genes and drug targets in a variety of omics data, such as genomics, transcriptomics and proteomics data, and provide clues for complex disease mechanism and pharmacological research. DDK-Linker is freely accessible at http://ddklinker.ncpsb.org.cn/.
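For concreteness, here is a minimal implementation of one of the six network methods named above, random walk with restart, scoring proximity to known disease genes on a toy PPI adjacency matrix; the matrix and restart probability are illustrative assumptions.

```python
import numpy as np

def rwr(adj, seeds, restart=0.5, tol=1e-8):
    """Random walk with restart; returns stationary visiting probabilities."""
    W = adj / adj.sum(axis=0, keepdims=True)     # column-normalize transitions
    p0 = np.zeros(adj.shape[0])
    p0[seeds] = 1.0 / len(seeds)                  # restart at the disease genes
    p = p0.copy()
    while True:
        p_next = (1 - restart) * W @ p + restart * p0
        if np.abs(p_next - p).sum() < tol:
            return p_next
        p = p_next

adj = np.array([[0, 1, 1, 0, 0],
                [1, 0, 1, 1, 0],
                [1, 1, 0, 0, 0],
                [0, 1, 0, 0, 1],
                [0, 0, 0, 1, 0]], dtype=float)   # toy PPI network
scores = rwr(adj, seeds=[0])                      # node 0 = known disease gene
print(scores)                                     # higher score = closer disease signal
```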
Subjects
Proteomics, Software, Proteomics/methods, Genomics/methods, Computational Biology/methods, Protein Interaction Maps
ABSTRACT
In cancer genomics, variant calling has advanced, but traditional mean-accuracy evaluations are inadequate for biomarkers like tumor mutation burden (TMB), which vary significantly across samples, affecting immunotherapy patient selection and threshold settings. In this study, we introduce TMBstable, an innovative method that dynamically selects optimal variant calling strategies for specific genomic regions using a meta-learning framework, distinguishing it from traditional callers that apply a uniform strategy to the whole sample. The process begins with segmenting the sample into windows and extracting meta-features for clustering, followed by using a pre-trained meta-model to select suitable algorithms for each cluster, thereby addressing strategy-sample mismatches, reducing performance fluctuations, and ensuring consistent performance across samples. We evaluated TMBstable on both simulated and real non-small cell lung cancer and nasopharyngeal carcinoma samples, comparing it with advanced callers. The assessment, focusing on stability measures such as the variance and coefficient of variation of the false positive rate, false negative rate, precision, and recall, involved 300 simulated and 106 real tumor samples. Benchmark results showed TMBstable's superior stability, with the lowest variance and coefficient of variation across performance metrics, highlighting its effectiveness for analyzing this counting-based biomarker. The TMBstable algorithm can be accessed at https://github.com/hello-json/TMBstable for academic usage only.
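The stability measures referred to above reduce to simple per-sample summaries; a toy illustration with made-up false positive rates:

```python
import numpy as np

fpr_caller_a = np.array([0.05, 0.20, 0.02, 0.30])   # fluctuates across samples
fpr_caller_b = np.array([0.10, 0.12, 0.09, 0.11])   # stable across samples
for name, fpr in [("caller A", fpr_caller_a), ("caller B", fpr_caller_b)]:
    cv = fpr.std() / fpr.mean()                      # coefficient of variation
    print(name, "variance:", round(fpr.var(), 4), "CV:", round(cv, 3))
```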
Subjects
Non-Small Cell Lung Carcinoma, Lung Neoplasms, Humans, High-Throughput Nucleotide Sequencing/methods, Genomics/methods, Genome, Algorithms
ABSTRACT
RNA sequencing is the gold-standard method to quantify transcriptomic changes between two conditions. The overwhelming majority of available data analysis methods focus on polyadenylated RNA transcribed from single-copy genes and overlook transcripts from repeated sequences such as transposable elements (TEs). These mobile genetic elements are increasingly studied, and specialized tools designed to handle multimapping sequencing reads are available. Transfer RNAs, transcribed by RNA polymerase III, are essential for protein translation. There is therefore a need for integrated software able to analyze multiple types of RNA. Here, we present 3t-seq, a Snakemake pipeline for integrated differential expression analysis of transcripts from single-copy genes, TEs, and tRNA. Starting from raw sequencing data, 3t-seq performs quality control, genome mapping, gene expression quantification, and statistical testing, and produces an accessible report and easy-to-use results for downstream analysis. It implements three methods to quantify TE expression and one for tRNA genes, and it provides an easy-to-configure method to manage software dependencies that lets the user focus on results. 3t-seq is released under the MIT license and is available at https://github.com/boulardlab/3t-seq.
Subjects
DNA Transposable Elements, Transfer RNA, RNA-Seq, Software, Transfer RNA/genetics, RNA-Seq/methods, Gene Expression Profiling/methods, Humans, Computational Biology/methods, RNA Sequence Analysis/methods
ABSTRACT
Data-independent acquisition (DIA) mass spectrometry (MS) has emerged as a powerful technology for high-throughput, accurate, and reproducible quantitative proteomics. This review provides a comprehensive overview of recent advances in both the experimental and computational methods for DIA proteomics, from data acquisition schemes to analysis strategies and software tools. DIA acquisition schemes are categorized based on the design of precursor isolation windows, highlighting wide-window, overlapping-window, narrow-window, scanning quadrupole-based, and parallel accumulation-serial fragmentation-enhanced DIA methods. For DIA data analysis, major strategies are classified into spectrum reconstruction, sequence-based search, library-based search, de novo sequencing, and sequencing-independent approaches. A wide array of software tools implementing these strategies are reviewed, with details on their overall workflows and scoring approaches at different steps. The generation and optimization of spectral libraries, which are critical resources for DIA analysis, are also discussed. Publicly available benchmark datasets covering global proteomics and phosphoproteomics are summarized to facilitate performance evaluation of various software tools and analysis workflows. Continued advances and synergistic developments of versatile components in DIA workflows are expected to further enhance the power of DIA-based proteomics.
Subjects
Proteomics, Software, Proteomics/methods, Mass Spectrometry/methods, Gene Library, Proteome/analysis
ABSTRACT
In the era of open-modification search engines, more posttranslational modifications (PTMs) than ever can be detected by LC-MS/MS-based proteomics. This development can shift proteomics research into a higher gear, as PTMs are key to many cellular pathways important in cell proliferation, migration, metastasis, and aging. However, despite these advances in modification identification, statistical methods for PTM-level quantification and differential analysis have yet to catch up. This absence can partly be explained by statistical challenges inherent to the data, such as the confounding of PTM intensities with their parent protein's abundance. Therefore, we have developed msqrob2PTM, a new workflow in the msqrob2 universe capable of differential abundance analysis at the PTM and at the peptidoform level. The latter is important for validating PTMs found to be significantly differential. Indeed, because our method can deal with multiple PTMs per peptidoform, a significant PTM may stem from one significant peptidoform carrying another PTM, hinting that the other PTM might be driving the perceived differential abundance. Our workflows can flag both differential peptidoform abundance (DPA) and differential peptidoform usage (DPU). This enables a distinction between direct assessment of differential abundance of peptidoforms (DPA) and differences in the relative usage of peptidoforms corrected for the corresponding protein abundances (DPU). For DPA, we directly model the log2-transformed peptidoform intensities, while for DPU, we correct for parent protein abundance with an intermediate normalization step that calculates the log2-ratio of the peptidoform intensities to their summarized parent protein intensities. We demonstrate the utility and performance of msqrob2PTM by applying it to datasets with known ground truth, as well as to biological PTM-rich datasets. Our results show that msqrob2PTM is on par with, or surpasses, the performance of current state-of-the-art methods. Moreover, msqrob2PTM is currently unique in providing output at the peptidoform level.
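In symbols, the two modeled quantities described above can be summarized as follows (our notation):

```latex
y^{\mathrm{DPA}}_{p} = \log_2 I_{p},
\qquad
y^{\mathrm{DPU}}_{p} = \log_2 I_{p} - \log_2 I_{\mathrm{prot}(p)},
```

where \(I_p\) is the intensity of peptidoform \(p\) and \(I_{\mathrm{prot}(p)}\) is the summarized intensity of its parent protein.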
Subjects
Proteomics, Tandem Mass Spectrometry, Proteomics/methods, Liquid Chromatography, Post-Translational Protein Processing, Proteins
ABSTRACT
Principal component analysis (PCA) is a dimensionality reduction method known for being simple and easy to interpret. Principal components are often interpreted as low-dimensional patterns in high-dimensional space. However, this simple interpretation fails for time series, spatial maps, and other continuous data. In these cases, nonoscillatory data may have oscillatory principal components. Here, we show that two common properties of data cause oscillatory principal components: smoothness and shifts in time or space. These two properties are present in almost all neuroscience data. We show how the oscillations produced by PCA, which we call "phantom oscillations," impact data analysis. We also show that traditional cross-validation does not detect phantom oscillations, so we suggest procedures that do. Our findings are supported by a collection of mathematical proofs. Collectively, our work demonstrates that patterns which emerge from high-dimensional data analysis may not faithfully represent the underlying data.
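The effect is straightforward to reproduce; below is a self-contained demonstration in which smooth, time-shifted, non-oscillatory signals yield oscillatory principal components (all parameters are illustrative):

```python
import numpy as np

n_trials, n_time = 100, 500
t = np.linspace(0, 1, n_time)
shifts = np.linspace(0.2, 0.8, n_trials)
# each row is a smooth Gaussian bump shifted in time -- nothing oscillates
data = np.exp(-((t[None, :] - shifts[:, None]) ** 2) / (2 * 0.05**2))

data -= data.mean(axis=0)                        # center across trials
_, _, vt = np.linalg.svd(data, full_matrices=False)
# the leading principal components look like successive sinusoid-like modes
for k in range(3):
    crossings = int(np.sum(np.diff(np.sign(vt[k])) != 0))
    print(f"PC{k + 1}: {crossings} zero crossings")  # count grows with k
```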
ABSTRACT
It is widely accepted that there is an inextricable link between neural computations, biological mechanisms, and behavior, but it is challenging to simultaneously relate all three. Here, we show that topological data analysis (TDA) provides an important bridge between these approaches to studying how brains mediate behavior. We demonstrate that cognitive processes change the topological description of the shared activity of populations of visual neurons. These topological changes constrain and distinguish between competing mechanistic models, are connected to subjects' performance on a visual change detection task, and, via a link with network control theory, reveal a tradeoff between improving sensitivity to subtle visual stimulus changes and increasing the chance that the subject will stray off task. These connections provide a blueprint for using TDA to uncover the biological and computational mechanisms by which cognition affects behavior in health and disease.
Subjects
Brain, Cognition, Humans, Cognition/physiology, Brain/physiology, Neurons/physiology
ABSTRACT
Homeostasis, the ability to maintain a relatively constant internal environment in the face of perturbations, is a hallmark of biological systems. It is believed that this constancy is achieved through multiple internal regulation and control processes. Given observations of a system, or even a detailed model of one, it is both valuable and extremely challenging to extract the control objectives of the homeostatic mechanisms. In this work, we develop a robust data-driven method to identify these objectives, namely to understand "what does the system care about?". We propose an algorithm, Identifying Regulation with Adversarial Surrogates (IRAS), that receives an array of temporal measurements of the system and outputs a candidate for the control objective, expressed as a combination of observed variables. IRAS is an iterative algorithm consisting of two competing players. The first player, realized by an artificial deep neural network, aims to minimize a measure of invariance we refer to as the coefficient of regulation. The second player aims to render the task of the first player more difficult by forcing it to extract information about the temporal structure of the data, which is absent from similar "surrogate" data. We test the algorithm on four synthetic datasets and one natural dataset, demonstrating excellent empirical results. Interestingly, our approach can also be used to extract conserved quantities, e.g., energy and momentum, in purely physical systems, as we demonstrate empirically.
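A hedged sketch of the surrogate principle (not the full IRAS algorithm, which trains a neural network adversarially): a good candidate objective is far more invariant on real trajectories than on time-shuffled surrogates. The coefficient-of-variation score below is an illustrative stand-in for the paper's coefficient of regulation, and the toy regulated system is our assumption.

```python
import numpy as np

def invariance_score(values):
    """Coefficient of variation: low when the quantity is held nearly constant."""
    return np.std(values) / (np.abs(np.mean(values)) + 1e-12)

rng = np.random.default_rng(2)
# toy system that regulates the sum x + y around 1 while both variables drift
x = 0.5 + 0.005 * np.cumsum(rng.normal(size=1000))
y = 1.0 - x + rng.normal(scale=0.005, size=1000)

real = invariance_score(x + y)                        # candidate objective: x + y
surrogate = invariance_score(x + rng.permutation(y))  # shuffling breaks the pairing
print(real, surrogate)  # the real score is far lower than the surrogate's
```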