1.
Front Bioinform ; 3: 1191961, 2023.
Article in English | MEDLINE | ID: mdl-37600970

ABSTRACT

Large quantities of biological data can today be acquired to characterize cell types and states, from various sources and using a wide diversity of methods, providing scientists with more and more information to answer challenging biological questions. Unfortunately, working with this amount of data comes at the price of ever-increasing data complexity. This is caused by the multiplication of data types and batch effects, which hinders the joint usage of all available data within common analyses. Data integration describes a set of tasks geared towards embedding several datasets of different origins or modalities into a joint representation that can then be used to carry out downstream analyses. In the last decade, dozens of methods have been proposed to tackle the different facets of the data integration problem, relying on various paradigms. This review introduces the most common data types encountered in computational biology and provides systematic definitions of the data integration problems. We then present how machine learning innovations were leveraged to build effective data integration algorithms that are widely used today by computational biologists. We discuss the current state of data integration and important pitfalls to consider when working with data integration tools. Finally, we detail a set of challenges the field will have to overcome in the coming years.

2.
NAR Genom Bioinform ; 5(3): lqad069, 2023 Sep.
Article in English | MEDLINE | ID: mdl-37448589

ABSTRACT

Data integration of single-cell RNA-seq (scRNA-seq) data describes the task of embedding datasets gathered from different sources or experiments into a common representation so that cells with similar types or states are embedded close to one another independently of their dataset of origin. Data integration is a crucial step in most scRNA-seq data analysis pipelines involving multiple batches. It improves data visualization, batch effect reduction, clustering, label transfer, and cell type inference. Many data integration tools have been proposed during the last decade, but a surge in the number of these methods has made it difficult to pick one for a given use case. Furthermore, these tools are provided as rigid pieces of software, making it hard to adapt them to various specific scenarios. In order to address both of these issues at once, we introduce the transmorph framework. It allows the user to engineer powerful data integration pipelines and is supported by a rich software ecosystem. We demonstrate the usefulness of transmorph by solving a variety of practical challenges on scRNA-seq datasets, including joint dataset embedding, gene space integration, and transfer of cycle phase annotations. transmorph is provided as an open-source Python package.

3.
Entropy (Basel) ; 25(1), 2022 Dec 24.
Article in English | MEDLINE | ID: mdl-36673174

ABSTRACT

Domain adaptation is a popular paradigm in modern machine learning which aims at tackling the problem of divergence (or shift) between the labeled training and validation datasets (source domain) and a potentially large unlabeled dataset (target domain). The task is to embed both datasets into a common space in which the source dataset is informative for training while the divergence between source and target is minimized. The most popular domain adaptation solutions are based on training neural networks that combine classification and adversarial learning modules, frequently making them both data-hungry and difficult to train. We present a method called Domain Adaptation Principal Component Analysis (DAPCA) that identifies a linear reduced data representation useful for solving the domain adaptation task. The DAPCA algorithm introduces positive and negative weights between pairs of data points, and generalizes the supervised extension of principal component analysis. DAPCA is an iterative algorithm that solves a simple quadratic optimization problem at each iteration. The convergence of the algorithm is guaranteed, and the number of iterations is small in practice. We validate the suggested algorithm on previously proposed benchmarks for solving the domain adaptation task. We also show the benefit of using DAPCA in analyzing single-cell omics datasets in biomedical applications. Overall, DAPCA can serve as a practical preprocessing step in many machine learning applications, leading to reduced dataset representations that take into account possible divergence between source and target domains.
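
The idea of putting positive and negative weights on pairs of points can be illustrated with a small 2-D sketch. This is not the DAPCA algorithm itself: it covers only a single quadratic step with fixed weights, and the weighting rule (positive weight between different classes, negative weight `-alpha` within a class), the function names and the toy data are all assumptions made for illustration. The axis is the unit vector v that maximises the weighted sum of squared projected pair differences.

```python
import math
from itertools import combinations

def pair_weighted_axis(points, weight):
    """Unit vector v maximising sum_{i<j} weight(i, j) * (v . (x_i - x_j))**2 (2-D only)."""
    a = b = c = 0.0  # entries of the symmetric 2x2 matrix [[a, b], [b, c]]
    for i, j in combinations(range(len(points)), 2):
        dx = points[i][0] - points[j][0]
        dy = points[i][1] - points[j][1]
        w = weight(i, j)
        a += w * dx * dx
        b += w * dx * dy
        c += w * dy * dy
    # Closed-form direction of the largest eigenvalue of a symmetric 2x2 matrix.
    theta = 0.5 * math.atan2(2.0 * b, a - c)
    return (math.cos(theta), math.sin(theta))

# Hypothetical toy data: two classes separated along x, with large spread along y.
points = [(0, -3), (0, 0), (0, 3), (1, -3), (1, 0), (1, 3)]
labels = [0, 0, 0, 1, 1, 1]

def plain_weight(i, j):
    # All pairs weighted equally: equivalent to ordinary PCA.
    return 1.0

def supervised_weight(i, j, alpha=1.0):
    # Assumed rule: push different classes apart, pull same-class points together.
    return 1.0 if labels[i] != labels[j] else -alpha

plain_axis = pair_weighted_axis(points, plain_weight)
supervised_axis = pair_weighted_axis(points, supervised_weight)
```

With equal weights the axis follows the direction of largest variance (y in this toy data), while the class-aware weights select the direction that separates the two classes (x).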

4.
Front Mol Biosci ; 8: 793912, 2021.
Article in English | MEDLINE | ID: mdl-35178429

ABSTRACT

The cell cycle is a biological process underlying the existence and propagation of life in time and space. It has long been an object of mathematical modeling, with several alternative mechanistic modeling principles suggested, describing the known molecular mechanisms in more or less detail. Recently, the cell cycle has been investigated at the single-cell level in snapshots of unsynchronized cell populations, exploiting new methods for transcriptomic and proteomic molecular profiling. This raises a need for simplified semi-phenomenological cell cycle models, in order to formalize the processes underlying the cell cycle at a higher level of abstraction. Here we suggest a modeling framework recapitulating the most important properties of the cell cycle as a limit trajectory of a dynamical process characterized by several internal states with switches between them. In the simplest form, this leads to a limit cycle trajectory composed of linear segments in logarithmic coordinates describing some extensive (depending on system size) cell properties. We prove a theorem connecting the effective embedding dimensionality of the cell cycle trajectory with the number of its linear segments. We also develop a simplified kinetic model with piecewise-constant kinetic rates describing the dynamics of lumps of genes involved in S-phase and G2/M phases. We show how the developed cell cycle models can be applied to analyze the available single-cell datasets and simulate certain properties of the observed cell cycle trajectories. Based on our model, we can predict with good accuracy the cell line doubling time from the length of the cell cycle trajectory.

5.
Cell Rep ; 30(6): 1767-1779.e6, 2020 02 11.
Article in English | MEDLINE | ID: mdl-32049009

ABSTRACT

EWSR1-FLI1, the chimeric oncogene specific for Ewing sarcoma (EwS), induces a cascade of signaling events leading to cell transformation. However, it remains elusive how genetically homogeneous EwS cells can drive the heterogeneity of transcriptional programs. Here, we combine independent component analysis of single-cell RNA sequencing data from diverse cell types and model systems with time-resolved mapping of EWSR1-FLI1 binding sites and of open chromatin regions to characterize dynamic cellular processes associated with EWSR1-FLI1 activity. We thus define an exquisitely specific and direct enhancer-driven EWSR1-FLI1 program. In EwS tumors, cell proliferation and strong oxidative phosphorylation metabolism are associated with a well-defined range of EWSR1-FLI1 activity. In contrast, a subpopulation of cells with EWSR1-FLI1 activity below or above this intermediate range is characterized by increased hypoxia. Overall, our study reveals sources of intratumoral heterogeneity within EwS tumors.


Subject(s)
Gene Expression Regulation, Neoplastic/genetics , RNA-Binding Protein EWS/metabolism , Sarcoma, Ewing/genetics , Transcription, Genetic/genetics , Cell Line, Tumor , Humans , Signal Transduction
6.
Artif Intell Med ; 99: 101690, 2019 08.
Article in English | MEDLINE | ID: mdl-31606112

ABSTRACT

In order to gain insight into oligogenic disorders, understanding those involving bi-locus variant combinations appears to be key. In prior work, we showed that features at multiple biological scales can already be used to discriminate between two types, i.e., disorders involving true digenic and modifier combinations. The current study expands this machine learning work towards dual molecular diagnosis cases, providing a classifier able to effectively distinguish between these three types. To reach this goal and gain an in-depth understanding of the decision process, game theory and tree decomposition techniques are applied to random forest predictors to investigate the relevance of feature combinations in the prediction. A machine learning model with high discrimination capabilities was developed, effectively differentiating the three classes in a biologically meaningful manner. Combining prediction interpretation and statistical analysis, we propose a biologically meaningful characterization of each class relying on specific feature strengths. Figuring out how biological characteristics shift samples towards one of three classes provides clinically relevant insight into the underlying biological processes as well as the disease itself.
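
The abstract does not say which game-theoretic technique is used. One common choice for attributing a prediction to features is the Shapley value, and the sketch below assumes that choice purely for illustration. It computes exact Shapley values by enumerating feature subsets. The feature names `a`, `b`, `c` and the toy scoring function are hypothetical stand-ins for a trained predictor, not anything taken from the study.

```python
from itertools import combinations
from math import factorial

def shapley_values(players, value):
    """Exact Shapley value of each player for a set function `value(frozenset) -> float`."""
    n = len(players)
    phi = {}
    for p in players:
        others = [q for q in players if q != p]
        total = 0.0
        for k in range(n):  # k = size of the coalition that p joins
            for subset in combinations(others, k):
                s = frozenset(subset)
                # Probability that exactly this coalition precedes p in a random ordering.
                weight = factorial(k) * factorial(n - k - 1) / factorial(n)
                total += weight * (value(s | {p}) - value(s))
        phi[p] = total
    return phi

def toy_score(features):
    # Hypothetical model output: two additive effects plus an interaction between a and c.
    score = 0.0
    if "a" in features:
        score += 1.0
    if "b" in features:
        score += 2.0
    if "a" in features and "c" in features:
        score += 4.0
    return score

phi = shapley_values(["a", "b", "c"], toy_score)
```

In this toy case the interaction term is split evenly between `a` and `c`, and the attributions sum to the score of the full feature set. Exact enumeration grows exponentially with the number of features, so real applications typically rely on approximations or tree-specific algorithms.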


Subject(s)
Game Theory , Genetic Predisposition to Disease/genetics , Machine Learning , Multifactorial Inheritance/genetics , Decision Trees , Humans