Results 1 - 16 of 16
1.
Bioinformatics ; 39(39 Suppl 1): i111-i120, 2023 06 30.
Article in English | MEDLINE | ID: mdl-37387181

ABSTRACT

MOTIVATION: Transcriptomics data are becoming more accessible due to high-throughput and less costly sequencing methods. However, data scarcity prevents exploiting deep learning models' full predictive power for phenotype prediction. Artificially enhancing the training sets, namely data augmentation, is suggested as a regularization strategy. Data augmentation corresponds to label-invariant transformations of the training set (e.g. geometric transformations on images and syntax parsing on text data). Such transformations are, unfortunately, unknown in the transcriptomic field. Therefore, deep generative models such as generative adversarial networks (GANs) have been proposed to generate additional samples. In this article, we analyze GAN-based data augmentation strategies with respect to performance indicators and the classification of cancer phenotypes. RESULTS: This work highlights a significant boost in binary and multiclass classification performance due to augmentation strategies. Without augmentation, training a classifier on only 50 RNA-seq samples yields accuracies of 94% and 70% for binary and tissue classification, respectively. In comparison, we achieved 98% and 94% accuracy when adding 1000 augmented samples. Richer architectures and more expensive training of the GAN return better augmentation performance and generated data quality overall. Further analysis of the generated data shows that several performance indicators are needed to assess its quality correctly. AVAILABILITY AND IMPLEMENTATION: All data used for this research are publicly available and come from The Cancer Genome Atlas. Reproducible code is available on the GitLab repository: https://forge.ibisc.univ-evry.fr/alacan/GANs-for-transcriptomics.


Subject(s)
Gene Expression Profiling , Transcriptome , RNA-Seq , Data Accuracy , Phenotype
2.
Bioinformatics ; 39(39 Suppl 1): i94-i102, 2023 06 30.
Article in English | MEDLINE | ID: mdl-37387182

ABSTRACT

MOTIVATION: The increasing availability of high-throughput omics data allows for considering a new medicine centered on individual patients. Precision medicine relies on exploiting these high-throughput data with machine-learning models, especially the ones based on deep-learning approaches, to improve diagnosis. Due to the high-dimensional small-sample nature of omics data, current deep-learning models end up with many parameters and have to be fitted with a limited training set. Furthermore, interactions between molecular entities inside an omics profile are not patient specific but are the same for all patients. RESULTS: In this article, we propose AttOmics, a new deep-learning architecture based on the self-attention mechanism. First, we decompose each omics profile into a set of groups, where each group contains related features. Then, by applying the self-attention mechanism to the set of groups, we can capture the different interactions specific to a patient. The results of different experiments carried out in this article show that our model can accurately predict the phenotype of a patient with fewer parameters than deep neural networks. Visualizing the attention maps can provide new insights into the essential groups for a particular phenotype. AVAILABILITY AND IMPLEMENTATION: The code and data are available at https://forge.ibisc.univ-evry.fr/abeaude/AttOmics. TCGA data can be downloaded from the Genomic Data Commons Data Portal.


Subject(s)
Machine Learning , Neural Networks, Computer , Phenotype , Precision Medicine
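
The grouping-then-attention idea described in the AttOmics abstract can be illustrated with a minimal sketch. Everything below is an assumption made for illustration: the raw feature values of each group serve directly as its embedding, and queries, keys and values use identity projections, whereas a trained model would learn these. The function names and the toy profile are hypothetical and are not taken from the AttOmics code.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def group_self_attention(groups):
    """Scaled dot-product self-attention over equal-length group embeddings.

    Queries, keys and values are the embeddings themselves (identity
    projections), a simplification made for this sketch.
    Returns (attended embeddings, attention matrix).
    """
    d = len(groups[0])
    attention = []
    for q in groups:
        # similarity of this group to every group, scaled by sqrt(d)
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in groups]
        attention.append(softmax(scores))
    attended = [
        [sum(w * v[j] for w, v in zip(row, groups)) for j in range(d)]
        for row in attention
    ]
    return attended, attention

# Hypothetical omics profile: 6 features split into 3 groups of 2.
profile = [0.5, 1.0, -0.3, 0.8, 2.0, 0.1]
groups = [profile[i:i + 2] for i in range(0, len(profile), 2)]
attended, attention = group_self_attention(groups)
```

Each row of `attention` is a probability distribution over groups, which is the kind of per-patient map the abstract suggests visualizing.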
3.
Bioinformatics ; 38(9): 2504-2511, 2022 04 28.
Article in English | MEDLINE | ID: mdl-35266505

ABSTRACT

MOTIVATION: Medical care is becoming more and more specific to patients' needs due to the increased availability of omics data. The application to these data of sophisticated machine learning models, in particular deep learning (DL), can improve the field of precision medicine. However, their use in clinics is limited as their predictions are not accompanied by an explanation. The production of accurate and intelligible predictions can benefit from the inclusion of domain knowledge. Therefore, knowledge-based DL models appear to be a promising solution. RESULTS: In this article, we propose GraphGONet, where the Gene Ontology is encapsulated in the hidden layers of a new self-explaining neural network. Each neuron in the layers represents a biological concept, combining the gene expression profile of a patient and the information from its neighboring neurons. The experiments described in the article confirm that our model not only performs as accurately as the state-of-the-art (non-explainable ones) but also automatically produces stable and intelligible explanations composed of the biological concepts with the highest contribution. This feature allows experts to use our tool in a medical setting. AVAILABILITY AND IMPLEMENTATION: GraphGONet is freely available at https://forge.ibisc.univ-evry.fr/vbourgeais/GraphGONet.git. The microarray dataset is accessible from the ArrayExpress database under the identifier E-MTAB-3732. The TCGA datasets can be downloaded from the Genomic Data Commons (GDC) data portal. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.


Subject(s)
Machine Learning , Neural Networks, Computer , Gene Ontology , Phenotype , Gene Expression
4.
BMC Bioinformatics ; 23(1): 262, 2022 Jul 03.
Article in English | MEDLINE | ID: mdl-35786378

ABSTRACT

BACKGROUND: Machine learning is now a standard tool for cancer prediction based on gene expression data. However, deep learning is still new for this task, and there is no clear consensus about its performance and utility. Few experimental works have evaluated deep neural networks and compared them with state-of-the-art machine learning. Moreover, their conclusions are not consistent. RESULTS: We extensively evaluate the deep learning approach on 22 cancer prediction tasks based on gene expression data. We measure the impact of the main hyper-parameters and compare the performance of neural networks with the state of the art. We also investigate the effectiveness of several transfer learning schemes in different experimental setups. CONCLUSION: Based on our experiments, we provide several recommendations to optimize the construction and training of a neural network model. We show that neural networks outperform the state-of-the-art methods only for very large training sets. For a small training set, we show that transfer learning is possible and may strongly improve the model performance in some cases.


Subject(s)
Deep Learning , Neoplasms , Gene Expression , Humans , Machine Learning , Neoplasms/genetics , Neural Networks, Computer
5.
BMC Bioinformatics ; 22(Suppl 10): 455, 2021 Sep 22.
Article in English | MEDLINE | ID: mdl-34551707

ABSTRACT

BACKGROUND: With the rapid advancement of genomic sequencing techniques, massive production of gene expression data is becoming possible, which prompts the development of precision medicine. Deep learning is a promising approach for phenotype prediction (clinical diagnosis, prognosis, and drug response) based on gene expression profiles. Existing deep learning models are usually considered black boxes that provide accurate predictions but are not interpretable. However, accuracy and interpretation are both essential for precision medicine. In addition, most models do not integrate the knowledge of the domain. Hence, making deep learning models interpretable for medical applications using prior biological knowledge is the main focus of this paper. RESULTS: In this paper, we propose a new self-explainable deep learning model, called Deep GONet, integrating the Gene Ontology into the hierarchical architecture of the neural network. This model is based on a fully-connected architecture constrained by the Gene Ontology annotations, such that each neuron represents a biological function. The experiments on cancer diagnosis datasets demonstrate that Deep GONet is both easily interpretable and performs well at discriminating cancer from non-cancer samples. CONCLUSIONS: Our model provides an explanation of its predictions by identifying the most important neurons and associating them with biological functions, making the model understandable for biologists and physicians.


Subject(s)
Neoplasms , Neural Networks, Computer , Gene Expression , Gene Ontology , Humans , Phenotype
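
To make the idea of an architecture "constrained by the Gene Ontology annotations" concrete, here is a minimal sketch of one way such a constraint could be imposed: a hard binary mask that zeroes every connection not supported by an annotation. This is an assumption for illustration only; the abstract does not say how Deep GONet enforces the constraint, and the function name, weights and mask below are hypothetical.

```python
def masked_layer(x, weights, mask):
    """One layer whose connections are restricted by a binary annotation mask.

    x:             input activations (e.g. gene expression values)
    weights[j][i]: weight from input i to neuron j
    mask[j][i]:    1 if input i is annotated to the term that neuron j
                   represents, else 0 (hypothetical annotation matrix)
    Returns ReLU activations, one per neuron.
    """
    out = []
    for w_row, m_row in zip(weights, mask):
        z = sum(w * m * xi for w, m, xi in zip(w_row, m_row, x))
        out.append(max(0.0, z))
    return out

expression = [1.0, 2.0, 3.0]                # three hypothetical genes
weights = [[0.5, 0.5, 0.5], [1.0, -1.0, 1.0]]
mask = [[1, 1, 0], [0, 0, 1]]               # neuron 0 sees genes 0-1, neuron 1 sees gene 2
activations = masked_layer(expression, weights, mask)
```

Because each neuron only receives inputs annotated to its term, a large activation can be read as evidence for that biological function, which is the interpretability argument the abstract makes.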
6.
BMC Bioinformatics ; 21(1): 501, 2020 Nov 04.
Article in English | MEDLINE | ID: mdl-33148191

ABSTRACT

BACKGROUND: The use of predictive gene signatures to assist clinical decisions is becoming more and more important. Deep learning has a huge potential in the prediction of phenotype from gene expression profiles. However, neural networks are viewed as black boxes, where accurate predictions are provided without any explanation. The requirements for these models to become interpretable are increasing, especially in the medical field. RESULTS: We focus on explaining the predictions of a deep neural network model built from gene expression data. The most important neurons and genes influencing the predictions are identified and linked to biological knowledge. Our experiments on cancer prediction show that: (1) the deep learning approach outperforms classical machine learning methods on large training sets; (2) our approach produces interpretations more coherent with biology than state-of-the-art approaches; (3) we can provide a comprehensive explanation of the predictions for biologists and physicians. CONCLUSION: We propose an original approach for biological interpretation of deep learning models for phenotype prediction from gene expression data. Since the model can find relationships between the phenotype and gene expression, we may assume that there is a link between the identified genes and the phenotype. The interpretation can, therefore, lead to new biological hypotheses to be investigated by biologists.


Subject(s)
Gene Expression Regulation, Neoplastic , Neoplasms/pathology , Neural Networks, Computer , Databases, Genetic , Humans , Neoplasms/metabolism
7.
PLoS One ; 18(5): e0286137, 2023.
Article in English | MEDLINE | ID: mdl-37228138

ABSTRACT

In the sea of data generated daily, unlabeled samples greatly outnumber labeled ones. This is due to the fact that, in many application areas, labels are scarce or hard to obtain. In addition, unlabeled samples might belong to new classes that are not available in the label set associated with data. In this context, we propose A3SOM, an abstained explainable semi-supervised neural network that associates a self-organizing map to dense layers in order to classify samples. Abstained classification enables the detection of new classes and class overlaps. The use of a self-organizing map in A3SOM allows integrated visualization and makes the model explainable. Along with describing our approach, this paper shows that the method is competitive with other classifiers and demonstrates the benefits of including abstention rules. A use case is presented on breast cancer subtype classification and discovery to show the relevance of our method in real-world medical problems.


Subject(s)
Algorithms , Neural Networks, Computer , Supervised Machine Learning
8.
Bioinformatics ; 26(6): 822-30, 2010 Mar 15.
Article in English | MEDLINE | ID: mdl-20130029

ABSTRACT

MOTIVATION: The receiver operator characteristic (ROC) curves are commonly used in biomedical applications to judge the performance of a discriminant across varying decision thresholds. The estimated ROC curve depends on the true positive rate (TPR) and false positive rate (FPR), with the key metric being the area under the curve (AUC). With small samples these rates need to be estimated from the training data, so a natural question arises: How well do the estimates of the AUC, TPR and FPR compare with the true metrics? RESULTS: Through a simulation study using data models and analysis of real microarray data, we show that (i) for small samples the root mean square differences of the estimated and true metrics are considerable; (ii) even for large samples, there is only weak correlation between the true and estimated metrics; and (iii) generally, there is weak regression of the true metric on the estimated metric. For classification rules, we consider linear discriminant analysis, linear support vector machine (SVM) and radial basis function SVM. For error estimation, we consider resubstitution, three kinds of cross-validation and bootstrap. Using resampling, we show the unreliability of some published ROC results. AVAILABILITY: Companion web site at http://compbio.tgen.org/paper_supp/ROC/roc.html CONTACT: edward@mail.ece.tamu.edu.


Subject(s)
Algorithms , Oligonucleotide Array Sequence Analysis , False Positive Reactions , Pattern Recognition, Automated/methods , ROC Curve
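
The quantities named in the abstract above (TPR, FPR and AUC) can be estimated from a set of scores and labels as in the following sketch. It is a generic illustration, not the authors' code: the function names and the toy data are assumptions, and the AUC is computed with the standard pairwise formulation (the probability that a random positive outscores a random negative).

```python
def rates_at_threshold(scores, labels, threshold):
    """Return (TPR, FPR) when predicting positive for score >= threshold."""
    tp = sum(1 for s, y in zip(scores, labels) if y == 1 and s >= threshold)
    fp = sum(1 for s, y in zip(scores, labels) if y == 0 and s >= threshold)
    pos = sum(labels)
    neg = len(labels) - pos
    return tp / pos, fp / neg

def auc(scores, labels):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Hypothetical small sample: 2 positives, 3 negatives.
scores = [0.9, 0.8, 0.4, 0.3, 0.2]
labels = [1, 0, 1, 0, 0]
tpr, fpr = rates_at_threshold(scores, labels, 0.5)
area = auc(scores, labels)
```

With only five samples, moving a single score changes the estimated AUC by a sixth, which gives a feel for why the abstract questions small-sample ROC estimates.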
9.
Gigascience ; 9(3), 2020 03 01.
Article in English | MEDLINE | ID: mdl-32150601

ABSTRACT

BACKGROUND: Microbiome biomarker discovery for patient diagnosis, prognosis, and risk evaluation is attracting broad interest. Selected groups of microbial features provide signatures that characterize host disease states such as cancer or cardio-metabolic diseases. Yet, the current predictive models stemming from machine learning still behave as black boxes and seldom generalize well. Their interpretation is challenging for physicians and biologists, which makes them difficult to trust and use routinely in the physician-patient decision-making process. Novel methods that provide interpretability and biological insight are needed. Here, we introduce "predomics", an original machine learning approach inspired by microbial ecosystem interactions that is tailored for metagenomics data. It discovers accurate predictive signatures and provides unprecedented interpretability. The decision provided by the predictive model is based on a simple, yet powerful score computed by adding, subtracting, or dividing cumulative abundance of microbiome measurements. RESULTS: Tested on >100 datasets, we demonstrate that predomics models are simple and highly interpretable. Even with such simplicity, they are at least as accurate as state-of-the-art methods. The family of best models, discovered during the learning process, offers the ability to distil biological information and to decipher the predictability signatures of the studied condition. In a proof-of-concept experiment, we successfully predicted body corpulence and metabolic improvement after bariatric surgery using pre-surgery microbiome data. CONCLUSIONS: Predomics is a new algorithm that helps in providing reliable and trustworthy diagnostic decisions in the microbiome field. Predomics is in accord with societal and legal requirements that plead for an explainable artificial intelligence approach in the medical field.


Subject(s)
Gastrointestinal Microbiome/genetics , Metagenome , Metagenomics/methods , Humans , Models, Genetic , Support Vector Machine
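
The abstract describes a score "computed by adding, subtracting, or dividing cumulative abundance" of microbiome measurements. The sketch below shows one plausible reading of that sentence and is not the predomics implementation: the function names, the feature names (`sp_a` and so on), the two modes and the threshold rule are all assumptions made for illustration.

```python
def signature_score(abundance, plus, minus=(), mode="difference"):
    """Combine cumulative abundances of two feature sets into one score.

    abundance: dict mapping feature name -> relative abundance
    plus/minus: feature names whose abundances are summed on each side
    mode: "difference" (sum(plus) - sum(minus)) or "ratio" (sum(plus) / sum(minus))
    """
    a = sum(abundance[f] for f in plus)
    b = sum(abundance[f] for f in minus)
    if mode == "difference":
        return a - b
    if mode == "ratio":
        return a / b
    raise ValueError(f"unknown mode: {mode}")

def predict(abundance, plus, minus, threshold, mode="difference"):
    """Return 1 when the signature score exceeds the threshold, else 0."""
    return int(signature_score(abundance, plus, minus, mode) > threshold)

# Hypothetical sample with four microbial features.
sample = {"sp_a": 0.30, "sp_b": 0.10, "sp_c": 0.25, "sp_d": 0.05}
diff = signature_score(sample, plus=("sp_a", "sp_b"), minus=("sp_c",))
ratio = signature_score(sample, plus=("sp_a", "sp_b"), minus=("sp_c",), mode="ratio")
label = predict(sample, ("sp_a", "sp_b"), ("sp_c",), threshold=0.1)
```

A model of this form can be read directly: the prediction depends only on which features sit on each side and on one threshold.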
10.
Bioinformatics ; 24(17): 1889-95, 2008 Sep 01.
Article in English | MEDLINE | ID: mdl-18621758

ABSTRACT

MOTIVATION: The classification methods typically used in bioinformatics classify all examples, even if the classification is ambiguous, for instance, when the example is close to the separating hyperplane in linear classification. For medical applications, it may be better to classify an example only when there is a sufficiently high degree of accuracy, rather than classify all examples with decent accuracy. Moreover, when all examples are classified, the classification rule has no control over the accuracy of the classifier; the algorithm just aims to produce a classifier with the smallest error rate possible. In our approach, we fix the accuracy of the classifier and thereby choose a desired risk of error. RESULTS: Our method consists of defining a rejection region in the feature space. This region contains the examples for which classification is ambiguous. These are rejected by the classifier. The accuracy of the classifier becomes a user-defined parameter of the classification rule. The task of the classification rule is to minimize the rejection region with the constraint that the error rate of the classifier be bounded by the chosen target error. This approach is also used in the feature-selection step. The results computed on both synthetic and real data show that classifier accuracy is significantly improved. AVAILABILITY: Companion Website. http://gsp.tamu.edu/Publications/rejectoption/


Subject(s)
Algorithms , Artifacts , Artificial Intelligence , Gene Expression Profiling/methods , Oligonucleotide Array Sequence Analysis/methods , Pattern Recognition, Automated/methods , Reproducibility of Results , Sensitivity and Specificity
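
The rejection region described in the abstract above can be pictured, for a linear classifier, as a band around the separating hyperplane. The sketch below assumes a fixed symmetric band of half-width `margin`; the paper instead searches for the smallest rejection region that keeps the error below a target, which this illustration does not attempt. All names and data are hypothetical.

```python
import math

def classify_with_reject(x, w, b, margin):
    """Return +1 or -1, or None when x lies within `margin` of the hyperplane w.x + b = 0."""
    norm = math.sqrt(sum(wi * wi for wi in w))
    distance = (sum(wi * xi for wi, xi in zip(w, x)) + b) / norm
    if abs(distance) < margin:
        return None  # ambiguous example: rejected
    return 1 if distance > 0 else -1

def error_and_rejection_rates(samples, labels, w, b, margin):
    """Error rate among accepted examples and fraction of rejected examples."""
    preds = [classify_with_reject(x, w, b, margin) for x in samples]
    accepted = [(p, y) for p, y in zip(preds, labels) if p is not None]
    rejection_rate = 1 - len(accepted) / len(samples)
    errors = sum(1 for p, y in accepted if p != y)
    error_rate = errors / len(accepted) if accepted else 0.0
    return error_rate, rejection_rate

# Hypothetical 2-D data and hyperplane x1 = 0.
samples = [[2, 0], [-2, 0], [0.1, 5], [-1, 1]]
labels = [1, -1, -1, 1]
err, rej = error_and_rejection_rates(samples, labels, w=[1, 0], b=0, margin=0.5)
```

In this setup, widening `margin` trades a higher rejection rate for a lower error among accepted examples; choosing the smallest margin that meets a target error corresponds to the constraint stated in the abstract.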
11.
Bioinformatics ; 23(21): 2866-72, 2007 Nov 01.
Article in English | MEDLINE | ID: mdl-17925306

ABSTRACT

MOTIVATION: Microarray experiments that allow simultaneous expression profiling of thousands of genes in various conditions (tissues, cells or time) generate data whose analysis raises difficult problems. In particular, there is a vast disproportion between the number of attributes (tens of thousands) and the number of examples (several tens). Dimension reduction is therefore a key step before applying classification approaches. Many methods have been proposed to this purpose, but only a few of them considered a direct quantification of transcriptional interactions. We describe and experimentally validate a new dimension reduction and feature construction method, which assesses interactions between expression profiles to improve microarray-based classification accuracy. RESULTS: Our approach relies on a mutual information measure that exposes some elementary constituents of the information contained in a pair of gene expression profiles. We show that their analysis implies a term that represents the information of the interaction between the two genes. The principle of our method, called FeatKNN, is to exploit the information provided by highly synergic gene pairs to improve classification accuracy. First, a heuristic search selects the most informative gene pairs. Then, for each selected pair, a new feature, representing the classification margin of a KNN classifier in the gene pairs space, is constructed. We show experimentally that the interactional information has a degree of significance comparable to that of the gene expression profiles considered separately. Our method has been tested with different classifiers and yielded significant improvements in accuracy on several public microarray databases. Moreover, a synthetic assessment of the biological significance of the concept of synergic gene pairs suggested its ability to uncover relevant mechanisms underlying interactions among various cellular processes.


Subject(s)
Algorithms , Artificial Intelligence , Cluster Analysis , Data Interpretation, Statistical , Gene Expression Profiling/methods , Multigene Family/physiology , Oligonucleotide Array Sequence Analysis/methods , Pattern Recognition, Automated/methods
12.
J Clin Endocrinol Metab ; 89(5): 2000-14, 2004 May.
Article in English | MEDLINE | ID: mdl-15126512

ABSTRACT

The stress hormone epinephrine produces major physiological effects on skeletal muscle. Here we determined skeletal muscle mRNA expression profiles before and during a 6-h epinephrine infusion performed in nine young men. Stringent statistical analysis of data obtained using 43000 cDNA element microarrays showed that 1206 and 474 genes were up- and down-regulated, respectively. Microarray data were validated using reverse transcription quantitative PCR. Gene classification was performed through data mining of Gene Ontology annotations, cluster analysis of regulated genes among 14 human tissues, and correlation analysis of mRNA and clinical parameter variations. Evidence of an autoregulatory control was provided by the regulation of key genes of the cAMP-dependent transcription pathway. Genes with known functional cAMP response elements were regulated by the hormone. The impact on metabolism was illustrated by coordinated regulations of genes involved in carbohydrate and protein metabolisms. Epinephrine had a profound effect on genes involved in immunity and inflammatory response, a previously unappreciated aspect of catecholamine action. Information on 526 mRNAs corresponded to genes of unknown function. These data define the molecular signatures of epinephrine action in human skeletal muscle. They may contribute to the understanding of skeletal muscle alterations observed in pathological conditions characterized by sympathetic nervous system overdrive.


Subject(s)
Adrenergic Agonists/metabolism , Epinephrine/physiology , Muscle, Skeletal/physiology , Adrenergic Agonists/pharmacology , Adult , Cardiovascular Physiological Phenomena , Cyclic AMP/metabolism , Cyclic AMP Response Element-Binding Protein/physiology , Energy Metabolism/physiology , Epinephrine/pharmacology , Gene Expression Regulation/drug effects , Humans , Male , Metabolism/physiology , Oligonucleotide Array Sequence Analysis , Proteins/metabolism , RNA, Messenger/metabolism , Signal Transduction , Stress, Physiological/genetics , Up-Regulation
13.
Article in English | MEDLINE | ID: mdl-22291161

ABSTRACT

One of the major aims of many microarray experiments is to build discriminatory diagnosis and prognosis models. A large number of supervised methods have been proposed in the literature for microarray-based classification. Model evaluation and comparison is a critical issue and, most of the time, is based on the classification cost. This classification cost depends on the costs of false positives and false negatives, which are generally unknown in diagnostic problems. This uncertainty may strongly affect the evaluation and comparison of classifiers. We propose a new measure of classifier performance that takes account of the uncertainty of the error. We represent the available knowledge about the costs by a distribution function defined on the ratio of the costs. The performance of a classifier is therefore computed over the set of all possible costs weighted by their probability distribution. Our method is tested on both artificial and real microarray data sets. We show that the performance of classifiers depends strongly on the ratio of the classification costs. In many cases, the best classifier can be identified by our new measure whereas the classic error measures fail.


Subject(s)
Gene Expression , Pattern Recognition, Automated/methods , Gene Expression Profiling/methods , Models, Theoretical , Oligonucleotide Array Sequence Analysis
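
The idea of averaging classifier cost over a distribution on the cost ratio can be sketched as follows. This is an illustration under stated assumptions, not the measure defined in the paper: the cost of a false positive is fixed at 1, the ratio `r` is taken as cost(false negative) / cost(false positive), the distribution is discrete, and `fp_rate` and `fn_rate` are fractions of all samples. All names and numbers are hypothetical.

```python
def expected_cost(fp_rate, fn_rate, ratio_distribution):
    """Expected misclassification cost under an uncertain cost ratio.

    fp_rate, fn_rate: fractions of all samples that are false positives / false negatives
    ratio_distribution: list of (r, probability) pairs, with
        r = cost(false negative) / cost(false positive) and cost(false positive) = 1
    """
    return sum(p * (fp_rate + r * fn_rate) for r, p in ratio_distribution)

# Hypothetical belief: a false negative is 1x, 5x or 10x as costly as a false positive.
belief = [(1, 0.2), (5, 0.5), (10, 0.3)]

# Two hypothetical classifiers with the same overall error rate (0.30).
cost_a = expected_cost(fp_rate=0.10, fn_rate=0.20, ratio_distribution=belief)
cost_b = expected_cost(fp_rate=0.20, fn_rate=0.10, ratio_distribution=belief)
```

The two classifiers tie on plain error rate, yet classifier B has the lower expected cost here because it makes fewer of the errors believed to be more expensive. This is the kind of case in which, according to the abstract, classic error measures fail to identify the best classifier.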
14.
Int J Bioinform Res Appl ; 6(6): 628-42, 2010.
Article in English | MEDLINE | ID: mdl-21354968

ABSTRACT

Microarray experiments can be used for simultaneous expression profiling of thousands of genes in various conditions. Data from these experiments are used to identify the genes involved in a particular biological phenomenon. Most current methods for such analysis assume that genes are independent. We explored the interaction between genes to identify informative gene pairs. This was based on measuring the interaction information using information theory. We show that there are two kinds of gene interaction, redundancy and synergy. We analysed these interactions to construct a network of redundancy and conducted a functional analysis of synergic components on two public datasets.


Subject(s)
Algorithms , Computational Biology/methods , Gene Expression Profiling/methods , Oligonucleotide Array Sequence Analysis/methods , Genes , Pattern Recognition, Automated/methods
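
The distinction between synergy and redundancy drawn in the abstract above can be illustrated with the interaction information of two genes and a class label. The sketch assumes that expression values have already been discretized and uses the sign convention I(G1,G2;C) - I(G1;C) - I(G2;C), so positive values indicate synergy and negative values redundancy. The paper's exact estimator and convention may differ, and the function names and toy data are hypothetical.

```python
import math
from collections import Counter

def entropy(*columns):
    """Shannon entropy in bits of the joint distribution of one or more discrete columns."""
    joint = Counter(zip(*columns))
    n = sum(joint.values())
    return -sum(c / n * math.log2(c / n) for c in joint.values())

def mutual_information(xs, ys):
    """I(X;Y) = H(X) + H(Y) - H(X,Y) for two discrete columns."""
    return entropy(xs) + entropy(ys) - entropy(xs, ys)

def interaction_information(g1, g2, label):
    """I(G1,G2;C) - I(G1;C) - I(G2;C): > 0 means synergy, < 0 means redundancy."""
    pair = list(zip(g1, g2))
    return (mutual_information(pair, label)
            - mutual_information(g1, label)
            - mutual_information(g2, label))

# Synergy: the label is the XOR of the two genes, so neither gene alone is informative.
synergy = interaction_information([0, 0, 1, 1], [0, 1, 0, 1], [0, 1, 1, 0])

# Redundancy: both genes are copies of the label, so the pair adds nothing new.
redundancy = interaction_information([0, 0, 1, 1], [0, 0, 1, 1], [0, 0, 1, 1])
```

In the XOR case the pair carries one full bit about the label while each gene alone carries none, which is why a pair-based analysis can find signal that a gene-by-gene ranking would miss.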
15.
Am J Clin Nutr ; 89(1): 51-7, 2009 Jan.
Article in English | MEDLINE | ID: mdl-19056587

ABSTRACT

BACKGROUND: Adipose tissue gene expression analysis in humans now provides a tremendous means to discover the physiopathologic gene targets critical for our understanding and treatment of obesity. Clinical studies are emerging in which adipose gene expression has been examined in hundreds of subjects, and it will be fundamentally important that these studies can be compared so that a common consensus can be reached and new therapeutic targets for obesity proposed. OBJECTIVE: We studied the effect of the biopsy sampling methods (needle-aspirated and surgical) used in clinical investigation programs on the functional interpretation of adipose tissue gene expression profiles. DESIGN: A comparative microarray analysis of the different subcutaneous adipose tissue sampling methods was performed in age-matched lean (n = 19) and obese (n = 18) female subjects. Appropriate statistical (principal components analysis) and bioinformatic (FunNet) functional enrichment software were used to evaluate data. The morphology of adipose tissue samples obtained by needle-aspiration and surgical methods was examined by immunohistochemistry. RESULTS: Biopsy techniques influence the gene expression underlying the biological themes currently discussed in obesity (eg, inflammation, extracellular matrix, and metabolism). Immunohistochemistry experiments showed that the easier to obtain needle-aspirated biopsies poorly aspirate the fibrotic fraction of subcutaneous adipose tissue, resulting in an underrepresentation of the stroma-vascular fraction. CONCLUSIONS: The adipose tissue biopsy technique is an important caveat to consider when designing, interpreting, and, most important, comparing microarray experiments. These results will have crucial implications for the clinical and physiopathologic understanding of human obesity and therapeutic approaches.


Subject(s)
Biopsy/methods , Gene Expression Profiling/methods , Obesity/genetics , Obesity/pathology , Oligonucleotide Array Sequence Analysis , Adult , Biopsy/instrumentation , Biopsy, Needle/instrumentation , Biopsy, Needle/methods , Case-Control Studies , Female , Humans , Immunohistochemistry , Middle Aged , Principal Component Analysis , Thinness/genetics , Thinness/pathology
16.
Article in English | MEDLINE | ID: mdl-18288255

ABSTRACT

The aim of many microarray experiments is to build discriminatory diagnosis and prognosis models. Given the huge number of features and the small number of examples, model validity which refers to the precision of error estimation is a critical issue. Previous studies have addressed this issue via the deviation distribution (estimated error minus true error), in particular, the deterioration of cross-validation precision in high-dimensional settings where feature selection is used to mitigate the peaking phenomenon (overfitting). Because classifier design is based upon random samples, both the true and estimated errors are sample-dependent random variables, and one would expect a loss of precision if the estimated and true errors are not well correlated, so that natural questions arise as to the degree of correlation and the manner in which lack of correlation impacts error estimation. We demonstrate the effect of correlation on error precision via a decomposition of the variance of the deviation distribution, observe that the correlation is often severely decreased in high-dimensional settings, and show that the effect of high dimensionality on error estimation tends to result more from its decorrelating effects than from its impact on the variance of the estimated error. We consider the correlation between the true and estimated errors under different experimental conditions using both synthetic and real data, several feature-selection methods, different classification rules, and three error estimators commonly used (leave-one-out cross-validation, k-fold cross-validation, and .632 bootstrap). Moreover, three scenarios are considered: (1) feature selection, (2) known-feature set, and (3) all features. Only the first is of practical interest; however, the other two are needed for comparison purposes. 
We will observe that the true and estimated errors tend to be much more correlated in the case of a known feature set than with either feature selection or using all features, with the better correlation between the latter two showing no general trend, but differing for different models.
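
The variance decomposition of the deviation distribution mentioned in this abstract follows from the standard identity for the variance of a difference. The notation below is an assumption and may not match the authors': \(\varepsilon\) is the true error, \(\hat{\varepsilon}\) the estimated error, \(\sigma\) a standard deviation, and \(\rho\) the correlation between the two errors.

```latex
\[
\operatorname{Var}(\hat{\varepsilon} - \varepsilon)
  = \sigma_{\hat{\varepsilon}}^{2} + \sigma_{\varepsilon}^{2}
  - 2\,\rho\,\sigma_{\hat{\varepsilon}}\,\sigma_{\varepsilon}
\]
```

As \(\rho\) falls toward zero the last term vanishes, so the deviation variance grows toward \(\sigma_{\hat{\varepsilon}}^{2} + \sigma_{\varepsilon}^{2}\) even if the variance of the estimator itself is unchanged. This matches the abstract's claim that high dimensionality harms error estimation mainly through decorrelation.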
