ABSTRACT
The majority of gene expression studies focus on the search for genes whose mean expression is different between two or more populations of samples in the so-called "differential expression analysis" approach. However, a difference in variance in gene expression may also be biologically and physiologically relevant. In the classical statistical model used to analyze RNA-sequencing (RNA-seq) data, the dispersion, which defines the variance, is only considered as a parameter to be estimated prior to identifying a difference in mean expression between conditions of interest. Here, we propose to evaluate four recently published methods, which detect differences in both the mean and dispersion in RNA-seq data. We thoroughly investigated the performance of these methods on simulated datasets and characterized parameter settings to reliably detect genes with a differential expression dispersion. We applied these methods to The Cancer Genome Atlas datasets. Interestingly, among the genes with an increased expression dispersion in tumors and without a change in mean expression, we identified some key cellular functions, most of which were related to catabolism and were overrepresented in most of the analyzed cancers. In particular, our results highlight autophagy, whose role in cancerogenesis is context-dependent, illustrating the potential of the differential dispersion approach to gain new insights into biological processes and to discover new biomarkers.
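For reference, the negative binomial parameterization commonly used for RNA-seq counts ties the variance to the mean through the dispersion. The symbols below (count $Y_{gi}$ of gene $g$ in sample $i$, condition $c$, mean $\mu_{gc}$, dispersion $\phi_{gc}$) are notation assumed here for illustration; they are not taken from the abstract, and normalization factors are omitted.

```latex
Y_{gi} \sim \mathrm{NB}(\mu_{gc}, \phi_{gc}), \qquad
\operatorname{Var}(Y_{gi}) = \mu_{gc} + \phi_{gc}\,\mu_{gc}^{2}
% differential expression:  H_0 : \mu_{g1} = \mu_{g2}
% differential dispersion:  H_0 : \phi_{g1} = \phi_{g2}
```

Under this notation, a gene can satisfy the first null hypothesis while violating the second, which is the case of increased dispersion without a change in mean expression discussed above.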
Subject(s)
Statistical Models , Neoplasms , Humans , RNA Sequence Analysis/methods , RNA/genetics , Autophagy/genetics , Neoplasms/genetics , Gene Expression Profiling/methods
ABSTRACT
Drug-target interaction (DTI) prediction algorithms are used at various stages of the drug discovery process. In this context, specific problems such as deorphanization of a new therapeutic target or target identification of a drug candidate arising from phenotypic screens require large-scale predictions across the protein and molecule spaces. DTI prediction heavily relies on supervised learning algorithms that use known DTIs to learn associations between molecule and protein features, allowing for the prediction of new interactions based on learned patterns. The algorithms must be broadly applicable to enable reliable predictions, even in regions of the protein or molecule spaces where data may be scarce. In this paper, we address two key challenges to fulfill these goals: building large, high-quality training datasets and designing prediction methods that can scale, in order to be trained on such large datasets. First, we introduce LCIdb, a large curated dataset of DTIs, offering extensive coverage of both the molecule and druggable protein spaces. Notably, LCIdb contains a much higher number of molecules than publicly available benchmarks, expanding coverage of the molecule space. Second, we propose Komet (Kronecker Optimized METhod), a DTI prediction pipeline designed for scalability without compromising performance. Komet leverages a three-step framework, incorporating efficient computation choices tailored for large datasets and involving the Nyström approximation. Specifically, Komet employs a Kronecker interaction module for (molecule, protein) pairs, which efficiently captures determinants in DTIs, and whose structure allows for reduced computational complexity and quasi-Newton optimization, ensuring that the model can handle large training sets. Our method is implemented in open-source software, leveraging GPU parallel computation for efficiency. 
We demonstrate the interest of our pipeline on various datasets, showing that Komet displays superior scalability and prediction performance compared to state-of-the-art deep learning approaches. Additionally, we illustrate the generalization properties of Komet by showing its performance on an external dataset, and on the publicly available LH benchmark designed for scaffold hopping problems. Komet is available open source at https://komet.readthedocs.io and all datasets, including LCIdb, can be found at https://zenodo.org/records/10731712.
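The computational benefit of a Kronecker interaction module can be illustrated with a small identity: the inner product of two Kronecker-product pair features equals the product of the molecule-level and protein-level inner products, so pair features never need to be built explicitly. The sketch below is a toy check of that identity in plain Python; the vectors and helper names are hypothetical and this is not Komet's implementation.

```python
from itertools import product


def kron(u, v):
    """Kronecker product of two vectors, returned as a flat list."""
    return [a * b for a, b in product(u, v)]


def dot(u, v):
    """Plain inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


# Hypothetical toy embeddings for two (molecule, protein) pairs.
mol_a, prot_a = [1, 2], [3, 0, 1]
mol_b, prot_b = [2, 1], [1, 1, 2]

# Explicit route: build the pair features, then take their inner product.
pair_a = kron(mol_a, prot_a)
pair_b = kron(mol_b, prot_b)
explicit = dot(pair_a, pair_b)

# Factored route: multiply the molecule and protein inner products.
factored = dot(mol_a, mol_b) * dot(prot_a, prot_b)

print(explicit, factored)
```

The explicit route costs time proportional to the product of the two embedding dimensions, while the factored route costs time proportional to their sum, which is the kind of saving that makes training on large sets of pairs tractable.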
Subject(s)
Algorithms , Drug Discovery , Proteins , Drug Discovery/methods , Proteins/chemistry , Proteins/metabolism , Pharmaceutical Preparations/chemistry , Pharmaceutical Preparations/metabolism
ABSTRACT
BACKGROUND: Variability in datasets is not only the product of biological processes; it is also the product of technical biases. ComBat and ComBat-Seq are among the most widely used tools for correcting those technical biases, called batch effects, in microarray and RNA-Seq expression data, respectively. RESULTS: In this technical note, we present a new Python implementation of ComBat and ComBat-Seq. While the mathematical framework is strictly the same, we show here that our implementations: (i) have similar results in terms of batch effect correction; (ii) are as fast as or faster than the original implementations in R; and (iii) offer new tools for the bioinformatics community to participate in its development. pyComBat is implemented in the Python language and is distributed under the GPL-3.0 license ( https://www.gnu.org/licenses/gpl-3.0.en.html ) as a module of the inmoose package. Source code is available at https://github.com/epigenelabs/inmoose and the Python package at https://pypi.org/project/inmoose . CONCLUSIONS: We present a new Python implementation of the state-of-the-art tools ComBat and ComBat-Seq for the correction of batch effects in microarray and RNA-Seq data. This new implementation, based on the same mathematical frameworks as ComBat and ComBat-Seq, offers similar power for batch effect correction at reduced computational cost.
Subject(s)
Computational Biology , Software , Bayes Theorem , Computational Biology/methods , RNA-Seq
ABSTRACT
Genome-wide association studies (GWAS) explore the genetic causes of complex diseases. However, classical approaches ignore the biological context of the genetic variants and genes under study. To address this shortcoming, one can use biological networks, which model functional relationships, to search for functionally related susceptibility loci. Many such network methods exist, each arising from different mathematical frameworks, pre-processing steps, and assumptions about the network properties of the susceptibility mechanism. Unsurprisingly, this results in disparate solutions. To explore how to exploit these heterogeneous approaches, we selected six network methods and applied them to GENESIS, a nationwide French study on familial breast cancer. First, we verified that network methods recovered more interpretable results than a standard GWAS. We addressed the heterogeneity of their solutions by studying their overlap, computing what we called the consensus. The key gene in this consensus solution was COPS5, a gene related to multiple cancer hallmarks. Another issue we observed was that network methods were unstable, selecting very different genes on different subsamples of GENESIS. Therefore, we proposed a stable consensus solution formed by the 68 genes most consistently selected across multiple subsamples. This solution was also enriched in genes known to be associated with breast cancer susceptibility (BLM, CASP8, CASP10, DNAJC1, FGFR2, MRPS30, and SLC4A7, P-value = 3 × 10⁻⁴). The most connected gene was CUL3, a regulator of several genes linked to cancer progression. Lastly, we evaluated the biases of each method and the impact of their parameters on the outcome. In general, network methods preferred highly connected genes, even after random rewirings that stripped the connections of any biological meaning. In conclusion, we present the advantages of network-guided GWAS, characterize their shortcomings, and provide strategies to address them. 
To compute the consensus networks, implementations of all six methods are available at https://github.com/hclimente/gwas-tools.
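The idea of a stable consensus, keeping the genes most consistently selected across subsamples, can be sketched as a selection-frequency ranking. The example below is a minimal illustration under assumptions: the function name, the tie-breaking rule, and the gene labels are hypothetical, and it is not the code from the gwas-tools repository.

```python
from collections import Counter


def stable_consensus(selections, top_k):
    """Return the top_k genes selected most often across subsample runs.

    selections: one iterable of gene names per subsample.
    Ties are broken alphabetically so the result is deterministic.
    """
    counts = Counter(gene for run in selections for gene in set(run))
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return [gene for gene, _ in ranked[:top_k]]


# Made-up selections from four subsamples (placeholder gene labels).
runs = [
    ["A", "B", "C"],
    ["A", "C", "D"],
    ["A", "B", "E"],
    ["A", "C", "B"],
]

consensus = stable_consensus(runs, top_k=3)
print(consensus)
```

Genes that appear in only one subsample (here D and E) drop out, which is the intended effect: the retained set depends less on which individuals happened to be sampled.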
Subject(s)
Breast Neoplasms , Genetic Predisposition to Disease/genetics , Genome-Wide Association Study/methods , Algorithms , Breast Neoplasms/epidemiology , Breast Neoplasms/genetics , Genetic Databases , Female , Humans , Single Nucleotide Polymorphism/genetics
ABSTRACT
BACKGROUND: Linking independent sources of data describing the same individuals enables innovative epidemiological and health studies but requires a robust record linkage approach. We describe a hybrid record linkage process to link databases from two independent ongoing French national studies, GEMO (Genetic Modifiers of BRCA1 and BRCA2), which focuses on the identification of genetic factors modifying cancer risk of BRCA1 and BRCA2 mutation carriers, and GENEPSO (prospective cohort of BRCAx mutation carriers), which focuses on environmental and lifestyle risk factors. METHODS: To identify as many as possible of the individuals participating in the two studies but not registered under a shared identifier, we combined probabilistic record linkage (PRL) and supervised machine learning (ML). This approach (named "PRL + ML") combined the candidate matches identified by both approaches. We built the ML model using the gold standard on a first version of the two databases as a training dataset. This gold standard was obtained from PRL-derived matches verified by an exhaustive manual review. RESULTS: The Random Forest (RF) algorithm showed the highest recall (0.985) among six widely used ML algorithms: RF, Bagged trees, AdaBoost, Support Vector Machine, Neural Network. Therefore, RF was selected to build the ML model since our goal was to identify the maximum number of true matches. Our combined linkage PRL + ML showed a higher recall (range 0.988-0.992) than either PRL (range 0.916-0.991) or ML (0.981) alone. It identified 1995 individuals participating in both GEMO (6375 participants) and GENEPSO (4925 participants). CONCLUSIONS: Our hybrid linkage process represents an efficient tool for linking GEMO and GENEPSO. It may be generalizable to other epidemiological studies involving other databases and registries.
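Why pooling candidate matches can only raise recall is easy to see on a toy example. The sketch below assumes that "combined" means a set union of the record pairs proposed by each approach; the pair identifiers and the gold standard are made up for illustration and do not come from GEMO or GENEPSO.

```python
def recall(predicted, truth):
    """Fraction of true matches that appear among the predicted matches."""
    return len(predicted & truth) / len(truth)


# Made-up gold standard: (record id in database 1, record id in database 2).
truth = {(1, 101), (2, 102), (3, 103), (4, 104)}

# Made-up candidate matches from each approach.
prl_matches = {(1, 101), (2, 102), (3, 103)}
ml_matches = {(2, 102), (3, 103), (4, 104), (5, 109)}

# Assumed combination rule: keep every pair proposed by either approach.
combined = prl_matches | ml_matches

print(recall(prl_matches, truth), recall(ml_matches, truth), recall(combined, truth))
```

The union also inherits the false positives of both approaches, here the pair (5, 109), so a recall-oriented combination of this kind is best followed by a verification step.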
Subject(s)
Breast Neoplasms , BRCA1 Protein/genetics , BRCA2 Protein/genetics , Cohort Studies , Factual Databases , Female , Genetic Predisposition to Disease , Humans , Mutation , Prospective Studies , Risk
ABSTRACT
Identification of the protein targets of hit molecules is essential in the drug discovery process. Target prediction with machine learning algorithms can help accelerate this search, limiting the number of required experiments. However, drug-target interaction databases used for training present high statistical bias, leading to a high number of false positives, thus increasing the time and cost of experimental validation campaigns. To minimize the number of false positives among predicted targets, we propose a new scheme for choosing negative examples, so that each protein and each drug appears an equal number of times in positive and negative examples. We artificially reproduce the process of target identification for three specific drugs, and more globally for 200 approved drugs. For the detailed three drug examples, and for the larger set of 200 drugs, training with the proposed scheme for the choice of negative examples improved target prediction results: the average number of false positives among the top-ranked predicted targets decreased, and overall, the rank of the true targets was improved. Our method corrects the statistical bias of databases and reduces the number of false positive predictions, and therefore the number of useless experiments potentially undertaken.
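One way to picture the balancing constraint is a greedy pass that gives each protein as many negative drugs as it has positive ones, always preferring the drugs that still lack negatives. The sketch below is an illustrative assumption, not the authors' algorithm: the function name, the greedy ordering, and the toy interactions are hypothetical. Proteins end up balanced whenever enough non-interacting drugs exist; drug balance is only approached as closely as the greedy order allows, although it is exact on this toy input.

```python
from collections import Counter


def balanced_negatives(positives, drugs):
    """Greedily pick negative (drug, protein) pairs that mirror positive counts."""
    positives = set(positives)
    drug_need = Counter(d for d, _ in positives)   # negatives still owed per drug
    prot_need = Counter(p for _, p in positives)   # negatives owed per protein
    negatives = []
    # Handle the most frequent proteins first; break ties by name.
    for prot, need in sorted(prot_need.items(), key=lambda kv: (-kv[1], kv[0])):
        candidates = [d for d in drugs if (d, prot) not in positives]
        # Prefer drugs that still need the most negatives.
        candidates.sort(key=lambda d: (-drug_need[d], d))
        for drug in candidates[:need]:
            negatives.append((drug, prot))
            drug_need[drug] -= 1
    return negatives


# Made-up known interactions.
positives = [("d1", "p1"), ("d2", "p1"), ("d1", "p2"),
             ("d3", "p3"), ("d4", "p4"), ("d3", "p4")]
negatives = balanced_negatives(positives, ["d1", "d2", "d3", "d4"])

pos_drugs = Counter(d for d, _ in positives)
neg_drugs = Counter(d for d, _ in negatives)
pos_prots = Counter(p for _, p in positives)
neg_prots = Counter(p for _, p in negatives)
print(negatives)
```

With such a set, a classifier can no longer lower its training error simply by learning that frequently annotated drugs or proteins are usually positive.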
Subject(s)
Computational Biology/methods , Drug Discovery/methods , Machine Learning , Pharmaceutical Preparations/chemistry , Proteins/chemistry , Software , Humans , Pharmaceutical Preparations/metabolism , Protein Interaction Mapping , Proteins/metabolism , Support Vector Machine
ABSTRACT
MOTIVATION: Finding non-linear relationships between biomolecules and a biological outcome is computationally expensive and statistically challenging. Existing methods have important drawbacks, including among others lack of parsimony, non-convexity and computational overhead. Here we propose block HSIC Lasso, a non-linear feature selector that does not present the previous drawbacks. RESULTS: We compare block HSIC Lasso to other state-of-the-art feature selection techniques in both synthetic and real data, including experiments over three common types of genomic data: gene-expression microarrays, single-cell RNA sequencing and genome-wide association studies. In all cases, we observe that features selected by block HSIC Lasso retain more information about the underlying biology than those selected by other techniques. As a proof of concept, we applied block HSIC Lasso to a single-cell RNA sequencing experiment on mouse hippocampus. We discovered that many genes linked in the past to brain development and function are involved in the biological differences between the types of neurons. AVAILABILITY AND IMPLEMENTATION: Block HSIC Lasso is implemented in the Python 2/3 package pyHSICLasso, available on PyPI. Source code is available on GitHub (https://github.com/riken-aip/pyHSICLasso). SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
Subject(s)
Biomarkers , Genome-Wide Association Study , Software , Animals , Genome , Genomics , Mice
ABSTRACT
Prioritizing missense variants for further experimental investigation is a key challenge in current sequencing studies for exploring complex and Mendelian diseases. A large number of in silico tools have been employed for the task of pathogenicity prediction, including PolyPhen-2, SIFT, FatHMM, MutationTaster-2, MutationAssessor, Combined Annotation Dependent Depletion, LRT, phyloP, and GERP++, as well as optimized methods of combining tool scores, such as Condel and Logit. Due to the wealth of these methods, an important practical question to answer is which of these tools generalize best, that is, correctly predict the pathogenic character of new variants. We here demonstrate in a study of 10 tools on five datasets that such a comparative evaluation of these tools is hindered by two types of circularity: they arise due to (1) the same variants or (2) different variants from the same protein occurring both in the datasets used for training and for evaluation of these tools, which may lead to overly optimistic results. We show that comparative evaluations of predictors that do not address these types of circularity may erroneously conclude that circularity confounded tools are most accurate among all tools, and may even outperform optimized combinations of tools.
Subject(s)
Computational Biology/methods , Missense Mutation , Software , Datasets as Topic , Humans , Internet , Reproducibility of Results , Web Browser
Subject(s)
Computational Biology/methods , Computational Biology/standards , Data Analysis , Research Design/standards , Antirheumatic Agents/therapeutic use , Rheumatoid Arthritis/drug therapy , Rheumatoid Arthritis/genetics , Computational Biology/statistics & numerical data , Genome-Wide Association Study , Humans , Single Nucleotide Polymorphism , Predictive Value of Tests , Research Design/statistics & numerical data
ABSTRACT
MOTIVATION: As an increasing number of genome-wide association studies reveal the limitations of the attempt to explain phenotypic heritability by single genetic loci, there is a recent focus on associating complex phenotypes with sets of genetic loci. Although several methods for multi-locus mapping have been proposed, it is often unclear how to relate the detected loci to the growing knowledge about gene pathways and networks. The few methods that take biological pathways or networks into account are either restricted to investigating a limited number of predetermined sets of loci or do not scale to genome-wide settings. RESULTS: We present SConES, a new efficient method to discover sets of genetic loci that are maximally associated with a phenotype while being connected in an underlying network. Our approach is based on a minimum cut reformulation of the problem of selecting features under sparsity and connectivity constraints, which can be solved exactly and rapidly. SConES outperforms state-of-the-art competitors in terms of runtime, scales to hundreds of thousands of genetic loci and exhibits higher power in detecting causal SNPs in simulation studies than other methods. On flowering time phenotypes and genotypes from Arabidopsis thaliana, SConES detects loci that enable accurate phenotype prediction and that are supported by the literature. AVAILABILITY: Code is available at http://webdav.tuebingen.mpg.de/u/karsten/Forschung/scones/. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
Subject(s)
Genetic Loci , Genome-Wide Association Study/methods , Phenotype , Single Nucleotide Polymorphism , Arabidopsis/genetics , Arabidopsis/growth & development , Flowers , Genotype , Humans
ABSTRACT
Between 30% and 70% of patients with breast cancer have pre-existing chronic conditions, and more than half are on long-term non-cancer medication at the time of diagnosis. Preliminary epidemiological evidence suggests that some non-cancer medications may affect breast cancer risk, recurrence, and survival. In this nationwide cohort study, we assessed the association between medication use at breast cancer diagnosis and survival. We included 235,368 French women with newly diagnosed non-metastatic breast cancer. In analyses of 288 medications, we identified eight medications positively associated with either overall survival or disease-free survival: rabeprazole, alverine, atenolol, simvastatin, rosuvastatin, estriol (vaginal or transmucosal), nomegestrol, and hypromellose; and eight medications negatively associated with overall survival or disease-free survival: ferrous fumarate, prednisolone, carbimazole, pristinamycin, oxazepam, alprazolam, hydroxyzine, and mianserin. Full results are available online from an interactive platform ( https://adrenaline.curie.fr ). This resource provides hypotheses for drugs that may naturally influence breast cancer evolution.
Subject(s)
Breast Neoplasms , Humans , Female , Breast Neoplasms/drug therapy , Breast Neoplasms/epidemiology , Breast Neoplasms/pathology , Cohort Studies , Comorbidity , Simvastatin
ABSTRACT
Due to recent advances in genotyping technologies, mapping phenotypes to single loci in the genome has become a standard technique in statistical genetics. However, one-locus mapping fails to explain much of the phenotypic variance in complex traits. Here, we present GLIDE, which maps phenotypes to pairs of genetic loci and systematically searches for the epistatic interactions expected to reveal part of this missing heritability. GLIDE makes use of the computational power of consumer-grade graphics cards to detect such interactions via linear regression. This enabled us to conduct a systematic two-locus mapping study on seven disease data sets from the Wellcome Trust Case Control Consortium and on in-house hippocampal volume data in 6 h per data set, while current single CPU-based approaches require more than a year's time to complete the same task.
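A two-locus mapping by linear regression is usually written as a model with an interaction term, tested for every pair of loci. The form below is the standard textbook model, given here as an assumption about the general setting rather than as GLIDE's exact test statistic; the symbols ($y_i$ for the phenotype of individual $i$, $x_{i1}$ and $x_{i2}$ for the genotypes at the two loci) and the value $m = 500{,}000$ are illustrative.

```latex
y_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \beta_{12}\, x_{i1} x_{i2} + \varepsilon_i,
\qquad H_0 : \beta_{12} = 0
% number of pairs to test for m loci:
\binom{m}{2} = \frac{m(m-1)}{2}, \qquad
m = 500{,}000 \;\Rightarrow\; \binom{m}{2} \approx 1.25 \times 10^{11}
```

The quadratic growth in the number of tests is what makes an exhaustive search impractical on a single CPU and well suited to massively parallel hardware.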
Subject(s)
Chromosome Mapping/methods , Computational Biology/methods , Genetic Epistasis , Genetic Predisposition to Disease , Bipolar Disorder/diagnosis , Bipolar Disorder/epidemiology , Bipolar Disorder/genetics , Factual Databases , Genetic Loci , Population Genetics/methods , Genome-Wide Association Study , Hippocampus/anatomy & histology , Humans , Linear Models , Organ Size , Phenotype , Single Nucleotide Polymorphism , Reproducibility of Results , Time Factors
ABSTRACT
We present a network-based protocol to discover susceptibility genes in case-control genome-wide association studies (GWASs). In short, this protocol looks for biomarkers that are informative of disease status and interconnected in an underlying biological network. This boosts discovery and interpretability. Moreover, the protocol tackles the instability of network methods, producing a stable set of genes most likely to replicate in external cohorts. To apply the procedure to a provided GWAS dataset, install the required software and execute our command-line tool. For complete details on the use and execution of this protocol, please refer to Climente-González et al. (1).
Subject(s)
Genome-Wide Association Study , Software , Genome-Wide Association Study/methods
ABSTRACT
Genome-Wide Association Studies, or GWAS, aim at finding Single Nucleotide Polymorphisms (SNPs) that are associated with a phenotype of interest. GWAS are known to suffer from the large dimensionality of the data with respect to the number of available samples. Other limiting factors include the dependency between SNPs, due to linkage disequilibrium (LD), and the need to account for population structure, that is to say, confounding due to genetic ancestry. We propose an efficient approach for the multivariate analysis of multi-population GWAS data based on a multitask group Lasso formulation. Each task corresponds to a subpopulation of the data, and each group to an LD-block. This formulation alleviates the curse of dimensionality, and makes it possible to identify disease LD-blocks shared across populations/tasks, as well as some that are specific to one population/task. In addition, we use stability selection to increase the robustness of our approach. Finally, gap safe screening rules speed up computations enough that our method can run at a genome-wide scale. To our knowledge, this is the first framework for GWAS on diverse populations combining feature selection at the LD-group level, a multitask approach to address population structure, stability selection, and safe screening rules. We show that our approach outperforms state-of-the-art methods on both simulated and real-world cancer datasets.
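A generic multitask group Lasso objective of the kind described above can be written as follows. This is a sketch under assumptions: a squared loss is used for readability, and the notation ($T$ tasks, $n_t$ samples in task $t$, genotype matrix $X^{(t)}$, phenotype $y^{(t)}$, coefficient matrix $B$ with one column $\beta^{(t)}$ per task, and $B_g$ the rows of $B$ belonging to LD-block $g$) is introduced here; the loss and group weights actually used in the paper may differ.

```latex
\min_{B} \;
\sum_{t=1}^{T} \frac{1}{2 n_t} \left\| y^{(t)} - X^{(t)} \beta^{(t)} \right\|_2^2
\; + \; \lambda \sum_{g \in \mathcal{G}} \left\| B_g \right\|_F
```

Because the penalty acts on whole blocks $B_g$, an LD-block is either discarded for all populations or kept, with coefficients that remain free to differ from one population to another.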
Subject(s)
Computational Biology , Genome-Wide Association Study , Population Genetics , Humans , Linkage Disequilibrium , Phenotype , Single Nucleotide Polymorphism
ABSTRACT
To address the lack of statistical power and interpretability of genome-wide association studies (GWAS), gene-level analyses combine the p-values of individual single nucleotide polymorphisms (SNPs) into gene statistics. However, using all SNPs mapped to a gene, including those with low association scores, can mask the association signal of a gene. We therefore propose a new two-step strategy, consisting of first selecting the SNPs most associated with the phenotype within a given gene, before testing their joint effect on the phenotype. The recently proposed kernelPSI framework for kernel-based post-selection inference makes it possible to model non-linear relationships between features, as well as to obtain valid p-values that account for the selection step. In this paper, we show how we adapted kernelPSI to the setting of quantitative GWAS, using kernels to model epistatic interactions between neighboring SNPs, and post-selection inference to determine the joint effect of selected blocks of SNPs on a phenotype. We illustrate this tool on the study of two continuous phenotypes from the UKBiobank. We show that kernelPSI can be successfully used to study GWAS data and detect genes associated with a phenotype through the signal carried by the most strongly associated regions of these genes. In particular, we show that kernelPSI enjoys more statistical power than other gene-based GWAS tools, such as SKAT or MAGMA. kernelPSI is an effective tool to combine SNP-based and gene-based analyses of GWAS data, and can be used successfully to improve both statistical performance and interpretability of GWAS.
Subject(s)
Computational Biology , Genome-Wide Association Study , Humans , Phenotype , Single Nucleotide Polymorphism
ABSTRACT
BACKGROUND: Detecting epistatic interactions at the gene level is essential to understanding the biological mechanisms of complex diseases. Unfortunately, genome-wide interaction association studies involve many statistical challenges that make such detection hard. We propose a multi-step protocol for epistasis detection along the edges of a gene-gene co-function network. Such an approach reduces the number of tests performed and provides interpretable interactions while keeping type I error controlled. Yet, mapping gene interactions into testable single-nucleotide polymorphism (SNP)-interaction hypotheses, as well as computing gene pair association scores from SNP pair ones, is not trivial. RESULTS: Here we compare 3 SNP-gene mappings (positional overlap, expression quantitative trait loci, and proximity in 3D structure) and use the adaptive truncated product method to compute gene pair scores. This method is non-parametric, does not require a known null distribution, and is fast to compute. We apply multiple variants of this protocol to a genome-wide association study dataset on inflammatory bowel disease. Different configurations produced different results, highlighting that various mechanisms are implicated in inflammatory bowel disease, while at the same time, results overlapped with known disease characteristics. Importantly, the proposed pipeline also differs from a conventional approach where no network is used, showing the potential for additional discoveries when prior biological knowledge is incorporated into epistasis detection.
Subject(s)
Genetic Epistasis , Genome-Wide Association Study , Genome-Wide Association Study/methods , Phenotype , Single Nucleotide Polymorphism , Quantitative Trait Loci
ABSTRACT
BACKGROUND: For the most part, genome-wide association studies (GWAS) have only partially explained the heritability of complex diseases. One of their limitations is to assume independent contributions of individual variants to the phenotype. Many tools have therefore been developed to investigate the interactions between distant loci, or epistasis. Among them, the recently proposed EpiGWAS models the interactions between a target variant and the rest of the genome. However, applying this approach to studying interactions along all genes of a disease map is not straightforward. Here, we propose a pipeline to that effect, which we illustrate by investigating a multiple sclerosis GWAS dataset from the Wellcome Trust Case Control Consortium 2 through 19 disease maps from the MetaCore pathway database. RESULTS: For each disease map, we build an epistatic network by connecting the genes that are deemed to interact. These networks tend to be connected, complementary to the disease maps and contain hubs. In addition, we report 4 epistatic gene pairs involving missense variants, and 25 gene pairs with a deleterious epistatic effect mediated by eQTLs. Among these, we highlight the interaction of GLI-1 and SUFU, and of IP10 and NF-κB, as they both match known biological interactions. The latter pair is particularly promising for therapeutic development, as both genes have known inhibitors. CONCLUSIONS: Our study showcases the ability of EpiGWAS to uncover biologically interpretable epistatic interactions that are potentially actionable for the development of combination therapy.
Subject(s)
Genetic Epistasis , Multiple Sclerosis , Case-Control Studies , Genome-Wide Association Study , Humans , Multiple Sclerosis/genetics , Phenotype
ABSTRACT
PURPOSE: Administering systemic anticancer treatment (SACT) to patients near death can negatively affect their health-related quality of life. Late SACT administrations should be avoided in these cases. Machine learning techniques could be used to build decision support tools leveraging registry data for clinicians to limit late SACT administration. MATERIALS AND METHODS: Patients with advanced lung cancer who were treated at the Department of Oncology, Aalborg University Hospital and died between 2010 and 2019 were included (N = 2,368). Diagnoses, treatments, biochemical data, and histopathologic results were used to train predictive models of 30-day mortality using logistic regression with elastic net penalty, random forest, gradient tree boosting, multilayer perceptron, and long short-term memory network. The importance of the variables and the clinical utility of the models were evaluated. RESULTS: The random forest and gradient tree boosting models outperformed other models, whereas the artificial neural network-based models underperformed. Adding summary variables had a modest effect on performance with an increase in average precision from 0.500 to 0.505 and from 0.498 to 0.509 for the gradient tree boosting and random forest models, respectively. Biochemical results alone contained most of the information with a limited degradation of the performances when fitting models with only these variables. The utility analysis showed that by applying a simple threshold to the predicted risk of 30-day mortality, 40% of late SACT administrations could have been prevented at the cost of 2% of patients stopping their treatment 90 days before death. CONCLUSION: This study demonstrates the potential of a decision support tool to limit late SACT administration in patients with cancer. Further work is warranted to refine the model, build an easy-to-use prototype, and conduct a prospective validation study.
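The utility analysis rests on a simple operation: flag an administration when the predicted 30-day mortality risk exceeds a threshold, then count how many late administrations are caught and how many others are flagged by mistake. The sketch below illustrates that bookkeeping with made-up risks and outcomes. The function name and the threshold are hypothetical, and the cost measure is a simplification that does not reproduce the 90-day criterion used in the study.

```python
def utility(records, threshold):
    """Share of late and non-late administrations flagged at a risk threshold.

    records: (predicted_risk, died_within_30_days) per planned administration.
    Returns (share of late administrations flagged, share of others flagged).
    """
    late = [risk for risk, died in records if died]
    not_late = [risk for risk, died in records if not died]
    prevented = sum(risk >= threshold for risk in late) / len(late)
    withheld = sum(risk >= threshold for risk in not_late) / len(not_late)
    return prevented, withheld


# Made-up predictions and outcomes, for illustration only.
records = [
    (0.9, True), (0.7, True), (0.4, True), (0.2, True), (0.1, True),
    (0.8, False), (0.3, False), (0.2, False), (0.1, False), (0.05, False),
]

prevented, withheld = utility(records, threshold=0.6)
print(prevented, withheld)
```

Sweeping the threshold traces the trade-off between the two quantities, which is how an operating point such as the one reported above would be chosen.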
Subject(s)
Lung Neoplasms , Quality of Life , Humans , Machine Learning , Logistic Models , Lung Neoplasms/diagnosis , Lung Neoplasms/drug therapy , Computer Neural Networks
ABSTRACT
More and more biologists and bioinformaticians turn to machine learning to analyze large amounts of data. In this context, it is crucial to understand which is the most suitable data analysis pipeline for achieving reliable results. This process may be challenging, due to a variety of factors, the most crucial ones being the data type and the general goal of the analysis (e.g., explorative or predictive). Life science data sets require further consideration as they often contain measures with a low signal-to-noise ratio, high-dimensional observations, and relatively few samples. In this complex setting, regularization, which can be defined as the introduction of additional information to solve an ill-posed problem, is the tool of choice to obtain robust models. Different regularization practices may be used depending both on characteristics of the data and of the question asked, and different choices may lead to different results. In this article, we provide a comprehensive description of the impact and importance of regularization techniques in life science studies. In particular, we provide an intuition of what regularization is and of the different ways it can be implemented and exploited. We propose four general life sciences problems in which regularization is fundamental and should be exploited for robustness. For each of these large families of problems, we enumerate different techniques as well as examples and case studies. Lastly, we provide a unified view of how to approach each data type with various regularization techniques.
Subject(s)
Algorithms , Biological Science Disciplines , Machine Learning
ABSTRACT
BACKGROUND: Breast cancer (BC) is the most frequent cancer and the leading cause of cancer-related death in women. The French National Cancer Institute has created a national cancer cohort to promote cancer research and improve our understanding of cancer using the National Health Data System (SNDS) and amalgamating all cancer sites. So far, no detailed separate data are available for early BC. OBJECTIVES: To describe the creation of the French Early Breast Cancer Cohort (FRESH). METHODS: All French women aged 18 years or over, with early-stage BC newly diagnosed between 1 January 2011 and 31 December 2017, treated by surgery, and registered in the general health insurance coverage plan were included in the cohort. Patients with suspected locoregional or distant metastases at diagnosis were excluded. BC treatments (surgery, chemotherapy, targeted therapy, radiotherapy, and endocrine therapy) and diagnostic procedures (biopsy, cytology, and imaging) were extracted from hospital discharge reports, outpatient care notes, or pharmacy drug delivery data. The BC subtype was inferred from the treatments received. RESULTS: We included 235,368 patients with early BC in the cohort (median age: 60 years). The BC subtype distribution was as follows: luminal (80.2%), triple-negative (TNBC, 9.5%), HER2+ (10.3%), or unidentifiable (n = 44,388, 18.9% of the cohort). Most patients underwent radiotherapy (n = 200,685, 85.3%) and endocrine therapy (n = 165,655, 70.4%), and 38.3% (n = 90,252) received chemotherapy. Treatments and care pathways are described. CONCLUSIONS: The FRESH Cohort is an unprecedented population-based resource facilitating future large-scale real-life studies aiming to improve care pathways and quality of care for BC patients.