ABSTRACT
BACKGROUND: The clustering of data produced by liquid chromatography coupled to mass spectrometry analyses (LC-MS data) has recently gained interest as a way to extract meaningful chemical or biological patterns. However, recent instrumental pipelines deliver data whose size, dimensionality and expected number of clusters are too large to be processed by classical machine learning algorithms, so that most state-of-the-art methods rely on single-pass linkage-based algorithms. RESULTS: We propose a clustering algorithm that solves the powerful but computationally demanding kernel k-means objective function in a scalable way. As a result, it can process LC-MS data in an acceptable time on a multicore machine. To do so, we combine three essential features: a compressive data representation, Nyström approximation and a hierarchical strategy. In addition, we propose new kernels based on optimal transport, which can be interpreted as intuitive similarity measures between chromatographic elution profiles. CONCLUSIONS: Our method, referred to as CHICKN, is evaluated on proteomics data produced in our lab, as well as on benchmark data from the literature. From a computational viewpoint, it is particularly efficient on raw LC-MS data. From a data analysis viewpoint, it provides clusters which differ from those resulting from state-of-the-art methods, while achieving similar performance. This highlights the complementarity of algorithms based on different principles for extracting the best from complex LC-MS data.
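As a rough illustration of how a Nyström approximation makes the kernel k-means objective tractable, the sketch below maps the data to low-dimensional features whose inner products approximate the kernel matrix, then runs ordinary k-means on those features. It is not the CHICKN implementation: the compressive representation, the hierarchical strategy and the optimal-transport kernels are omitted, a Gaussian (RBF) kernel is used as a stand-in, and all function names and parameters (`rbf_kernel`, `nystrom_features`, `kmeans`, `gamma`, `n_landmarks`) are assumptions made for this example.

```python
import numpy as np

def rbf_kernel(A, B, gamma):
    # Gaussian kernel, used here only as a stand-in (assumption);
    # the abstract proposes optimal-transport kernels instead.
    d2 = (A ** 2).sum(1)[:, None] + (B ** 2).sum(1)[None, :] - 2.0 * A @ B.T
    return np.exp(-gamma * np.maximum(d2, 0.0))

def nystrom_features(X, n_landmarks, gamma, rng):
    # Pick landmark points uniformly at random, without replacement.
    idx = rng.choice(len(X), size=n_landmarks, replace=False)
    landmarks = X[idx]
    W = rbf_kernel(landmarks, landmarks, gamma)  # landmark-landmark block
    C = rbf_kernel(X, landmarks, gamma)          # data-landmark block
    # Pseudo-inverse square root of W through its eigendecomposition.
    vals, vecs = np.linalg.eigh(W)
    keep = vals > 1e-8 * vals.max()
    W_inv_sqrt = vecs[:, keep] / np.sqrt(vals[keep])
    # Rows of Z satisfy Z @ Z.T ~= K, the full kernel matrix.
    return C @ W_inv_sqrt

def kmeans(Z, k, n_iter, rng):
    # Farthest-point initialisation, then plain Lloyd iterations.
    centers = [Z[rng.integers(len(Z))]]
    for _ in range(1, k):
        d2 = np.min([((Z - c) ** 2).sum(1) for c in centers], axis=0)
        centers.append(Z[d2.argmax()])
    centers = np.array(centers)
    for _ in range(n_iter):
        d2 = ((Z[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
        labels = d2.argmin(1)
        for j in range(k):
            if np.any(labels == j):
                centers[j] = Z[labels == j].mean(0)
    return labels
```

Running Euclidean k-means on the rows of `Z` approximately minimises the kernel k-means objective, because squared distances between rows of `Z` approximate squared distances in the kernel feature space; the cost scales with the number of landmarks rather than with the full n-by-n kernel matrix.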
Subject(s)
Algorithms, Cluster Analysis, Peptides, Proteomics, Liquid Chromatography, Data Compression, Mass Spectrometry, Peptides/chemistry, Proteomics/methods
ABSTRACT
The term "spectral clustering" is sometimes used to refer to the clustering of mass spectrometry data. However, it also classically refers to a family of popular clustering algorithms. To avoid confusion, a more specific term could advantageously be coined.
Subject(s)
Cluster Analysis, Mass Spectrometry/methods, Terminology as Topic, Algorithms, Proteomics/methods
ABSTRACT
Memristive structures are among the most promising candidates for components of neuromorphic devices. However, the formation of HfO2-based devices in crossbar arrays requires considerable time, since electroforming is a single stochastic operation. In this study, we investigate how Ar+ plasma immersion ion implantation (PI) affects the electroforming voltage of Pt/HfO2 (4 nm)/HfOxNy (3 nm)/TaN structures. The advantage of PI is the simultaneous and uniform processing of the entire wafer. It is thought that Ar+ implantation creates defects in the oxide matrix, with the majority of the oxygen anions being shifted in the direction of the TaN electrode. We demonstrate that it is feasible to reduce the electroforming voltage from 7.1 V to values below 3 V by carefully selecting the implantation energy. A considerable decrease in the electroforming voltage was achieved at an implantation energy that distributed the recoils over the whole thickness of the oxide without significantly affecting the HfOxNy/TaN interface. By contrast, Ar+ PI at higher and lower energies did not produce a comparable decrease in the electroforming voltage. It is also possible to obtain current self-compliance in the structure during electroforming after PI with an energy below 2 keV.
ABSTRACT
An important question in microbiology is whether treatment causes changes in gut flora, and whether it also affects metabolism. The reconstruction of causal relations purely from non-temporal observational data is challenging. We address the problem of causal inference in the bivariate case, where the joint distribution of two variables is observed. We consider, in particular, data on discrete domains. The state-of-the-art causal inference methods for continuous data suffer from high computational complexity. Some modern approaches are not suitable for categorical data, and others need to estimate and fix multiple hyper-parameters. In this contribution, we introduce a novel method of causal inference based on the widely used assumption that if X causes Y, then P(X) and P(Y|X) are independent. We propose to explore a semi-supervised approach where P(Y|X) and P(X) are estimated from labeled and unlabeled data respectively, so that the marginal probability can be estimated from potentially much more (cheap, unlabeled) data than the conditional distribution. We validate the proposed method on the standard cause-effect pairs. We show through experiments on several benchmarks of biological network reconstruction that the proposed approach is very competitive with state-of-the-art methods in terms of computational time and accuracy. Finally, we apply the proposed method to an original medical task in which we study whether drugs confound the human metagenome.
Subject(s)
Computational Biology/methods, Statistical Models, Neural Networks (Computer), Causality, Genetic Databases, Gastrointestinal Microbiome/drug effects, Humans, Metagenome/genetics, Metformin/pharmacology, Supervised Machine Learning
ABSTRACT
BACKGROUND: Histones organize DNA into chromatin through a variety of processes. Among them, a vast diversity of histone variants can be incorporated into chromatin and finely modulate its organization and functionality. Classically, the study of histone variants has largely relied on antibody-based assays. However, antibodies have a limited ability to discriminate between highly similar histone variants. RESULTS: In this study, we established a mass spectrometry-based analysis to address this challenge. We developed a targeted proteomics method, using selected reaction monitoring or parallel reaction monitoring, to quantify as many histone variants as possible in a single multiplexed assay, even when histones are present in a crude extract. This strategy was developed on H2A and H2B variants, using 55 peptides corresponding to 25 different histone sequences, a few of which differ by a single amino acid. The methodology was then applied to mouse testis extracts, in which almost all histone variants are expressed. It confirmed the abundance profiles of several testis-specific histones during successive stages of spermatogenesis and the existence of predicted H2A.L.1 isoforms. This methodology was also used to explore the over-expression pattern of H2A.L.1 isoforms in a mouse model of male infertility. CONCLUSIONS: Our results demonstrate that targeted proteomics is a powerful method to quantify highly similar histone variants and isoforms. The developed method can easily be transposed to the study of human histone variants, whose abundance can be deregulated in various diseases.