ABSTRACT
The rapid development of single-cell DNA sequencing (scDNA-seq) technology has greatly enhanced the resolution of tumor cell profiling, providing an unprecedented perspective for characterizing intra-tumoral heterogeneity and understanding tumor progression and metastasis. However, prominent algorithms for constructing tumor phylogenies from scDNA-seq data usually take only single nucleotide variations (SNVs) as markers and fail to consider the effect of copy number alterations (CNAs). Here, we propose BiTSC$^2$, Bayesian inference of Tumor clonal Tree by joint analysis of Single-Cell SNV and CNA data. BiTSC$^2$ takes raw reads from scDNA-seq as input, accounts for the overlap of CNAs and SNVs, models the allelic dropout rate, sequencing errors and missing rate, and assigns single cells to subclones. By applying Markov Chain Monte Carlo sampling, BiTSC$^2$ can simultaneously estimate the subclonal scCNA and scSNV genotype matrices, subclonal assignments and tumor subclonal evolutionary tree. In comparison with existing methods on synthetic and real tumor data, BiTSC$^2$ shows high accuracy in genotype recovery, subclonal assignment and tree reconstruction. BiTSC$^2$ also performs robustly on scDNA-seq data with low sequencing depth and variant missing rate. BiTSC$^2$ software is available at https://github.com/ucasdp/BiTSC2.
Subject(s)
Neoplasms , Algorithms , Bayes Theorem , DNA Copy Number Variations , Humans , Neoplasms/genetics , DNA Sequence Analysis , Software
ABSTRACT
In tandem mass spectrometry-based proteomics, proteins are digested into peptides by specific protease(s), but generally only a fraction of the peptides can be detected. To characterize detectable proteotypic peptides, we have developed a series of methods to predict peptide digestibility and detectability. Here, we propose a bidirectional long short-term memory (BiLSTM)-based algorithm, named DeepDetect, for the prediction of peptide detectability enhanced by peptide digestibility. Compared with existing algorithms, DeepDetect offers improved prediction accuracy for a wide range of commonly used proteases, covering trypsin, ArgC, chymotrypsin, GluC, LysC, AspN, LysN, and LysargiNase. On 11 test data sets from E. coli, yeast, mouse, and human samples, DeepDetect achieved higher prediction accuracies than PepFormer, a state-of-the-art deep-learning-based peptide detectability prediction algorithm. The results further demonstrated that peptide digestibility can substantially enhance the performance of peptide detectability predictors. As an application, DeepDetect was used to reduce the in silico predicted spectral libraries in data-independent acquisition mass spectrometry data analysis. Experiments using DIA-NN software showed that DeepDetect can significantly accelerate the library search without loss of peptide and protein identification sensitivity.
Subject(s)
Deep Learning , Animals , Mice , Humans , Escherichia coli/metabolism , Peptides/chemistry , Proteins/analysis , Tandem Mass Spectrometry/methods , Saccharomyces cerevisiae/metabolism , Peptide Hydrolases/metabolism , Peptide Library , Proteome/analysis
ABSTRACT
The traditional approaches to false discovery rate (FDR) control in multiple hypothesis testing are usually based on the null distribution of a test statistic. However, all types of null distributions, including the theoretical, permutation-based and empirical ones, have some inherent drawbacks. For example, the theoretical null might fail because of improper assumptions on the sample distribution. Here, we propose a null distribution-free approach to FDR control for multiple hypothesis testing in the case-control study. This approach, named the target-decoy procedure, simply builds on the ordering of tests by some statistic or score, the null distribution of which is not required to be known. Competitive decoy tests are constructed from permutations of the original samples and are used to estimate the false target discoveries. We prove that this approach controls the FDR when the score function is symmetric and the scores are independent between different tests. Simulations demonstrate that it is more stable and powerful than two popular traditional approaches, even in the presence of dependency. Evaluation is also made on two real datasets, including an Arabidopsis genomics dataset and a COVID-19 proteomics dataset.
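The idea of estimating false target discoveries from competing decoys can be illustrated with a minimal sketch. It assumes a generic target-decoy competition rule (each test keeps the larger of its target and decoy score, and the FDR among the top-ranked winners is estimated as (decoys + 1) / targets); the abstract does not spell out the exact estimator or the permutation scheme, so the function name `target_decoy_select`, the tie handling, and the "+1" correction are assumptions made for illustration only.

```python
from typing import List, Sequence


def target_decoy_select(
    target_scores: Sequence[float],
    decoy_scores: Sequence[float],
    alpha: float = 0.05,
) -> List[int]:
    """Return indices of tests reported as discoveries at estimated FDR <= alpha.

    Hypothetical helper: target_scores[i] is the score of test i on the
    original samples, decoy_scores[i] its score on permuted samples.
    """
    # Competition step: each test is labelled by whichever score is larger.
    # Ties are counted as decoy wins (a conservative, assumed choice).
    winners = []
    for i, (t, d) in enumerate(zip(target_scores, decoy_scores)):
        if t > d:
            winners.append((t, True, i))
        else:
            winners.append((d, False, i))

    # Rank all winners by score, best first.
    winners.sort(key=lambda w: w[0], reverse=True)

    # Find the largest cut-off k whose estimated FDR is within alpha.
    best_k = 0
    n_target = 0
    n_decoy = 0
    for k, (_, is_target, _) in enumerate(winners, start=1):
        if is_target:
            n_target += 1
        else:
            n_decoy += 1
        if (n_decoy + 1) / max(n_target, 1) <= alpha:
            best_k = k

    # Only target winners above the cut-off are reported.
    return sorted(i for _, is_target, i in winners[:best_k] if is_target)
```

No null distribution of the score is used anywhere in this sketch: only the ordering of the scores and the target/decoy labels matter, which is the property the abstract emphasizes.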
ABSTRACT
MOTIVATION: Single-cell sequencing (SCS) data provide unprecedented insights into intratumoral heterogeneity. With SCS, we can better characterize clonal genotypes and reconstruct phylogenetic relationships of tumor cells/clones. However, SCS data are often error-prone, making their computational analysis challenging. RESULTS: To infer the clonal evolution in tumor from the error-prone SCS data, we developed an efficient computational framework, termed RobustClone. It recovers the true genotypes of subclones based on the extended robust principal component analysis, a low-rank matrix decomposition method, and reconstructs the subclonal evolutionary tree. RobustClone is a model-free method, which can be applied to both single-cell single nucleotide variation (scSNV) and single-cell copy-number variation (scCNV) data. It is efficient and scalable to large-scale datasets. We conducted a set of systematic evaluations on simulated datasets and demonstrated that RobustClone outperforms state-of-the-art methods in large-scale data both in accuracy and efficiency. We further validated RobustClone on two scSNV and two scCNV datasets and demonstrated that RobustClone could recover genotype matrix and infer the subclonal evolution tree accurately under various scenarios. In particular, RobustClone revealed the spatial progression patterns of subclonal evolution on the large-scale 10X Genomics scCNV breast cancer dataset. AVAILABILITY AND IMPLEMENTATION: RobustClone software is available at https://github.com/ucasdp/RobustClone. CONTACT: lwan@amss.ac.cn or maliang@ioz.ac.cn. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
Subject(s)
Neoplasms , Software , Algorithms , Clonal Evolution/genetics , Genomics , Humans , Neoplasms/genetics , Phylogeny
ABSTRACT
The open (mass tolerant) search of tandem mass spectra of peptides shows great potential in the comprehensive detection of post-translational modifications (PTMs) in shotgun proteomics. However, this search strategy has not been widely used by the community, and one bottleneck is the lack of appropriate algorithms for automated and reliable post-processing of the coarse and error-prone search results. Here we present PTMiner, a software tool for confident filtering and localization of modifications (mass shifts) detected in an open search. After mass-shift-grouped false discovery rate (FDR) control of peptide-spectrum matches (PSMs), PTMiner uses an empirical Bayesian method to localize modifications through iterative learning of the prior probabilities of each type of modification occurring on different amino acids. The performance of PTMiner was evaluated on three data sets, including simulated data, chemically synthesized peptide library data and modified-peptide spiked-in proteome data. The results showed that PTMiner can effectively control the PSM FDR and accurately localize the modification sites. At 1% real false localization rate (FLR), PTMiner localized 93%, 84% and 83% of the modification sites in the three data sets, respectively, far higher than the two open search engines we used and an extended version of the Ascore localization algorithm. We then used PTMiner to analyze a draft map of the human proteome containing 25 million spectra from 30 tissues, and confidently identified over 1.7 million modified PSMs at 1% FDR and 1% FLR, which provided a system-wide view of both known and unknown PTMs in the human proteome.
Subject(s)
Peptides/chemistry , Post-Translational Protein Processing , Proteomics/methods , Protein Databases , Humans , Search Engine , Software
ABSTRACT
BACKGROUND: In shotgun proteomics, database searching of tandem mass spectra results in a great number of peptide-spectrum matches (PSMs), many of which are false positives. Quality control of PSMs is a multiple hypothesis testing problem, and the false discovery rate (FDR) or the posterior error probability (PEP) is the commonly used statistical confidence measure. PEP, also called local FDR, can evaluate the confidence of individual PSMs and thus is more desirable than FDR, which evaluates the global confidence of a collection of PSMs. Estimation of PEP can be achieved by decomposing the null and alternative distributions of PSM scores as long as the given data are sufficient. However, in many proteomic studies, only a group (subset) of PSMs, e.g. those with specific post-translational modifications, is of interest. The group can be very small, making direct PEP estimation from the group data inaccurate, especially in the high-score region where the score threshold is set. Using the whole set of PSMs to estimate the group PEP is also inappropriate, because the null and/or alternative distributions of the group can be very different from those of the combined scores. RESULTS: The transfer PEP algorithm is proposed to more accurately estimate the PEPs of peptide identifications in small groups. Transfer PEP derives the group null distribution through its empirical relationship with the combined null distribution, and estimates the group alternative distribution, as well as the null proportion, using an iterative semi-parametric method. Validated on both simulated data and real proteomic data, transfer PEP showed remarkably higher accuracy than the direct combined and separate PEP estimation methods. CONCLUSIONS: We presented a novel approach to group PEP estimation for small groups and implemented it for the peptide identification problem in proteomics. The methodology of the approach is in principle applicable to the small-group PEP estimation problems in other fields.
Subject(s)
Mathematical Computing , Peptides/chemistry , Algorithms , Probability , Post-Translational Protein Processing , Proteomics , Tandem Mass Spectrometry
ABSTRACT
MOTIVATION: Visualizing and reconstructing cell developmental trajectories intrinsically embedded in high-dimensional expression profiles of single-cell RNA sequencing (scRNA-seq) snapshot data are computationally intriguing, but challenging. RESULTS: We propose DensityPath, an algorithm allowing (i) visualization of the intrinsic structure of scRNA-seq data on an embedded 2-d space and (ii) reconstruction of an optimal cell state-transition path on the density landscape. DensityPath powerfully handles high dimensionality and heterogeneity of scRNA-seq data by (i) revealing the intrinsic structures of data, while adopting a non-linear dimension reduction algorithm, termed elastic embedding, which can preserve both local and global structures of the data; and (ii) extracting the topological features of high-density, level-set clusters from a single-cell multimodal density landscape of transcriptional heterogeneity, as the representative cell states. DensityPath reconstructs the optimal cell state-transition path by finding the geodesic minimum spanning tree of representative cell states on the density landscape, establishing a least action path with the minimum-transition-energy of cell fate decisions. We demonstrate that DensityPath can ably reconstruct complex trajectories of cell development, e.g. those with multiple bifurcating and trifurcating branches, while maintaining computational efficiency. Moreover, DensityPath has high accuracy for pseudotime calculation and branch assignment on real scRNA-seq, as well as simulated datasets. DensityPath is robust to parameter choices, as well as permutations of data. AVAILABILITY AND IMPLEMENTATION: DensityPath software is available at https://github.com/ucasdp/DensityPath. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
Subject(s)
RNA/genetics , Algorithms , Gene Expression Profiling , RNA Sequence Analysis , Single-Cell Analysis
ABSTRACT
Motivation: Single-cell RNA sequencing (scRNA-seq) has become a valuable tool for studying cellular heterogeneity. However, the analysis of scRNA-seq data is challenging because of inherent noise and technical variability. Existing methods often struggle to simultaneously explore heterogeneity across cells, handle dropout events, and account for batch effects. These drawbacks call for a robust and comprehensive method that can address these challenges and provide accurate insights into heterogeneity at the single-cell level. Results: In this study, we introduce scVIC, an algorithm designed to account for variational inference, while simultaneously handling biological heterogeneity and batch effects at the single-cell level. scVIC explicitly models both biological heterogeneity and technical variability to learn cellular heterogeneity in a manner free from dropout events and the bias of batch effects. By leveraging variational inference, we provide a robust framework for inferring the parameters of scVIC. To test the performance of scVIC, we employed both simulated and biological scRNA-seq datasets, either including, or not, batch effects. scVIC was found to outperform other approaches because of its superior clustering ability and circumvention of the batch effects problem. Availability and implementation: The code of scVIC and replication for this study are available at https://github.com/HiBearME/scVIC/tree/v1.0.
ABSTRACT
Robust Principal Component Analysis (RPCA) offers a powerful tool for recovering a low-rank matrix from highly corrupted data, with growing applications in computational biology. Biological processes commonly form intrinsic hierarchical structures, such as tree structures of cell development trajectories and tumor evolutionary history. The rapid development of single-cell sequencing (SCS) technology calls for the recovery of embedded tree structures from noisy and heterogeneous SCS data. In this study, we propose RobustTree, a unified framework to reconstruct the inherent topological structure underlying high-dimensional data with noise. By extending RPCA to handle tree structure optimization, RobustTree leverages data denoising, clustering, and tree structure reconstruction. It solves the tree optimization problem with an adaptive parameter selection scheme that we proposed. In addition to recovering real datasets, RobustTree can reconstruct continuous topological structure and discrete-state topological structure of underlying SCS data. We apply RobustTree on multiple synthetic and real datasets and demonstrate its high accuracy and robustness when analyzing high-noise SCS data with embedded complex structures. The code is available at https://github.com/ucasdp/RobustTree.
ABSTRACT
The National Materials Data Management and Service platform (NMDMS) is a materials data repository for the publication and sharing of heterogeneous materials scientific data. It follows the FAIR principles: Findable, Accessible, Interoperable, and Reusable. To ensure data are 'Interoperable', NMDMS uses a user-friendly semi-structured scientific data model, named 'dynamic container', to define, exchange, and store heterogeneous scientific data. Then, a personalized yet standardized data submission subsystem, a rigorous project data review and publication subsystem, and a multi-granularity data query and retrieval subsystem collaboratively make data 'Reusable', 'Findable', and 'Accessible'. Finally, China's "National Key R&D Program: Material Genetic Engineering Key Special Project" has adopted NMDMS to publish and share its project data. Since 2018, 12,251,040 pieces of data have been published in NMDMS, under 87 categories and 1,912 user-defined schemas from 45 projects. The platform has been accessed 908,875 times, and 2,403,208 pieces of data have been downloaded. In short, NMDMS effectively accelerates the publication and sharing of material project data in China.
ABSTRACT
The dramatic increase in the amount and size of single-cell RNA sequencing data calls for more efficient and scalable dimensionality reduction and visualization tools. Here, we design a GPU-accelerated method, NeuralEE, which combines the advantages of elastic embedding and neural networks. We show that NeuralEE is both scalable and generalizable in dimensionality reduction and visualization of large-scale scRNA-seq data. In addition, the GPU-based implementation of NeuralEE makes it usable with limited computational resources while maintaining high performance: it takes only half an hour to visualize 1.3 million mouse brain cells. NeuralEE also generalizes to newly generated data, which allows such data to be integrated.
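For readers unfamiliar with elastic embedding, the following minimal sketch evaluates the elastic embedding objective as it is commonly stated in the literature: an attractive term weighted by affinities `w_plus` plus, scaled by `lam`, a repulsive term weighted by `w_minus`. The abstract does not give NeuralEE's exact loss, mini-batching, or network architecture, so the function name, argument names, and the plain-Python implementation are illustrative assumptions, not the authors' code.

```python
import math
from typing import Sequence


def elastic_embedding_objective(
    X: Sequence[Sequence[float]],
    w_plus: Sequence[Sequence[float]],
    w_minus: Sequence[Sequence[float]],
    lam: float,
) -> float:
    """Elastic embedding objective for low-dimensional coordinates X.

    E(X) = sum_{n != m} w_plus[n][m] * ||x_n - x_m||^2
         + lam * sum_{n != m} w_minus[n][m] * exp(-||x_n - x_m||^2)
    """
    n_points = len(X)
    attract = 0.0  # pulls similar points together
    repel = 0.0    # pushes all points apart
    for n in range(n_points):
        for m in range(n_points):
            if n == m:
                continue
            # Squared Euclidean distance in the embedding space.
            d2 = sum((a - b) ** 2 for a, b in zip(X[n], X[m]))
            attract += w_plus[n][m] * d2
            repel += w_minus[n][m] * math.exp(-d2)
    return attract + lam * repel
```

In a neural-network variant, the coordinates `X` would be produced by a network applied to the expression profiles and this quantity minimized with respect to the network weights; that reading is an assumption based on the abstract's description.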
ABSTRACT
Analysis directly linking genomics, neuroimaging phenotypes and clinical measurements is crucial for understanding psychiatric disorders, but remains rare. Here, we describe a multi-scale analysis using genome-wide SNPs, gene expression, grey matter volume (GMV), and the positive and negative syndrome scale scores (PANSS) to explore the etiology of schizophrenia. With 72 drug-naive first-episode schizophrenia patients (FEPs) and 73 matched healthy controls, we identified 108 genes, from schizophrenia risk genes, that correlated significantly with GMV and are highly co-expressed in the brain during development. Among these 108 candidates, 19 distinct genes were found to be associated with 16 brain regions referred to as hot clusters (HCs), primarily in the frontal cortex, sensory-motor regions and temporal and parietal regions. The patients were subtyped into three groups with distinguishable PANSS scores by the GMV of the identified HCs. Furthermore, we found that HCs with common GMV among patient groups are related to genes that mostly mapped to pathways relevant to neural signaling, which are associated with the risk for schizophrenia. Our results provide an integrated view of how genetic variants may affect brain structures that lead to distinct disease phenotypes. The multi-scale analysis method described here may help to advance the understanding of the etiology of schizophrenia.
Subject(s)
Brain/pathology , Schizophrenia/diagnosis , Female , Genetic Predisposition to Disease , Gray Matter/pathology , Humans , Male , Single Nucleotide Polymorphism/genetics , Schizophrenia/classification
ABSTRACT
The study of single amino acid variations (SAVs) of proteins, resulting from single nucleotide polymorphisms, is of great importance for understanding the relationships between genotype and phenotype. In mass spectrometry based shotgun proteomics, identification of peptides with SAVs often suffers from high error rates on the variant sites detected. These site errors are due to multiple reasons and can be confirmed by manual inspection or genomic sequencing. Here, we present a software tool, named SAVControl, for site-level quality control of variant peptide identifications. It mainly includes strict false discovery rate control of variant peptide identifications and variant site verification by unrestrictive mass shift relocalization. SAVControl was validated on three colorectal adenocarcinoma cell line datasets with genomic sequencing evidence and tested on a colorectal cancer dataset from The Cancer Genome Atlas. The results show that SAVControl can effectively remove false detections of SAVs. SIGNIFICANCE: Protein sequence variations caused by single nucleotide polymorphisms (SNPs) are single amino acid variations (SAVs). The investigation of SAVs may provide a chance for understanding the relationships between genotype and phenotype. Mass spectrometry (MS) based proteomics provides a large-scale way to detect SAVs. However, using the current analysis strategy to detect SAVs may lead to a high rate of false positives. The SAVControl we present here is a computational workflow and software tool for site-level quality control of SAVs detected by MS. It assesses the confidence of detected variant sites by relocating the mass shift responsible for an SAV to search for alternative interpretations. In addition, it uses a strict false discovery rate control method for variant peptide identifications. The advantages of SAVControl were demonstrated on three colorectal adenocarcinoma cell line datasets and a colorectal cancer dataset. We believe that SAVControl will be a powerful tool for computational proteomics and proteogenomics.
Subject(s)
Amino Acid Substitution , Amino Acids/analysis , Proteomics/methods , Tandem Mass Spectrometry/methods , Adenocarcinoma/genetics , Adenocarcinoma/metabolism , Amino Acid Sequence , Amino Acids/chemistry , Amino Acids/genetics , Colorectal Neoplasms/genetics , Colorectal Neoplasms/metabolism , Protein Databases , Datasets as Topic , HCT116 Cells , Humans , Missense Mutation , Peptides/analysis , Peptides/chemistry , Peptides/genetics , Single Nucleotide Polymorphism , Quality Control , Search Engine , Software , Tandem Mass Spectrometry/standards , Cultured Tumor Cells
ABSTRACT
Type 2 diabetes (T2D) is a complex, polygenic disease for which a complete picture of the development mechanisms is still lacking. To better understand these mechanisms, we examined gene expression profiles of multiple tissues from outbred mice fed with a high-fat diet (HFD) or regular chow at weeks 1, 9, and 18. To analyze such complex data, we proposed a novel dual eigen-analysis, in which the sample- and gene-eigenvectors correspond respectively to the macro- and micro-biology information. The dual eigen-analysis identified the HFD eigenvectors as well as the endogenous eigenvectors for each tissue. The results imply that HFD influences the hepatic function or the pancreatic development as an exogenous factor, while in adipose tissue HFD's impact roughly coincides with the endogenous eigenvector driven by aging. The enrichment analysis of the eigenvectors revealed diverse HFD impacts on the three tissues over time, including inflammation, degradation of branched chain amino acids (BCAA), and regulation of peroxisome proliferator activated receptor gamma (PPARγ). We found that in the pancreas a remarkable up-regulation of angiogenesis, downstream of the HIF signaling pathway, precedes hyperinsulinemia. The dual eigen-analysis and discoveries provide new evaluations/guidance in T2D prevention and therapy, and will also promote new thinking in biology and medicine.
Subject(s)
Type 2 Diabetes Mellitus/genetics , Gene Expression Profiling , Organ Specificity/genetics , Adiponectin/metabolism , Adipose Tissue/metabolism , Branched-Chain Amino Acids/metabolism , Animals , Cholesterol/biosynthesis , High-Fat Diet , Down-Regulation/genetics , Insulin/metabolism , Liver/metabolism , Mice , PPAR gamma/metabolism , Pancreas/metabolism , Signal Transduction , Up-Regulation/genetics
ABSTRACT
Piwi-interacting RNA (piRNA) is a class of small non-coding RNAs about 24 to 32 nucleotides long, associated with PIWI proteins, which are involved in germline development, transposon silencing, and epigenetic regulation. Identification of piRNA loci on the genome is very useful for further studies of the biogenesis and function of piRNAs. To accomplish this, we applied the computational biology tool Teiresias to identify motifs of variable length appearing frequently in mouse piRNA and non-piRNA sequences, respectively, and then proposed an algorithm for piRNA identification based on motif discovery, termed "Pibomd". By using these sequence motifs as features in the Support Vector Machine (SVM) algorithm, a sensitivity of 91.48% and a specificity of 89.76% could be achieved on a mouse test dataset, much better results than those reported for previously published algorithms. We also trained an unbalanced SVM classifier (named "Asym-Pibomd") that provided a higher specificity (96.2%) and a lower sensitivity (72.68%) than Pibomd. Although its prediction accuracy (ACC) is lower than that of Pibomd, the ACC of Asym-Pibomd (84.44%) is about ten percent higher than that obtained using the k-mer method. Further analysis of the motif positions on the piRNA sequences showed that the piRNA sequences may contain information at the 5'- and/or 3'-end recognized by the piRNA processing apparatus of actual piRNA precursors. Furthermore, this prediction method is available on a user-friendly web server at http://app.aporc.org/Pibomd/.
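The reported figures can be cross-checked with a short calculation. The sketch below assumes that the test set contains equal numbers of piRNA and non-piRNA sequences, in which case accuracy is the mean of sensitivity and specificity; the abstract does not state the test-set composition, so this is an assumption, and `balanced_accuracy` is a hypothetical helper name.

```python
def balanced_accuracy(sensitivity: float, specificity: float) -> float:
    """Accuracy (in percent) when positives and negatives are equally frequent."""
    return round((sensitivity + specificity) / 2.0, 2)


# Asym-Pibomd: sensitivity 72.68%, specificity 96.2% (from the abstract).
asym_acc = balanced_accuracy(72.68, 96.2)      # 84.44

# Pibomd: sensitivity 91.48%, specificity 89.76% (from the abstract).
# The resulting value is implied by the balanced-set assumption only;
# it is not a figure stated in the abstract.
pibomd_acc = balanced_accuracy(91.48, 89.76)   # 90.62
```

Under this assumption the Asym-Pibomd value matches the reported ACC of 84.44%, and the implied Pibomd value is higher, in line with the statement that Asym-Pibomd trades accuracy for specificity.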
Subject(s)
Nucleic Acid Conformation , Small Interfering RNA/genetics , Algorithms , Animals , Computational Biology , Internet , Mice , Molecular Models , RNA Sequence Analysis , Software , Support Vector Machine
ABSTRACT
BACKGROUND: Upwards of 1200 miRNA loci have hitherto been annotated in the human genome. The specific features defining a miRNA precursor and deciding its recognition and subsequent processing are not yet exhaustively described and miRNA loci can thus not be computationally identified with sufficient confidence. RESULTS: We rendered pre-miRNA and non-pre-miRNA hairpins as strings of integrated sequence-structure information, and used the software Teiresias to identify sequence-structure motifs (ss-motifs) of variable length in these data sets. Using only ss-motifs as features in a Support Vector Machine (SVM) algorithm for pre-miRNA identification achieved 99.2% specificity and 97.6% sensitivity on a human test data set, which is comparable to previously published algorithms employing combinations of sequence-structure and additional features. Further analysis of the ss-motif information contents revealed strongly significant deviations from those of the respective training sets, revealing important potential clues as to how the sequence and structural information of RNA hairpins are utilized by the miRNA processing apparatus. CONCLUSION: Integrated sequence-structure motifs of variable length apparently capture nearly all information required to distinguish miRNA precursors from other stem-loop structures.