Results 1 - 20 of 75
1.
Mol Cell Proteomics ; 23(3): 100738, 2024 Mar.
Article in English | MEDLINE | ID: mdl-38364992

ABSTRACT

Wind is one of the most prevalent environmental forces entraining plants to develop various mechano-responses, collectively called thigmomorphogenesis. Largely unknown is how plants transduce these versatile wind force signals downstream to nuclear events and to the development of thigmomorphogenic phenotype or anemotropic response. To identify molecular components at the early steps of the wind force signaling, two mechanical signaling-related phosphoproteins, identified from our previous phosphoproteomic study of Arabidopsis touch response, mitogen-activated protein kinase kinase 1 (MKK1) and 2 (MKK2), were selected for performing in planta TurboID (ID)-based quantitative proximity-labeling (PL) proteomics. This quantitative biotinylproteomics was separately performed on MKK1-ID and MKK2-ID transgenic plants, respectively, using the genetically engineered TurboID biotin ligase expression transgenics as a universal control. This unique PTM proteomics successfully identified 11 and 71 MKK1 and MKK2 putative interactors, respectively. Biotin occupancy ratio (BOR) was found to be an alternative parameter to measure the extent of proximity and specificity between the proximal target proteins and the bait fusion protein. Bioinformatics analysis of these biotinylprotein data also found that TurboID biotin ligase favorably labels the loop region of target proteins. A WInd-Related Kinase 1 (WIRK1), previously known as rapidly accelerated fibrosarcoma (Raf)-like kinase 36 (RAF36), was found to be a putative common interactor for both MKK1 and MKK2 and preferentially interacts with MKK2. Further molecular biology studies of the Arabidopsis RAF36 kinase found that it plays a role in wind regulation of the touch-responsive TCH3 and CML38 gene expression and the phosphorylation of a touch-regulated PATL3 phosphoprotein. 
Measurement of leaf morphology and shoot gravitropic response of the wirk1 (raf36) mutant revealed that the WIRK1 gene is involved in both wind-triggered rosette thigmomorphogenesis and gravitropism of Arabidopsis stems, suggesting that the WIRK1 (RAF36) protein probably functions upstream of both MKK1 and MKK2 and that it may serve as the crosstalk point among multiple mechano-signal transduction pathways mediating both wind mechano-response and gravitropism.


Subject(s)
Arabidopsis Proteins , Arabidopsis , Arabidopsis/genetics , Arabidopsis/metabolism , Gravitropism , Biotin/metabolism , Wind , Arabidopsis Proteins/genetics , Arabidopsis Proteins/metabolism , Phosphoproteins/metabolism , Ligases/metabolism , Calmodulin/metabolism
2.
J Proteome Res ; 23(6): 1960-1969, 2024 Jun 07.
Article in English | MEDLINE | ID: mdl-38770571

ABSTRACT

Peptide identification is important in bottom-up proteomics. Post-translational modifications (PTMs) are crucial in regulating cellular activities. Many database search methods have been developed to identify peptides with PTMs and characterize the PTM patterns. However, the PTMs on peptides hinder the peptide identification rate and the PTM characterization precision, especially for peptides with multiple PTMs. To address this issue, we present a sensitive open search engine, PIPI2, with much better performance on peptides with multiple PTMs than other methods. With a greedy approach, we simplify the PTM characterization problem into a linear one, which enables characterizing multiple PTMs on one peptide. On the simulation data sets with up to four PTMs per peptide, PIPI2 identified over 90% of the spectra, at least 56% more than five other competitors. PIPI2 also characterized these PTM patterns with the highest precision of 77%, demonstrating a significant advantage in handling peptides with multiple PTMs. In real applications, PIPI2 identified 30% to 88% more peptides with PTMs than its competitors.


Subject(s)
Databases, Protein , Peptides , Protein Processing, Post-Translational , Proteomics , Search Engine , Peptides/chemistry , Peptides/metabolism , Proteomics/methods , Humans , Software , Amino Acid Sequence , Algorithms
3.
Diabetologia ; 67(5): 837-849, 2024 May.
Article in English | MEDLINE | ID: mdl-38413437

ABSTRACT

AIMS/HYPOTHESIS: The aim of this study was to describe the metabolome in diabetic kidney disease (DKD) and its association with incident CVD in type 2 diabetes, and identify prognostic biomarkers. METHODS: From a prospective cohort of individuals with type 2 diabetes, baseline sera (N=1991) were quantified for 170 metabolites using NMR spectroscopy with median 5.2 years of follow-up. Associations of chronic kidney disease (CKD, eGFR<60 ml/min per 1.73 m²) or severely increased albuminuria with each metabolite were examined using linear regression, adjusted for confounders and multiplicity. Associations between DKD (CKD or severely increased albuminuria)-related metabolites and incident CVD were examined using Cox regressions. Metabolomic biomarkers were identified and assessed for CVD prediction and replicated in two independent cohorts. RESULTS: At false discovery rate (FDR)<0.05, 156 metabolites were associated with DKD (151 for CKD and 128 for severely increased albuminuria), including apolipoprotein B-containing lipoproteins, HDL, fatty acids, phenylalanine, tyrosine, albumin and glycoprotein acetyls. Over 5.2 years of follow-up, 75 metabolites were associated with incident CVD at FDR<0.05. A model comprising age, sex and three metabolites (albumin, triglycerides in large HDL and phospholipids in small LDL) performed comparably to conventional risk factors (C statistic 0.765 vs 0.762, p=0.893) and adding the three metabolites further improved CVD prediction (C statistic from 0.762 to 0.797, p=0.014) and improved discrimination and reclassification. The 3-metabolite score was validated in independent Chinese and Dutch cohorts. CONCLUSIONS/INTERPRETATION: Altered metabolomic signatures in DKD are associated with incident CVD and improve CVD risk stratification.


Subject(s)
Cardiovascular Diseases , Diabetes Mellitus, Type 2 , Diabetic Nephropathies , Renal Insufficiency, Chronic , Humans , Diabetic Nephropathies/metabolism , Cardiovascular Diseases/complications , Prospective Studies , Hong Kong/epidemiology , Albuminuria , Biological Specimen Banks , Glomerular Filtration Rate , Biomarkers , Albumins
4.
BMC Bioinformatics ; 24(1): 351, 2023 Sep 20.
Article in English | MEDLINE | ID: mdl-37730532

ABSTRACT

BACKGROUND: Cross-linking mass spectrometry (XL-MS) is a powerful technique for detecting protein-protein interactions (PPIs) and modeling protein structures in a high-throughput manner. In XL-MS experiments, proteins are cross-linked by a chemical reagent (namely cross-linker), fragmented, and then analyzed by tandem mass spectrometry (MS/MS). Cross-linkers are either cleavable or non-cleavable, and each type requires distinct data analysis tools. However, both types of cross-linkers suffer from imbalanced fragmentation efficiency, resulting in a large number of unidentifiable spectra that hinder the discovery of PPIs and protein conformations. To address this challenge, researchers have sought to improve the sensitivity of XL-MS through invention of novel cross-linking reagents, optimization of sample preparation protocols, and development of data analysis algorithms. One promising approach to developing new data analysis methods is to apply a protein feedback mechanism in the analysis. It has significantly improved the sensitivity of analysis methods in the cleavable cross-linking data. The application of the protein feedback mechanism to the analysis of non-cleavable cross-linking data is expected to have an even greater impact because the majority of XL-MS experiments currently employ non-cleavable cross-linkers. RESULTS: In this study, we applied the protein feedback mechanism to the analysis of both non-cleavable and cleavable cross-linking data and observed a substantial improvement in cross-link spectrum matches (CSMs) compared to conventional methods. Furthermore, we developed a new software program, ECL 3.0, that integrates two algorithms and includes a user-friendly graphical interface to facilitate wider applications of this new program. CONCLUSIONS: ECL 3.0 source code is available at https://github.com/yuweichuan/ECL-PF.git . A quick tutorial is available at https://youtu.be/PpZgbi8V2xI .


Subject(s)
Peptides , Tandem Mass Spectrometry , Algorithms , Cross-Linking Reagents , Data Analysis
5.
J Proteome Res ; 22(1): 101-113, 2023 01 06.
Article in English | MEDLINE | ID: mdl-36480279

ABSTRACT

Improving the sensitivity of protein-protein interaction detection and protein structure probing is a principal challenge in cross-linking mass spectrometry (XL-MS) data analysis. In this paper, we propose an exhaustive cross-linking search method with protein feedback (ECL-PF) for cleavable XL-MS data analysis. ECL-PF adopts an optimized α/β mass detection scheme and establishes protein-peptide association during the identification of cross-linked peptides. Existing major scoring functions can all benefit from the ECL-PF workflow to a great extent. In comparisons using synthetic data sets and hybrid simulated data sets, ECL-PF achieved 3-fold higher sensitivity over standard techniques. In experiments using real data sets, it also identified 65.6% more cross-link spectrum matches and 48.7% more unique cross-links.


Subject(s)
Peptides , Proteins , Feedback , Proteins/chemistry , Peptides/analysis , Mass Spectrometry/methods , Cross-Linking Reagents/chemistry
6.
Brief Bioinform ; 21(4): 1448-1454, 2020 07 15.
Article in English | MEDLINE | ID: mdl-31267129

ABSTRACT

For genome-wide CRISPR off-target cleavage sites (OTS) prediction, an important issue is data imbalance: the number of true OTS recognized by whole-genome off-target detection techniques is much smaller than that of all possible nucleotide mismatch loci, making the training of machine learning models very challenging. Therefore, computational models proposed for OTS prediction and scoring should be carefully designed and properly evaluated in order to avoid bias. In our study, two tools are taken as examples to further emphasize the data imbalance issue in CRISPR off-target prediction to achieve better sensitivity and specificity for optimized CRISPR gene editing. We would like to indicate that (1) the benchmark of CRISPR off-target prediction should be properly evaluated and not overestimated by considering the data imbalance issue; (2) incorporation of efficient computational techniques (including ensemble learning and data synthesis techniques) can help to address the data imbalance issue and improve the performance of CRISPR off-target prediction. Taken together, we call for more efforts to address the data imbalance issue in CRISPR off-target prediction to facilitate clinical utility of CRISPR-based gene editing techniques.


Subject(s)
Clustered Regularly Interspaced Short Palindromic Repeats , Gene Editing/methods , Machine Learning
7.
Cardiovasc Diabetol ; 21(1): 293, 2022 12 31.
Article in English | MEDLINE | ID: mdl-36587202

ABSTRACT

OBJECTIVE: High-density lipoproteins (HDL) comprise particles of different size, density and composition and their vasoprotective functions may differ. Diabetes modifies the composition and function of HDL. We assessed associations of HDL size-based subclasses with incident cardiovascular disease (CVD) and mortality and their prognostic utility. RESEARCH DESIGN AND METHODS: HDL subclasses by nuclear magnetic resonance spectroscopy were determined in sera from 1991 fasted adults with type 2 diabetes (T2D) consecutively recruited from March 2014 to February 2015 in Hong Kong. HDL was divided into small, medium, large and very large subclasses. Associations (per SD increment) with outcomes were evaluated using multivariate Cox proportional hazards models. C-statistic, integrated discrimination index (IDI), and categorical and continuous net reclassification improvement (NRI) were used to assess predictive value. RESULTS: Over median (IQR) 5.2 (5.0-5.4) years, 125 participants developed incident CVD and 90 participants died. Small HDL particles (HDL-P) were inversely associated with incident CVD [hazard ratio (HR) 0.65 (95% CI 0.52, 0.81)] and all-cause mortality [0.47 (0.38, 0.59)] (false discovery rate < 0.05). Very large HDL-P were positively associated with all-cause mortality [1.75 (1.19, 2.58)]. Small HDL-P improved prediction of mortality [C-statistic 0.034 (0.013, 0.055), IDI 0.052 (0.014, 0.103), categorical NRI 0.156 (0.006, 0.252), and continuous NRI 0.571 (0.246, 0.851)] and CVD [IDI 0.017 (0.003, 0.038) and continuous NRI 0.282 (0.088, 0.486)] over the RECODe model. CONCLUSION: Small HDL-P were inversely associated with incident CVD and all-cause mortality and improved risk stratification for adverse outcomes in people with T2D. HDL-P may be used as markers for residual risk in people with T2D.


Subject(s)
Cardiovascular Diseases , Diabetes Mellitus, Type 2 , Adult , Humans , Diabetes Mellitus, Type 2/diagnosis , Biological Specimen Banks , Hong Kong/epidemiology , Risk Factors , Lipoproteins, HDL , Cholesterol, HDL
8.
Development ; 144(12): 2153-2164, 2017 06 15.
Article in English | MEDLINE | ID: mdl-28506995

ABSTRACT

Cell delamination is a conserved morphogenetic process important for the generation of cell diversity and maintenance of tissue homeostasis. Here, we used Drosophila embryonic neuroblasts as a model to study the apical constriction process during cell delamination. We observe dynamic myosin signals both around the cell adherens junctions and underneath the cell apical surface in the neuroectoderm. On the cell apical cortex, the nonjunctional myosin forms flows and pulses, which are termed medial myosin pulses. Quantitative differences in medial myosin pulse intensity and frequency are crucial to distinguish delaminating neuroblasts from their neighbors. Inhibition of medial myosin pulses blocks delamination. The fate of a neuroblast is set apart from that of its neighbors by Notch signaling-mediated lateral inhibition. When we inhibit Notch signaling activity in the embryo, we observe that small clusters of cells undergo apical constriction and display an abnormal apical myosin pattern. Together, these results demonstrate that a contractile actomyosin network across the apical cell surface is organized to drive apical constriction in delaminating neuroblasts.


Subject(s)
Drosophila Proteins/metabolism , Drosophila melanogaster/embryology , Drosophila melanogaster/metabolism , Myosins/metabolism , Neural Stem Cells/metabolism , Animals , Animals, Genetically Modified , Apoptosis , Cell Differentiation , Drosophila melanogaster/cytology , Models, Neurological , Morphogenesis/physiology , Neural Stem Cells/cytology , Neurogenesis/physiology , Receptors, Notch/metabolism , Signal Transduction
9.
Bioinformatics ; 35(2): 251-257, 2019 01 15.
Article in English | MEDLINE | ID: mdl-30649350

ABSTRACT

Motivation: Cross-linking technique coupled with mass spectrometry (MS) is widely used in the analysis of protein structures and protein-protein interactions. In order to identify cross-linked peptides from MS data, we need to consider all pairwise combinations of peptides, which is computationally prohibitive when the sequence database is large. To alleviate this problem, some heuristic screening strategies are used to reduce the number of peptide pairs during the identification. However, heuristic screening strategies may miss some true cross-linked peptides. Results: We directly tackle the combination challenge without using any screening strategies. With the data structure of double-ended queue, the proposed algorithm reduces the quadratic time complexity of exhaustive searching down to the linear time complexity. We implement the algorithm in a tool named Xolik. The running time of Xolik is validated using databases with different numbers of proteins. Experiments using synthetic and empirical datasets show that Xolik outperforms existing tools in terms of running time and statistical power. Availability and implementation: Source code and binaries of Xolik are freely available at http://bioinformatics.ust.hk/Xolik.html. Supplementary information: Supplementary data are available at Bioinformatics online.
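The role of the double-ended queue can be illustrated with a minimal sketch. It is not the Xolik implementation: it assumes each peptide carries a single additive score, that a valid pair is one whose summed mass falls within a tolerance of a target mass, and that a peptide may pair with itself. The names `best_pair`, `target_mass` and `tol` are hypothetical. After sorting by mass, one pass with a monotonic deque finds the best-scoring partner for every peptide without enumerating all pairs.

```python
from collections import deque


def best_pair(peptides, target_mass, tol):
    """Return (best_total_score, (i, j)) or None if no pair fits.

    peptides: list of (mass, score) tuples (hypothetical input format).
    A pair (i, j) is valid when |mass_i + mass_j - target_mass| <= tol.
    """
    asc = sorted(range(len(peptides)), key=lambda k: peptides[k][0])
    desc = asc[::-1]          # partners are admitted in descending mass order
    window = deque()          # partner indices, scores decreasing front to back
    right = 0                 # next position in desc to admit
    best = None
    for i in asc:
        m_i, s_i = peptides[i]
        lo = target_mass - m_i - tol
        hi = target_mass - m_i + tol
        # Admit partners whose mass has come into range from below (mass >= lo).
        while right < len(desc) and peptides[desc[right]][0] >= lo:
            j = desc[right]
            # Drop partners that can never beat j while j stays in the window.
            while window and peptides[window[-1]][1] <= peptides[j][1]:
                window.pop()
            window.append(j)
            right += 1
        # Evict partners that are now too heavy (mass > hi).
        while window and peptides[window[0]][0] > hi:
            window.popleft()
        if window:
            j = window[0]     # front of the deque holds the best valid partner
            total = s_i + peptides[j][1]
            if best is None or total > best[0]:
                best = (total, (i, j))
    return best


peptides = [(100.0, 1.0), (200.0, 5.0), (300.0, 2.0), (400.0, 3.0)]
print(best_pair(peptides, target_mass=500.0, tol=0.5))
```

Each index enters and leaves the deque at most once, so the scan after sorting is linear in the number of peptides, whereas checking every pair would be quadratic.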


Subject(s)
Databases, Protein , Peptides/chemistry , Protein Interaction Mapping/methods , Proteins/chemistry , Software , Algorithms , Computational Biology , Mass Spectrometry
10.
Mol Cell Proteomics ; 17(5): 1010-1027, 2018 05.
Article in English | MEDLINE | ID: mdl-29440448

ABSTRACT

Protein acetylation, one of many types of post-translational modifications (PTMs), is involved in a variety of biological and cellular processes. In the present study, we applied both CsCl density gradient (CDG) centrifugation-based protein fractionation and a dimethyl-labeling-based 4C quantitative PTM proteomics workflow in the study of dynamic acetylproteomic changes in Arabidopsis. This workflow integrates the dimethyl chemical labeling with chromatography-based acetylpeptide separation and enrichment followed by mass spectrometry (MS) analysis, the extracted ion chromatogram (XIC) quantitation-based computational analysis of mass spectrometry data to measure dynamic changes of acetylpeptide level using an in-house software program, named Stable isotope-based Quantitation-Dimethyl labeling (SQUA-D), and finally the confirmation of ethylene hormone-regulated acetylation using immunoblot analysis. Eventually, using this proteomic approach, 7456 unambiguous acetylation sites were found from 2638 different acetylproteins, and 5250 acetylation sites, including 5233 sites on lysine side chain and 17 sites on protein N termini, were identified repetitively. Out of these repetitively discovered acetylation sites, 4228 sites on lysine side chain (i.e. 80.5%) are novel. These acetylproteins are exemplified by the histone superfamily, ribosomal and heat shock proteins, and proteins related to stress/stimulus responses and energy metabolism. The novel acetylproteins enriched by the CDG centrifugation fractionation contain many cellular trafficking proteins, membrane-bound receptors, and receptor-like kinases, which are mostly involved in brassinosteroid, light, gravity, and development signaling. In addition, we identified 12 highly conserved acetylation site motifs within histones, P-glycoproteins, actin depolymerizing factors, ATPases, transcription factors, and receptor-like kinases. 
Using SQUA-D software, we have quantified 33 ethylene hormone-enhanced and 31 hormone-suppressed acetylpeptide groups, also called unique PTM peptide arrays (UPAs), that share the identical unique PTM site pattern (UPSP). This CDG centrifugation protein fractionation, in combination with dimethyl labeling-based quantitative PTM proteomics and SQUA-D, may be applied in the quantitation of any PTM proteins in any model eukaryotes and agricultural crops as well as tissue samples of animals and human beings.


Subject(s)
Arabidopsis Proteins/metabolism , Arabidopsis/metabolism , Proteomics/methods , Staining and Labeling , Acetylation , Amino Acid Sequence , Chromatography, Liquid , Computational Biology , Ethylenes/pharmacology , Histones/metabolism , Methylation , Reproducibility of Results , Tandem Mass Spectrometry
11.
Brief Bioinform ; 18(6): 928-939, 2017 Nov 01.
Article in English | MEDLINE | ID: mdl-27687799

ABSTRACT

The goal of genome-wide association studies (GWASs) is to discover genetic variants associated with diseases/traits. Replication is a common validation method in GWASs. We regard an association as a true finding when it shows significance in both primary and replication studies. A question worth pondering is what is the probability of a primary association (i.e. a statistically significant association in the primary study) being validated in the replication study? This article systematically reviews the answers to this question from different points of view. As Bayesian methods can help us integrate out the uncertainty about the underlying effect of the primary association, we will mainly focus on the Bayesian view in this article. We refer to the Bayesian replication probability as the replication rate (RR). We further describe an estimation method for RR, which makes use of the summary statistics from the primary study. We can use the estimated RR to determine the sample size of the replication study and to check the consistency between the results of the primary study and those of the replication study. We describe an R-package to estimate and apply RR in GWASs. Simulation and real data experiments show that the estimated RR has good prediction and calibration performance. We also use these data to demonstrate the usefulness of RR. The R-package is available at http://bioinformatics.ust.hk/RRate.html.


Subject(s)
Algorithms , Bayes Theorem , Diabetes Mellitus, Type 2/genetics , Genetic Predisposition to Disease , Genome-Wide Association Study/methods , Polymorphism, Single Nucleotide , Case-Control Studies , Computer Simulation , Humans , Reproducibility of Results
12.
Bioinformatics ; 34(10): 1741-1749, 2018 05 15.
Article in English | MEDLINE | ID: mdl-29329369

ABSTRACT

Motivation: Individual genetic variants explain only a small fraction of heritability in some diseases. Some variants have weak marginal effects on disease risk, but their joint effects are significantly stronger when occurring together. Most studies on such epistatic interactions have focused on methods for identifying the interactions and interpreting individual cases, but few have explored their general functional basis. This was due to the lack of a comprehensive list of epistatic interactions and uncertainties in associating variants to genes. Results: We conducted a large-scale survey of published research articles to compile the first comprehensive list of epistatic interactions in human diseases with detailed annotations. We used various methods to associate these variants to genes to ensure robustness. We found that these genes are significantly more connected in protein interaction networks, are more co-expressed and participate more often in the same pathways. We demonstrate using the list to discover novel disease pathways. Contact: kevinyip@cse.cuhk.edu.hk. Supplementary information: Supplementary data are available at Bioinformatics online.


Subject(s)
Disease Susceptibility , Epistasis, Genetic , Proteins/genetics , Humans , Proteins/analysis , Software
13.
J Proteome Res ; 17(9): 3195-3213, 2018 09 07.
Article in English | MEDLINE | ID: mdl-30084631

ABSTRACT

An in planta chemical cross-linking-based quantitative interactomics (IPQCX-MS) workflow has been developed to investigate in vivo protein-protein interactions and alteration in protein structures in a model organism, Arabidopsis thaliana. A chemical cross-linker, azide-tag-modified disuccinimidyl pimelate (AMDSP), was directly applied onto Arabidopsis tissues. Peptides produced from protein fractions of CsCl density gradient centrifugation were dimethyl-labeled, from which the AMDSP cross-linked peptides were fractionated on chromatography, enriched, and analyzed by mass spectrometry. ECL2 and SQUA-D software were used to identify and quantitate these cross-linked peptides, respectively. These computer programs integrate peptide identification with quantitation and statistical evaluation. This workflow eventually identified 354 unique cross-linked peptides, including 61 and 293 inter- and intraprotein cross-linked peptides, respectively, demonstrating that it is able to in vivo identify hundreds of cross-linked peptides at an organismal level by overcoming the difficulties caused by multiple cellular structures and complex secondary metabolites of plants. Coimmunoprecipitation and super-resolution microscopy studies have confirmed the PHB3-PHB6 protein interaction found by IPQCX-MS. The quantitative interactomics also found hormone-induced structural changes of SBPase and other proteins. This mass-spectrometry-based interactomics will be useful in the study of in vivo protein-protein interaction networks in agricultural crops and plant-microbe interactions.


Subject(s)
Arabidopsis/metabolism , Gene Expression Regulation, Plant , Protein Interaction Mapping/methods , Proteome/metabolism , Repressor Proteins/metabolism , Amino Acid Sequence , Arabidopsis/genetics , Arabidopsis Proteins , Chromatography, Liquid , Cross-Linking Reagents/chemistry , Models, Molecular , Peptides/analysis , Peptides/chemistry , Prohibitins , Protein Binding , Protein Isoforms/chemistry , Protein Isoforms/genetics , Protein Isoforms/metabolism , Protein Structure, Secondary , Proteolysis , Proteome/chemistry , Proteome/genetics , Repressor Proteins/chemistry , Repressor Proteins/genetics , Staining and Labeling/methods , Succinimides/chemistry , Tandem Mass Spectrometry
14.
Bioinformatics ; 33(4): 500-507, 2017 02 15.
Article in English | MEDLINE | ID: mdl-28011772

ABSTRACT

Motivation: In genome-wide association studies (GWASs) of common diseases/traits, we often analyze multiple GWASs with the same phenotype together to discover associated genetic variants with higher power. Since it is difficult to access data with detailed individual measurements, summary-statistics-based meta-analysis methods have become popular to jointly analyze datasets from multiple GWASs. Results: In this paper, we propose a novel summary-statistics-based joint analysis method based on controlling the joint local false discovery rate (Jlfdr). We prove that our method is the most powerful summary-statistics-based joint analysis method when controlling the false discovery rate at a certain level. In particular, the Jlfdr-based method achieves higher power than commonly used meta-analysis methods when analyzing heterogeneous datasets from multiple GWASs. Simulation experiments demonstrate the superior power of our method over meta-analysis methods. Also, our method discovers more associations than meta-analysis methods from empirical datasets of four phenotypes. Availability and Implementation: The R-package is available at: http://bioinformatics.ust.hk/Jlfdr.html . Contact: eeyu@ust.hk. Supplementary information: Supplementary data are available at Bioinformatics online.


Subject(s)
Genetic Diseases, Inborn/genetics , Genetic Variation , Genome-Wide Association Study/methods , Software , Computer Simulation , Humans , Meta-Analysis as Topic , Phenotype
15.
J Proteome Res ; 16(10): 3942-3952, 2017 10 06.
Article in English | MEDLINE | ID: mdl-28825304

ABSTRACT

Chemical cross-linking coupled to mass spectrometry is a powerful tool to study protein-protein interactions and protein conformations. Two linked peptides are ionized and fragmented to produce a tandem mass spectrum. In such an experiment, a tandem mass spectrum contains ions from two peptides. The peptide identification problem becomes a peptide-peptide pair identification problem. Currently, most tools do not search all possible pairs due to the quadratic time complexity. Consequently, missed findings are unavoidable. In our previous work, we developed a tool named ECL to search all pairs of peptides exhaustively. Unfortunately, it is very slow due to the quadratic computational complexity, especially when the database is large. Furthermore, ECL uses a score function without statistical calibration, while researchers [1-3] have proposed that it is inappropriate to directly compare uncalibrated scores because different spectra have different random score distributions. Here we propose an advanced version of ECL, named ECL2. It achieves a linear time and space complexity by taking advantage of the additive property of a score function. It can search a data set containing tens of thousands of spectra against a database containing thousands of proteins in a few hours. Comparison with five other state-of-the-art tools shows that ECL2 is much faster than pLink, StavroX, ProteinProspector, and ECL. Kojak is the only one that is faster than ECL2, but Kojak does not exhaustively search all possible peptide pairs. The comparison shows that ECL2 has the highest sensitivity among the state-of-the-art tools. The experiment using a large-scale in vivo cross-linking data set demonstrates that ECL2 is the only tool that can find the peptide-spectrum matches (PSMs) passing the false discovery rate/q-value threshold. The result illustrates that the exhaustive search and a well-calibrated score function are useful to find PSMs from a huge search space.


Subject(s)
Cross-Linking Reagents/chemistry , Peptides/chemistry , Proteins/chemistry , Proteomics , Algorithms , Databases, Protein , Humans , Protein Conformation , Protein Interaction Maps/genetics , Software , Tandem Mass Spectrometry
16.
Proteomics ; 16(13): 1915-27, 2016 07.
Article in English | MEDLINE | ID: mdl-27198063

ABSTRACT

Site-specific chemical cross-linking in combination with mass spectrometry analysis has emerged as a powerful proteomic approach for studying the three-dimensional structure of protein complexes and for mapping protein-protein interactions (PPIs). Building on the success of MS analysis of in vitro cross-linked proteins, which has been widely used to investigate specific interactions of bait proteins and their targets in various organisms, we report a workflow for in vivo chemical cross-linking and MS analysis in a multicellular eukaryote. This approach optimizes the in vivo protein cross-linking conditions in Arabidopsis thaliana, establishes a MudPIT procedure for the enrichment of cross-linked peptides, and develops an integrated software program, exhaustive cross-linked peptides identification tool (ECL), to identify the MS spectra of in planta chemical cross-linked peptides. In total, two pairs of in vivo cross-linked peptides of high confidence have been identified from two independent biological replicates. This work marks the beginning of an alternative proteomic approach in the study of in vivo protein tertiary structure and PPIs in multicellular eukaryotes.


Subject(s)
Arabidopsis Proteins/chemistry , Arabidopsis/metabolism , Cross-Linking Reagents/chemistry , Protein Interaction Mapping/methods , Proteomics/methods , Tandem Mass Spectrometry/methods , Amino Acid Sequence , Arabidopsis/chemistry , Arabidopsis Proteins/metabolism , Cross-Linking Reagents/metabolism , Models, Molecular , Peptides/analysis , Peptides/metabolism , Protein Conformation , Software
17.
BMC Bioinformatics ; 17(1): 217, 2016 May 20.
Article in English | MEDLINE | ID: mdl-27206479

ABSTRACT

BACKGROUND: Chemical cross-linking combined with mass spectrometry (CX-MS) is a high-throughput approach to studying protein-protein interactions. The number of peptide-peptide combinations grows quadratically with the number of proteins, resulting in high computational complexity. Widely used methods including xQuest (Rinner et al., Nat Methods 5(4):315-8, 2008; Walzthoeni et al., Nat Methods 9(9):901-3, 2012), pLink (Yang et al., Nat Methods 9(9):904-6, 2012), ProteinProspector (Chu et al., Mol Cell Proteomics 9:25-31, 2010; Trnka et al., 13(2):420-34, 2014) and Kojak (Hoopmann et al., J Proteome Res 14(5):2190-198, 2015) avoid searching all peptide-peptide combinations by pre-selecting peptides with heuristic approaches. However, pre-selection may cause true cross-linked peptides to be missed. The most intuitive approach is to search all possible candidates. A tool that can exhaustively search a whole database without any heuristic pre-selection procedure is therefore desirable. RESULTS: We have developed a cross-linked peptide identification tool named ECL. It can exhaustively search a whole database in a reasonable period of time without any heuristic pre-selection procedure. Tests showed that searching a database containing 5200 proteins took 7 h. ECL identified more non-redundant cross-linked peptides than xQuest, pLink, and ProteinProspector. Experiments showed that about 30% of these additionally identified peptides were not pre-selected by Kojak. We used protein crystal structures from the Protein Data Bank to check the intra-protein cross-linked peptides. Most of the distances between cross-linking sites were smaller than 30 Å. CONCLUSIONS: To the best of our knowledge, ECL is the first tool that can exhaustively search all candidates in cross-linked peptide identification. The experiments showed that ECL could identify more peptides than xQuest, pLink, and ProteinProspector. Further analysis indicated that some of the additional identifications were attributable to the exhaustive search.
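
The quadratic growth described in the abstract, and what searching every candidate pair involves, can be illustrated with a minimal sketch. This is not ECL's algorithm: the function names `exhaustive_pairs` and `n_pairs`, the toy masses, and the tolerance are hypothetical, and the mass relation used here (precursor mass is roughly peptide A plus peptide B plus linker) is a simplifying assumption.

```python
from bisect import bisect_left, bisect_right


def exhaustive_pairs(masses, linker_mass, precursor_mass, tol):
    """Return every index pair (i, j), i <= j, whose summed peptide masses plus
    the linker mass lie within tol of the precursor mass.

    No candidate is discarded heuristically; sorting only speeds up the lookup.
    """
    order = sorted(range(len(masses)), key=lambda k: masses[k])
    sorted_masses = [masses[k] for k in order]
    hits = []
    for a, mass_a in enumerate(sorted_masses):
        # Mass the partner peptide must have (within tol).
        target = precursor_mass - linker_mass - mass_a
        # Start at position a so each unordered pair is reported once.
        lo = max(bisect_left(sorted_masses, target - tol), a)
        hi = bisect_right(sorted_masses, target + tol)
        for b in range(lo, hi):
            i, j = sorted((order[a], order[b]))
            hits.append((i, j))
    return sorted(hits)


def n_pairs(n_peptides):
    """Number of unordered peptide pairs (self-pairs included): quadratic in n."""
    return n_peptides * (n_peptides + 1) // 2


if __name__ == "__main__":
    toy_masses = [500.0, 700.0, 900.0, 1100.0]  # hypothetical values
    print(exhaustive_pairs(toy_masses, 138.0, 1738.0, 0.5))
    print(n_pairs(1000))
```

In this toy model, 1000 peptides already give 500500 candidate pairs per spectrum, which is why heuristic pre-selection is tempting and why an exhaustive tool has to organize the lookup carefully.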


Subject(s)
Cross-Linking Reagents/chemistry , Databases, Protein , Peptides/chemistry , Search Engine , Humans
18.
J Proteome Res ; 15(12): 4423-4435, 2016 12 02.
Article in English | MEDLINE | ID: mdl-27748123

ABSTRACT

In computational proteomics, the identification of peptides with an unlimited number of post-translational modification (PTM) types is a challenging task. The computational cost associated with database search increases exponentially with respect to the number of modified amino acids and linearly with respect to the number of potential PTM types at each amino acid. The problem becomes intractable very quickly if we want to enumerate all possible PTM patterns. To address this issue, one group of methods, named restricted tools (including Mascot, Comet, and MS-GF+), allows only a small number of PTM types in the database search process. Alternatively, the other group of methods, named unrestricted tools (including MS-Alignment, ProteinProspector, and MODa), avoids enumerating PTM patterns by using an alignment-based approach to localize and characterize modified amino acids. However, because of the large search space and the PTM localization issue, the sensitivity of these unrestricted tools is low. This paper proposes a novel method named PIPI to achieve PTM-invariant peptide identification. PIPI belongs to the category of unrestricted tools. It first codes peptide sequences into Boolean vectors and codes experimental spectra into real-valued vectors. For each coded spectrum, it then searches the coded sequence database to find the top-scoring peptide sequences as candidates. After that, PIPI uses dynamic programming to localize and characterize modified amino acids in each candidate. We used simulation experiments and real data experiments to evaluate the performance in comparison with restricted tools (i.e., Mascot, Comet, and MS-GF+) and unrestricted tools (i.e., Mascot with error tolerant search, MS-Alignment, ProteinProspector, and MODa). Comparison with restricted tools shows that PIPI has comparable sensitivity and running speed. Comparison with unrestricted tools shows that PIPI has the highest sensitivity except for Mascot with error tolerant search and ProteinProspector. These two tools simplify the task by considering at most one modified amino acid in each peptide, which results in higher sensitivity but makes it difficult to handle peptides with multiple modified amino acids. The simulation experiments also show that PIPI has the lowest false discovery proportion, the highest PTM characterization accuracy, and the shortest running time among the unrestricted tools.
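
The coding step described in the abstract (Boolean vectors for sequences, real-valued vectors for spectra, then a ranked search) can be sketched as below. The abstract does not say which features the vectors encode, so the two-residue tag features, the dot-product score, and every name here (`TAG_INDEX`, `code_peptide`, `code_spectrum`, `top_candidates`) are assumptions made for illustration only, not PIPI's actual design.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
# Hypothetical feature space: every 2-residue tag, indexed in a fixed order.
TAG_INDEX = {"".join(p): k for k, p in enumerate(product(AMINO_ACIDS, repeat=2))}


def code_peptide(sequence):
    """Boolean vector: True at each tag that occurs in the sequence."""
    vec = [False] * len(TAG_INDEX)
    for i in range(len(sequence) - 1):
        vec[TAG_INDEX[sequence[i:i + 2]]] = True
    return vec


def code_spectrum(tag_scores):
    """Real-valued vector built from a {tag: score} mapping.

    How tags and scores would be derived from spectrum peaks is outside
    this sketch.
    """
    vec = [0.0] * len(TAG_INDEX)
    for tag, score in tag_scores.items():
        vec[TAG_INDEX[tag]] = score
    return vec


def top_candidates(spectrum_vec, database, n=2):
    """Rank database peptides by the dot product of the two codes."""
    scored = []
    for peptide in database:
        peptide_vec = code_peptide(peptide)
        score = sum(s for s, present in zip(spectrum_vec, peptide_vec) if present)
        scored.append((score, peptide))
    # Highest score first; ties broken alphabetically for determinism.
    scored.sort(key=lambda item: (-item[0], item[1]))
    return [peptide for _, peptide in scored[:n]]


if __name__ == "__main__":
    spectrum = code_spectrum({"PE": 0.9, "EP": 0.8, "TI": 0.7, "LE": 0.2})
    print(top_candidates(spectrum, ["PEPTIDE", "SAMPLER", "ELVISLIVES"]))
```

Under these assumptions, a modification at one residue would disturb only the tags that span it, so tags from unmodified stretches could still bring the correct sequence into the candidate list before a localization step is run.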


Subject(s)
Computational Biology/methods , Protein Processing, Post-Translational , Proteomics/methods , Algorithms , Amino Acid Sequence , Animals , Computational Biology/standards , Computer Simulation , Databases, Protein , Humans , Software/standards
19.
BMC Genomics ; 17 Suppl 1: 3, 2016 Jan 11.
Article in English | MEDLINE | ID: mdl-26818952

ABSTRACT

BACKGROUND: A replication study is a commonly used verification method to filter out false positives in genome-wide association studies (GWAS). If an association can be confirmed in a replication study, it can be regarded as a true positive with high confidence. To design a replication study, traditional approaches calculate power by treating the replication study as another independent primary study. These approaches do not use the information given by the primary study. In addition, they need to specify a minimum detectable effect size, which may be subjective. One might consider replacing the minimum effect size with the observed effect sizes in the power calculation. However, this approach will make the designed replication study underpowered, since we are only interested in the positive associations from the primary study and the problem of the "winner's curse" will occur. RESULTS: An Empirical Bayes (EB) based method is proposed to estimate the power of a replication study for each association. The corresponding credible interval is also estimated in the proposed approach. Simulation experiments show that our method is better than other plug-in based estimators in terms of overcoming the winner's curse and providing higher estimation accuracy. The coverage probability of the given credible interval is well-calibrated in the simulation experiments. A weighted average method is used to estimate the average power of all underlying true associations. This is used to determine the sample size of the replication study. Sample sizes are estimated on 6 diseases from the Wellcome Trust Case Control Consortium (WTCCC) using our method. They are higher than the sample sizes estimated by plugging observed effect sizes into the power calculation. CONCLUSIONS: Our new method can objectively determine a replication study's sample size by using information extracted from the primary study. The winner's curse is also alleviated. Thus, it is a better choice when designing replication studies of GWAS. The R package is available at http://bioinformatics.ust.hk/RPower.html.
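
The winner's curse mentioned in the abstract, and why plugging observed effect sizes into a power calculation is over-optimistic, can be shown with a small simulation. This sketch is not the Empirical Bayes method and is unrelated to the RPower package's interface: the one-sided z-test model, the function names, and all parameter values are assumptions chosen for illustration.

```python
import random
from statistics import NormalDist, mean


def replication_power(z_effect, n_ratio, alpha=0.05):
    """Power of a one-sided z-test in a replication study.

    z_effect is the expected z-score at the primary study's sample size;
    n_ratio is the replication sample size divided by the primary one.
    """
    nd = NormalDist()
    z_crit = nd.inv_cdf(1 - alpha)
    return 1 - nd.cdf(z_crit - z_effect * n_ratio ** 0.5)


def mean_selected_z(true_z=2.0, z_threshold=4.0, n_sim=100_000, seed=0):
    """Mean observed z-score among simulated associations that pass the
    primary-study significance threshold (the 'winners')."""
    rng = random.Random(seed)
    observed = (rng.gauss(true_z, 1.0) for _ in range(n_sim))
    winners = [z for z in observed if z > z_threshold]
    return mean(winners)


if __name__ == "__main__":
    true_z = 2.0
    observed_z = mean_selected_z(true_z=true_z)
    print("mean observed z among winners:", observed_z)
    print("plug-in power:", replication_power(observed_z, n_ratio=1.0))
    print("power at the true effect:", replication_power(true_z, n_ratio=1.0))
```

Because only associations that exceed the threshold are carried forward, their observed z-scores are inflated relative to the true effect, so the plug-in power is too high and a sample size chosen from it would be too small.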


Subject(s)
Genome, Human , Genome-Wide Association Study , Algorithms , Alleles , Arthritis, Rheumatoid/genetics , Bayes Theorem , Crohn Disease/genetics , Diabetes Mellitus/genetics , Genetic Predisposition to Disease , Humans , Polymorphism, Single Nucleotide , Vascular Diseases/genetics
20.
Bioinformatics ; 31(9): 1460-2, 2015 May 01.
Article in English | MEDLINE | ID: mdl-25535244

ABSTRACT

MOTIVATION: The importance of testing associations allowing for interactions has been demonstrated by Marchini et al. (2005). A fast method for detecting associations allowing for interactions has been proposed by Wan et al. (2010a). The method is based on the likelihood ratio test with the assumption that the statistic follows the χ² distribution. Many single nucleotide polymorphism (SNP) pairs with significant associations allowing for interactions have been detected using their method. However, the χ² test assumes that the expected value in each cell of the contingency table is at least five. This assumption is violated in some identified SNP pairs. In this case, the likelihood ratio test may no longer be applicable. The permutation test is an ideal approach to checking the P-values calculated in the likelihood ratio test because of its non-parametric nature. The P-values of SNP pairs having significant associations with disease are always extremely small. Thus, we need a huge number of permutations to achieve correspondingly high resolution for the P-values. In order to investigate whether the P-values from likelihood ratio tests are reliable, a fast permutation tool that can carry out a large number of permutations is desirable. RESULTS: We developed a permutation tool named PBOOST. It runs on GPUs and provides highly reliable P-value estimation. Using simulation data, we found that the P-values from likelihood ratio tests have a relative error of >100% when 50% of the cells in the contingency table have an expected count less than five or when there is a zero expected count in any of the contingency table cells. In terms of speed, PBOOST completed 10⁷ permutations for a single SNP pair from the Wellcome Trust Case Control Consortium (WTCCC) genome data (Wellcome Trust Case Control Consortium, 2007) within 1 min on a single Nvidia Tesla M2090 device, while the same task took 60 min on a single Intel Xeon E5-2650 CPU. More importantly, when simultaneously testing 256 SNP pairs for 10⁷ permutations, our tool took only 5 min, while the CPU program took 10 h. By permuting on a GPU cluster consisting of 40 nodes, we completed 10¹² permutations for all 280 SNP pairs reported with P-values smaller than 1.6 × 10⁻¹² in the WTCCC datasets in 1 week. AVAILABILITY AND IMPLEMENTATION: The source code and sample data are available at http://bioinformatics.ust.hk/PBOOST.zip. CONTACT: gyang@ust.hk; eeyu@ust.hk SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
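
The permutation check of a likelihood ratio P-value described in the abstract can be sketched on the CPU as follows. This is not PBOOST's GPU implementation: the function names, the integer coding of genotype combinations, the small tie tolerance, and the toy data are all assumptions for illustration.

```python
import math
import random


def g_statistic(genotypes, labels):
    """Likelihood ratio (G) statistic for a contingency table of
    genotype combination versus case/control label."""
    n = len(labels)
    cell, row, col = {}, {}, {}
    for g, y in zip(genotypes, labels):
        cell[(g, y)] = cell.get((g, y), 0) + 1
        row[g] = row.get(g, 0) + 1
        col[y] = col.get(y, 0) + 1
    g_stat = 0.0
    # Empty cells contribute zero, so only observed cells are visited.
    for (g, y), observed in cell.items():
        expected = row[g] * col[y] / n
        g_stat += 2.0 * observed * math.log(observed / expected)
    return g_stat


def permutation_p_value(genotypes, labels, n_perm=2000, seed=0):
    """Estimate the P-value by shuffling case/control labels.

    The smallest value this can return is 1 / (n_perm + 1).
    """
    rng = random.Random(seed)
    observed = g_statistic(genotypes, labels)
    shuffled = list(labels)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        # Small tolerance so floating-point ties count as "at least as extreme".
        if g_statistic(genotypes, shuffled) >= observed - 1e-9:
            hits += 1
    return (hits + 1) / (n_perm + 1)


if __name__ == "__main__":
    # Toy data: two genotype combinations, strongly associated with status.
    genotypes = [0] * 20 + [1] * 20
    labels = [1] * 18 + [0] * 2 + [0] * 18 + [1] * 2
    print(permutation_p_value(genotypes, labels, n_perm=500))
```

The resolution limit of 1 / (n_perm + 1) is why confirming P-values near 10⁻¹² calls for on the order of 10¹² permutations, and hence for the kind of hardware acceleration the abstract reports.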


Subject(s)
Genome-Wide Association Study/methods , Polymorphism, Single Nucleotide , Software , Humans , Likelihood Functions