1.
J Proteome Res ; 23(6): 1983-1999, 2024 Jun 07.
Article in English | MEDLINE | ID: mdl-38728051

ABSTRACT

In recent years, several deep learning-based methods have been proposed for predicting peptide fragment intensities. This study aims to provide a comprehensive assessment of six such methods, namely Prosit, DeepMass:Prism, pDeep3, AlphaPeptDeep, Prosit Transformer, and the method proposed by Guan et al. To this end, we evaluated the accuracy of the predicted intensity profiles for close to 1.7 million precursors (including both tryptic and HLA peptides) corresponding to more than 18 million experimental spectra procured from 40 independent submissions to the PRIDE repository that were acquired for different species using a variety of instruments and different dissociation types/energies. Specifically, for each method, distributions of similarity (measured by Pearson's correlation and normalized angle) between the predicted and the corresponding experimental b and y fragment intensities were generated. These distributions were used to ascertain the prediction accuracy and rank the prediction methods for particular types of experimental conditions. The effect of variables like precursor charge, length, and collision energy on the prediction accuracy was also investigated. In addition to prediction accuracy, the methods were evaluated in terms of prediction speed. The systematic assessment of these six methods may help in choosing the right method for MS/MS spectra prediction for particular needs.


Subject(s)
Deep Learning , Humans , Peptide Fragments/chemistry , Peptide Fragments/analysis , Tandem Mass Spectrometry/methods , Tandem Mass Spectrometry/statistics & numerical data , Proteomics/methods , Proteomics/statistics & numerical data
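
The two similarity measures named in the abstract above can be sketched in a few lines. This is a minimal illustration, not the authors' evaluation code: the function names are hypothetical, and the normalized angle is assumed to follow the definition commonly used for fragment-intensity prediction, 1 - 2*arccos(cos(theta))/pi, which the abstract itself does not spell out.

```python
import math


def pearson_correlation(predicted, observed):
    """Pearson's r between two equal-length intensity vectors.

    Both vectors must be non-constant, otherwise the denominator is zero.
    """
    n = len(predicted)
    mean_p = sum(predicted) / n
    mean_o = sum(observed) / n
    cov = sum((p - mean_p) * (o - mean_o) for p, o in zip(predicted, observed))
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    var_o = sum((o - mean_o) ** 2 for o in observed)
    return cov / math.sqrt(var_p * var_o)


def normalized_spectral_angle(predicted, observed):
    """Assumed definition: 1 - 2*arccos(cosine similarity)/pi.

    For non-negative intensities the result lies in [0, 1];
    1 means identical direction, 0 means orthogonal vectors.
    """
    dot = sum(p * o for p, o in zip(predicted, observed))
    norm_p = math.sqrt(sum(p * p for p in predicted))
    norm_o = math.sqrt(sum(o * o for o in observed))
    # Clamp to guard against rounding slightly outside [-1, 1].
    cosine = max(-1.0, min(1.0, dot / (norm_p * norm_o)))
    return 1.0 - 2.0 * math.acos(cosine) / math.pi
```

Applying both functions to every precursor's predicted and experimental b/y intensities yields the kind of per-method similarity distribution the abstract describes.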
2.
J Proteome Res ; 20(3): 1476-1487, 2021 03 05.
Article in English | MEDLINE | ID: mdl-33573382

ABSTRACT

Simple light isotope metabolic labeling (SLIM labeling) is an innovative method to quantify variations in the proteome based on an original in vivo labeling strategy. Heterotrophic cells grown in U-[12C] as the sole source of carbon synthesize U-[12C]-amino acids, which are incorporated into proteins, giving rise to U-[12C]-proteins. This results in a large increase in the intensity of the monoisotope ion of peptides and proteins, thus allowing higher identification scores and protein sequence coverage in mass spectrometry experiments. This method, initially developed for signal processing and quantification of the incorporation rate of 12C into peptides, was based on a multistep process that was difficult to implement for many laboratories. To overcome these limitations, we developed a new theoretical background to analyze bottom-up proteomics data using SLIM-labeling (bSLIM) and established simple procedures based on open-source software, using dedicated OpenMS modules, and embedded R scripts to process the bSLIM experimental data. These new tools allow computation of both the 12C abundance in peptides to follow the kinetics of protein labeling and the molar fraction of unlabeled and 12C-labeled peptides in multiplexing experiments to determine the relative abundance of proteins extracted under different biological conditions. They also make it possible to consider incomplete 12C labeling, such as that observed in cells with nutritional requirements for nonlabeled amino acids. These tools were validated on an experimental dataset produced using various yeast strains of Saccharomyces cerevisiae and growth conditions. The workflows are built on the implementation of appropriate calculation modules in a KNIME working environment. These new integrated tools provide a convenient framework for the wider use of the SLIM-labeling strategy.


Subject(s)
Proteome , Proteomics , Amino Acid Sequence , Isotope Labeling , Mass Spectrometry
3.
Eur Phys J E Soft Matter ; 44(10): 129, 2021 Oct 18.
Article in English | MEDLINE | ID: mdl-34661792

ABSTRACT

Electrostatic interactions among colloidal particles are often described using the venerable (two-particle) Derjaguin-Landau-Verwey-Overbeek (DLVO) approximation and its various modifications. However, until the recent development of a many-body theory exact at the Debye-Hückel level (Yu in Phys Rev E 102:052404, 2020), it was difficult to assess the errors of such approximations and impossible to assess the role of many-body effects. By applying the exact Debye-Hückel level theory, we quantify the errors inherent to DLVO and the additional errors associated with replacing many-particle interactions by the sum of pairwise interactions (even when the latter are calculated exactly). In particular, we show that: (1) the DLVO approximation does not provide sufficient accuracy at shorter distances, especially when there is an asymmetry in charges and/or sizes of interacting dielectric spheres; (2) the pairwise approximation leads to significant errors at shorter distances and at large and moderate Debye lengths and also gets worse with increasing asymmetry in the size of the spheres or magnitude or placement of the charges. We also demonstrate that asymmetric dielectric screening, i.e., the enhanced repulsion between charged dielectric bodies immersed in media with high dielectric constant, is preserved in the presence of free ions in the medium.


Subject(s)
Models, Chemical , Ions , Static Electricity
4.
Proteomics ; 19(14): e1800367, 2019 07.
Article in English | MEDLINE | ID: mdl-30908818

ABSTRACT

Mass spectrometry-based proteomics starts with identifications of peptides and proteins, which provide the bases for forming the next-level hypotheses whose "validations" are often employed for forming even higher level hypotheses and so forth. Scientifically meaningful conclusions are thus attainable only if the number of falsely identified peptides/proteins is accurately controlled. For this reason, RAId has continued to be developed over the past decade. RAId employs rigorous statistics for peptide/protein identification, hence assigning accurate P-values/E-values that can be used confidently to control the number of falsely identified peptides and proteins. The RAId web service is a versatile tool built to identify peptides and proteins from tandem mass spectrometry data. Besides recognizing various spectrum file formats, the web service offers four peptide scoring functions and a choice of three statistical methods for assigning P-values/E-values to identified peptides. Users may upload their own protein database or use one of the available knowledge-integrated organismal databases that contain annotated information such as single amino acid polymorphisms, post-translational modifications, and their disease associations. The web service also provides a friendly interface to display, sort using different criteria, and download the identified peptides and proteins. The RAId web service is freely available at https://www.ncbi.nlm.nih.gov/CBBresearch/Yu/raid.


Subject(s)
Databases, Protein , Mass Spectrometry/methods , Proteomics/methods , Computational Biology
5.
Phys Rev Lett ; 121(18): 185505, 2018 Nov 02.
Article in English | MEDLINE | ID: mdl-30444387

ABSTRACT

Thermal expansion of H₂O and D₂O ice Ih with relative resolution of 1 ppb is reported. A large transition in the thermal expansion coefficient at 101 K in H₂O moves to 125 K in D₂O, revealing one of the largest-known isotope effects. Rotational oscillatory modes that couple poorly to phonons, i.e., lattice solitons, may be responsible.

6.
Bioinformatics ; 32(17): 2642-9, 2016 09 01.
Article in English | MEDLINE | ID: mdl-27153659

ABSTRACT

MOTIVATION: There is a growing trend for biomedical researchers to extract evidence and draw conclusions from mass spectrometry-based proteomics experiments, the cornerstone of which is peptide identification. Inaccurate assignments of peptide identification confidence thus may have far-reaching and adverse consequences. Although some peptide identification methods report accurate statistics, they have been limited to certain types of scoring function. The extreme value statistics-based method, while more general in the scoring functions it allows, demands accurate parameter estimates and requires, at least in its original design, excessive computational resources. Improving the parameter estimate accuracy and reducing the computational cost for this method has two advantages: it provides another feasible route to accurate significance assessment, and it could provide reliable statistics for scoring functions yet to be developed. RESULTS: We have formulated and implemented an efficient algorithm for calculating the extreme value statistics for peptide identification applicable to various scoring functions, bypassing the need for searching large random databases. AVAILABILITY AND IMPLEMENTATION: The source code, implemented in C++ on a Linux system, is available for download at ftp://ftp.ncbi.nlm.nih.gov/pub/qmbp/qmbp_ms/RAId/RAId_Linux_64Bit. CONTACT: yyu@ncbi.nlm.nih.gov SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.


Subject(s)
Algorithms , Mass Spectrometry , Peptides , Proteomics , Databases, Protein , Humans , Tandem Mass Spectrometry
7.
Bioinformatics ; 31(5): 699-706, 2015 Mar 01.
Article in English | MEDLINE | ID: mdl-25362092

ABSTRACT

MOTIVATION: Assigning statistical significance accurately has become increasingly important as metadata of many types, often assembled in hierarchies, are constructed and combined for further biological analyses. Statistical inaccuracy of metadata at any level may propagate to downstream analyses, undermining the validity of scientific conclusions thus drawn. From the perspective of mass spectrometry-based proteomics, even though accurate statistics for peptide identification can now be achieved, accurate protein-level statistics remain challenging. RESULTS: We have constructed a protein ID method that combines the peptide evidence for a candidate protein based on a rigorous formula derived earlier; in this formula the database P-value of every peptide is weighted, prior to the final combination, according to the number of proteins it maps to. We have also shown that this protein ID method provides accurate protein-level E-values, eliminating the need for empirical post-processing methods for type-I error control. Using a known protein mixture, we find that this protein ID method, when combined with the Soric formula, yields accurate values for the proportion of false discoveries. In terms of retrieval efficacy, the results from our method are comparable with other methods tested. AVAILABILITY AND IMPLEMENTATION: The source code, implemented in C++ on a Linux system, is available for download at ftp://ftp.ncbi.nlm.nih.gov/pub/qmbp/qmbp_ms/RAId/RAId_Linux_64Bit.


Subject(s)
Algorithms , Databases, Protein , Mass Spectrometry/methods , Models, Statistical , Peptide Fragments/analysis , Proteins/analysis , Proteomics/methods , Humans , Proteins/metabolism
8.
Bioinformatics ; 31(3): 324-31, 2015 Feb 01.
Article in English | MEDLINE | ID: mdl-25294922

ABSTRACT

MOTIVATION: DNA and protein patterns are usefully represented by sequence logos. However, the methods for logo generation in common use lack a proper statistical basis, and are non-optimal for recognizing functionally relevant alignment columns. RESULTS: We redefine the information at a logo position as a per-observation multiple alignment log-odds score. Such scores are positive or negative, depending on whether a column's observations are better explained as arising from relatedness or chance. Within this framework, we propose distinct normalized maximum likelihood and Bayesian measures of column information. We illustrate these measures on High Mobility Group B (HMGB) box proteins and a dataset of enzyme alignments. Particularly in the context of protein alignments, our measures improve the discrimination of biologically relevant positions. AVAILABILITY AND IMPLEMENTATION: Our new measures are implemented in an open-source Web-based logo generation program, which is available at http://www.ncbi.nlm.nih.gov/CBBresearch/Yu/logoddslogo/index.html. A stand-alone version of the program is also available from this site. CONTACT: altschul@ncbi.nlm.nih.gov SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.


Subject(s)
Bayes Theorem , Position-Specific Scoring Matrices , Sequence Alignment/methods , Sequence Analysis, DNA/methods , Sequence Analysis, Protein/methods , Software , Amino Acid Sequence , Humans , Molecular Sequence Annotation , Molecular Sequence Data , Sequence Homology, Amino Acid
9.
Am J Phys ; 82(5): 460-465, 2014 May 01.
Article in English | MEDLINE | ID: mdl-25125701

ABSTRACT

The problem of electrostatics in biomolecular systems presents an excellent opportunity for cross-disciplinary science and a context in which fundamental physics is called for to answer complex questions. Due to the large density in biological cells of charged biomacromolecules such as protein factors and DNA, it is challenging to understand quantitatively the electric forces in these systems. Two questions are especially puzzling. First, how is it that such a dense system of charged molecules does not simply aggregate in random and non-functional ways? Second, since some mechanism apparently prevents such aggregation, how is it that binding of biomolecules still occurs so reliably? Recognizing the role of water as a universal solvent in living systems is key to understanding these questions. We present a simplified physical model in which water is regarded as a medium of high dielectric constant that nevertheless exhibits the key features essential for answering the two questions presented. The answer to the first question lies in the strong screening ability of water, which reduces the energy scale of the electrostatic interactions. Furthermore, our model reveals the existence of asymmetric screening, a pronounced asymmetry between the screening for a system with like charges and that for a system with opposite charges, and this provides an answer to the second question.

10.
J Am Soc Mass Spectrom ; 35(6): 1138-1155, 2024 Jun 05.
Article in English | MEDLINE | ID: mdl-38740383

ABSTRACT

Having fast, accurate, and broad spectrum methods for the identification of microorganisms is of paramount importance to public health, research, and safety. Bottom-up mass spectrometer-based proteomics has emerged as an effective tool for the accurate identification of microorganisms from microbial isolates. However, one major hurdle that limits the deployment of this tool for routine clinical diagnosis, and other areas of research such as culturomics, is the instrument time required for the mass spectrometer to analyze a single sample, which can take ∼1 h per sample, when using mass spectrometers that are presently used in most institutes. To address this issue, in this study, we employed, for the first time, tandem mass tags (TMTs) in multiplex identifications of microorganisms from multiple TMT-labeled samples in one MS/MS experiment. A difficulty encountered when using TMT labeling is the presence of interference in the measured intensities of TMT reporter ions. To correct for interference, we employed in the proposed method a modified version of the expectation maximization (EM) algorithm that redistributes the signal from ion interference back to the correct TMT-labeled samples. We have evaluated the sensitivity and specificity of the proposed method using 94 MS/MS experiments (covering a broad range of protein concentration ratios across TMT-labeled channels and experimental parameters), containing a total of 1931 true positive TMT-labeled channels and 317 true negative TMT-labeled channels. The results of the evaluation show that the proposed method has an identification sensitivity of 93-97% and a specificity of 100% at the species level. 
Furthermore, as a proof of concept, using an in-house-generated data set composed of some of the most common urinary tract pathogens, we demonstrated that by using the proposed method the mass spectrometer time required per sample, using a 1 h LC-MS/MS run, can be reduced to 10 and 6 min when samples are labeled with TMT-6 and TMT-10, respectively. The proposed method can also be used along with Orbitrap mass spectrometers that have faster MS/MS acquisition rates, like the recently released Orbitrap Astral mass spectrometer, to further reduce the mass spectrometer time required per sample.


Subject(s)
Algorithms , Proteomics , Tandem Mass Spectrometry , Tandem Mass Spectrometry/methods , Proteomics/methods , Humans , Bacteria/isolation & purification , Bacteria/chemistry , Bacterial Proteins/analysis , Bacterial Proteins/chemistry , Bacterial Proteins/isolation & purification
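
The sensitivity and specificity figures quoted in the abstract above follow from the standard confusion-matrix definitions, sketched below. The function names are hypothetical, and the abstract does not report false-negative or false-positive counts directly; the only value reproduced here is the 100% specificity, which under these definitions corresponds to zero false positives among the 317 true negative channels.

```python
def sensitivity(true_positives, false_negatives):
    """Fraction of truly present organisms that were identified."""
    return true_positives / (true_positives + false_negatives)


def specificity(true_negatives, false_positives):
    """Fraction of empty (negative) channels correctly left unidentified."""
    return true_negatives / (true_negatives + false_positives)


# 317 negative channels and no false positives give a specificity of 1.0.
channel_specificity = specificity(true_negatives=317, false_positives=0)
```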
11.
J Comput Biol ; 31(2): 175-178, 2024 02.
Article in English | MEDLINE | ID: mdl-38301204

ABSTRACT

Although many user-friendly workflows exist for identifications of peptides and proteins in mass-spectrometry-based proteomics, there is a need for easy-to-use, fast, and accurate workflows for identifications of microorganisms, antimicrobial-resistant proteins, and biomass estimation. Identification of microorganisms is a computationally demanding task that requires querying thousands of MS/MS spectra in a database containing thousands to tens of thousands of microorganisms. Existing software cannot handle such a task in a time-efficient manner, taking hours to process a single MS/MS experiment. Another paramount factor to consider is the necessity of accurate statistical significance to properly control the proportion of false discoveries among the identified microorganisms and antimicrobial-resistant proteins, and to provide robust biomass estimation. Recently, we have developed the Microorganism Classification and Identification (MiCId) workflow, which assigns accurate statistical significance to identified microorganisms, antimicrobial-resistant proteins, and biomass estimation. MiCId's workflow is also computationally efficient, taking about 6-17 minutes to process a tandem mass-spectrometry (MS/MS) experiment using computer resources that are available in most laptop and desktop computers, making it a portable workflow. To make data analysis accessible to a broader range of users, beyond users familiar with the Linux environment, we have developed a graphical user interface (GUI) for MiCId's workflow. The GUI brings to users all the functionality of MiCId's workflow in a friendly interface along with tools for data analysis, visualization, and export of results.


Subject(s)
Anti-Infective Agents , Tandem Mass Spectrometry , Tandem Mass Spectrometry/methods , Workflow , Software , Proteins
12.
J Proteome Res ; 12(6): 2571-81, 2013 Jun 07.
Article in English | MEDLINE | ID: mdl-23668635

ABSTRACT

Because of its high specificity, trypsin is the enzyme of choice in shotgun proteomics. Nonetheless, several publications do report the identification of semitryptic and nontryptic peptides. Many of these peptides are thought to be signaling peptides or to have formed during sample preparation. It is known that only a small fraction of tandem mass spectra from a trypsin-digested protein mixture can be confidently matched to tryptic peptides. If other possibilities such as post-translational modifications and single-amino acid polymorphisms are ignored, this suggests that many unidentified spectra originate from semitryptic and nontryptic peptides. To include them in database searches, however, may not improve overall peptide identification because of the possible sensitivity reduction from search space expansion. To circumvent this issue for E-value-based search methods, we have designed a scheme that categorizes qualified peptides (i.e., peptides whose differences in molecular weight from the parent ion are within a specified error tolerance) into three tiers: tryptic, semitryptic, and nontryptic. This classification allows peptides that belong to different tiers to have different Bonferroni correction factors. Our results show that this scheme can significantly improve retrieval performance compared to those of search strategies that assign equal Bonferroni correction factors to all qualified peptides.


Subject(s)
Algorithms , Models, Statistical , Molecular Sequence Annotation/statistics & numerical data , Peptide Fragments/isolation & purification , Sequence Analysis, Protein/statistics & numerical data , Animals , Humans , Proteolysis , Proteomics , Sensitivity and Specificity , Tandem Mass Spectrometry , Trypsin/chemistry
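
The tiered correction described in the abstract above can be illustrated with a weighted Bonferroni scheme. This is a sketch under stated assumptions, not the paper's actual scheme: the abstract does not give the correction factors, so here each tier is assumed to receive a share of the error budget (`tier_weights`, summing to 1) and the factor for a peptide is its tier's qualified-peptide count divided by that share. All names, counts, and weights are hypothetical.

```python
def flat_e_value(p_value, tier_counts):
    """Baseline: every qualified peptide gets the same Bonferroni factor."""
    return p_value * sum(tier_counts.values())


def tiered_e_value(p_value, tier, tier_counts, tier_weights):
    """Weighted Bonferroni: factor = (peptides in tier) / (tier's budget share).

    With weights summing to 1, the expected number of false hits at a given
    E-value cutoff matches the flat scheme, but the budget is spent unevenly.
    """
    if abs(sum(tier_weights.values()) - 1.0) > 1e-9:
        raise ValueError("tier weights must sum to 1")
    return p_value * tier_counts[tier] / tier_weights[tier]


# Hypothetical numbers of qualified peptides per tier for one spectrum.
counts = {"tryptic": 100, "semitryptic": 2000, "nontryptic": 20000}
# Hypothetical budget shares favouring tryptic peptides.
weights = {"tryptic": 0.5, "semitryptic": 0.3, "nontryptic": 0.2}

flat = flat_e_value(1e-4, counts)
tiered = tiered_e_value(1e-4, "tryptic", counts, weights)
```

With these toy numbers a tryptic hit receives a much smaller E-value under the tiered scheme than under the flat one, which is the mechanism by which search-space expansion need not cost sensitivity for tryptic peptides.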
13.
Bioinformatics ; 28(6): 893-4, 2012 Mar 15.
Article in English | MEDLINE | ID: mdl-22345616

ABSTRACT

CytoSaddleSum provides Cytoscape users with access to the functionality of SaddleSum, a functional enrichment tool based on sum-of-weight scores. It operates by querying SaddleSum locally (using the standalone version) or remotely (through an HTTP request to a web server). The functional enrichment results are shown as a term relationship network, where nodes represent terms and edges show term relationships. Furthermore, query results are written as Cytoscape attributes allowing easy saving, retrieval and integration into network-based data analysis workflows.


Subject(s)
Genes , Software , Gene Deletion , National Library of Medicine (U.S.) , United States
14.
Rapid Commun Mass Spectrom ; 27(1): 152-6, 2013 Jan 15.
Article in English | MEDLINE | ID: mdl-23239328

ABSTRACT

RATIONALE: The hypothesis that dissociation energies can serve as a predictor of observability of b- and y-peaks is tested for seven hexapeptides. If the hypothesis holds true for large classes of peptides, one would be able to improve the scoring accuracy of peptide identification tools by excluding theoretical peaks that cannot be observed in practical product ion spectra due to various physical, chemical or thermodynamic considerations. METHODS: Product ion m/z spectra of hexapeptides AAAAAA, AAAFAA, AAAVAA, AAFAAA, AAVAAA, AAFFAA and AAVVAA have been acquired on a Finnigan LTQ XL mass spectrometer in the collision-induced dissociation (CID) activation mode on a grid of activation times 0.05 to 100 ms and normalized collision energy 10 to 35%. Dissociation energies were calculated for all fragmentation channels leading to b- and y-fragments at the TPSS/6-31G(d,p) level of the density functional theory. RESULTS: It was demonstrated that the m/z peaks observed in the product ion spectra correspond to the fragmentation channels with dissociation energies below a certain threshold value. However, there is no direct correlation between the most intense m/z peaks and the lowest dissociation energies. Using the dissociation energies, it was predicted that out of 63 theoretically possible peaks in the b- and y-series of the seven hexapeptides, 19 should not be observable in practical spectra. In the experiments, 24 peaks were not observed, including all 19 predicted. CONCLUSIONS: Dissociation energies alone are not sufficient for predicting ion intensity relationships in product ion m/z spectra. Nevertheless, the present data suggest that dissociation energies appear to be good predictors of observability of b- and y-peaks and potentially very useful for filtering theoretical peaks of each candidate peptide in peptide identification tools. Published 2012. This article is a US Government work and is in the public domain in the USA.


Subject(s)
Mass Spectrometry/methods , Oligopeptides/chemistry , Ions/chemistry , Thermodynamics
15.
Cancer Inform ; 22: 11769351231159893, 2023.
Article in English | MEDLINE | ID: mdl-37008073

ABSTRACT

Motivation: The PAM50 signature/method is widely used for intrinsic subtyping of breast cancer samples. However, depending on the number and composition of the samples included in a cohort, the method may assign different subtypes to the same sample. This lack of robustness is mainly due to the fact that PAM50 subtracts a reference profile, which is computed using all samples in the cohort, from each sample before classification. In this paper we propose modifications to PAM50 to develop a simple and robust single-sample classifier, called MPAM50, for intrinsic subtyping of breast cancer. Like PAM50, the modified method uses a nearest centroid approach for classification, but the centroids are computed differently, and the distances to the centroids are determined using an alternative method. Additionally, MPAM50 uses unnormalized expression values for classification and does not subtract a reference profile from the samples. In other words, MPAM50 classifies each sample independently, and so avoids the previously mentioned robustness issue. Results: A training set was employed to find the new MPAM50 centroids. MPAM50 was then tested on 19 independent datasets (obtained using various expression profiling technologies) containing 9637 samples. Overall good agreement was observed between the PAM50- and MPAM50-assigned subtypes with a median accuracy of 0.792, which (we show) is comparable with the median concordance between various implementations of PAM50. Additionally, MPAM50- and PAM50-assigned intrinsic subtypes were found to agree comparably with the reported clinical subtypes. Also, survival analyses indicated that MPAM50 preserves the prognostic value of the intrinsic subtypes. These observations demonstrate that MPAM50 can replace PAM50 without loss of performance. On the other hand, MPAM50 was compared with 2 previously published single-sample classifiers, and with 3 alternative modified PAM50 approaches. 
The results indicated a superior performance by MPAM50. Conclusions: MPAM50 is a robust, simple, and accurate single-sample classifier of intrinsic subtypes of breast cancer.
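
The single-sample idea in the abstract above, classifying each sample against fixed centroids without subtracting a cohort-derived reference profile, can be sketched generically as follows. This is not MPAM50 itself: the abstract does not specify how its centroids or distances are computed, so plain Euclidean distance is swapped in, and the labels and centroid values are toy placeholders.

```python
import math


def nearest_centroid(sample, centroids):
    """Return the label of the centroid closest to `sample`.

    Only the sample's own expression values are used, so the call gives the
    same answer regardless of which other samples are in the cohort.
    Euclidean distance is an illustrative stand-in for MPAM50's metric.
    """
    def distance(a, b):
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

    return min(centroids, key=lambda label: distance(sample, centroids[label]))


# Toy centroids over three hypothetical genes (not real PAM50 values).
toy_centroids = {
    "subtype_A": [1.0, 5.0, 2.0],
    "subtype_B": [4.0, 1.0, 3.0],
}
label = nearest_centroid([1.2, 4.5, 2.1], toy_centroids)
```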

16.
Rapid Commun Mass Spectrom ; 26(8): 915-20, 2012 Apr 30.
Article in English | MEDLINE | ID: mdl-22396027

ABSTRACT

RATIONALE: Peptide identification reliability can be improved by excluding from analysis those m/z peaks of candidate peptides which cannot be observed in practice due to various physical, chemical or thermodynamic considerations. We propose using dissociation energies (as opposed to proton affinities) as a predictor of observability of different m/z peaks in spectra of short peptides. METHODS: Mass spectra of the tetrapeptides AAAA, AAFA, AAVA, AFAA, AVAA, AFFA, and AVVA were measured in the collision-induced dissociation (CID) activation mode on a grid of activation times 0.05 to 100 ms and normalized collision energy 10 to 35%. The lowest energy geometries and vibrational spectra were calculated for the precursor ions and their charged and neutral fragments using density functional theory (DFT) at the TPSS/6-31G(d,p) level. Dissociation energies were calculated for all fragmentation channels leading to b- or y-fragments. RESULTS: It is demonstrated that m/z peaks observed in the mass spectra correspond to the fragmentation channels with the lowest dissociation energies. Using 50 kcal/mol as the cut-off value of dissociation energy, it was predicted that 28 out of 42 possible peaks in the b- and y-series of the seven tetrapeptides can be observed in mass spectra. In the experiments, 26 b- or y-peaks were observed, all of which are among the 28 predicted ones. CONCLUSIONS: The use of dissociation energies generalizes the use of proton affinities for semi-quantitative predictions of relative intensities of different m/z peaks of short peptides. Further advances in this direction will pave the way for reliable quantitative predictions and, hence, for a significant improvement in robustness and accuracy of peptide and protein identification tools.


Subject(s)
Mass Spectrometry/methods , Peptide Mapping/methods , Peptides/chemistry , Kinetics
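
The cut-off rule from the abstract above (fragmentation channels with dissociation energy up to 50 kcal/mol are predicted to be observable) amounts to a simple filter. The sketch below assumes the threshold is inclusive, which the abstract does not state; the function name is hypothetical and the energies are invented toy values, not computed DFT results.

```python
CUTOFF_KCAL_PER_MOL = 50.0  # cut-off value quoted in the abstract


def predict_observable(channel_energies, cutoff=CUTOFF_KCAL_PER_MOL):
    """Return the fragment labels whose dissociation energy is <= cutoff."""
    return sorted(
        label for label, energy in channel_energies.items() if energy <= cutoff
    )


# Toy dissociation energies in kcal/mol for one hypothetical peptide.
toy_energies = {"b2": 38.5, "b3": 61.0, "y1": 47.2, "y3": 55.4}
observable = predict_observable(toy_energies)
```

A peptide identification tool could apply such a filter to each candidate's theoretical b- and y-peaks before scoring, as the abstract's conclusions suggest.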
17.
Cancer Inform ; 21: 11769351221100718, 2022.
Article in English | MEDLINE | ID: mdl-35722224

ABSTRACT

Motivation: The precise diagnosis of the major subtypes, lung adenocarcinoma and lung squamous cell carcinoma, of non-small-cell lung cancer is of practical importance as some treatments are subtype-specific. However, in some cases diagnosis via the commonly-used method, that is staining the specimen using immunohistochemical markers, may be challenging. Hence, having a computational method that complements the diagnosis is desirable. In this paper, we propose a gene signature for this purpose. Results: We developed an expression-based method that systematically suggests a huge set of candidate gene signatures and finds the best candidate. By applying this method to a training set, the optimal gene signature was found by considering close to 765 billion candidate signatures. The 8-gene signature found for classifying the 2 aforementioned subtypes comprises TP63, CALML3, KRT5, PKP1, TESC, SPINK1, C9orf152, and KRT7. The signature achieved a high overall prediction accuracy of 0.936 when tested using 34 independent gene expression datasets obtained using different technologies and comprising 2556 adenocarcinoma and 1630 squamous cell carcinoma samples. Additionally, the signature performed well in clinically challenging cases, that is poorly differentiated tumors and specimens obtained from biopsies. In comparison with 2 previously reported signatures, our signature performed better in terms of overall accuracy and especially accuracy of classifying lung squamous cell carcinoma. Conclusions: Our signature is easy to use and accurate regardless of the technology used to obtain the gene expression profiles. It performs well even in clinically challenging cases and thus can assist pathologists in diagnosis of the ambiguous cases.

18.
J Am Soc Mass Spectrom ; 33(6): 917-931, 2022 Jun 01.
Article in English | MEDLINE | ID: mdl-35500907

ABSTRACT

Fast and accurate identifications of pathogenic bacteria along with their associated antibiotic resistance proteins are of paramount importance for patient treatments and public health. To meet this goal from the mass spectrometry aspect, we have augmented the previously published Microorganism Classification and Identification (MiCId) workflow for this capability. To evaluate the performance of this augmented workflow, we have used MS/MS datafiles from samples of 10 antibiotic-resistant bacterial strains belonging to three different species: Escherichia coli, Klebsiella pneumoniae, and Pseudomonas aeruginosa. The evaluation shows that MiCId's workflow has a sensitivity value around 85% (with a lower bound at about 72%) and a precision greater than 95% in identifying antibiotic resistance proteins. In addition to having high sensitivity and precision, MiCId's workflow is fast and portable, making it a valuable tool for rapid identifications of bacteria as well as detection of their antibiotic resistance proteins. It performs microorganismal identifications, protein identifications, sample biomass estimates, and antibiotic resistance protein identifications in 6-17 min per MS/MS sample using computing resources that are available in most desktop and laptop computers. We have also demonstrated another use of MiCId's workflow. Using MS/MS data sets from samples of two bacterial clonal isolates, one antibiotic-sensitive and the other multidrug-resistant, we applied MiCId's workflow to investigate possible mechanisms of antibiotic resistance in these pathogenic bacteria; the results showed that MiCId's conclusions agree with the published study. The new version of MiCId (v.07.01.2021) is freely available for download at https://www.ncbi.nlm.nih.gov/CBBresearch/Yu/downloads.html.


Subject(s)
Proteomics; Tandem Mass Spectrometry; Anti-Bacterial Agents/pharmacology; Bacteria/chemistry; Drug Resistance, Bacterial; Drug Resistance, Microbial; Escherichia coli; Humans; Proteomics/methods; Pseudomonas aeruginosa; Tandem Mass Spectrometry/methods; Workflow
19.
Bioinformatics ; 26(21): 2752-9, 2010 Nov 01.
Article in English | MEDLINE | ID: mdl-20826881

ABSTRACT

MOTIVATION: Term-enrichment analysis facilitates biological interpretation by annotating experimentally or computationally obtained data with terms from controlled vocabularies. This process usually involves obtaining statistical significance for each vocabulary term and using the most significant terms to describe a given set of biological entities, often associated with weights. Many existing enrichment methods require selecting an arbitrary number of the most significant entities and/or do not account for the weights of entities. Others either mandate extensive simulations to obtain statistics or assume a normal weight distribution. In addition, most methods have difficulty assigning correct statistical significance to terms with few entities. RESULTS: Implementing the well-known Lugannani-Rice formula, we have developed a novel approach, called SaddleSum, that is free from all the aforementioned constraints, and evaluated it against several existing methods. With entity weights properly taken into account, SaddleSum is internally consistent and stable with respect to the choice of the number of most significant entities selected. Making few assumptions about the input data, the proposed method is universal and can thus be applied to areas beyond the analysis of microarrays. Employing an asymptotic approximation, SaddleSum provides a term-size-dependent score distribution function that gives rise to accurate statistical significance even for terms with few entities. As a consequence, SaddleSum enables researchers to place confidence in its significance assignments to small terms, which are often biologically most specific. AVAILABILITY: Our implementation, which uses Bonferroni correction to account for multiple hypothesis testing, is available at http://www.ncbi.nlm.nih.gov/CBBresearch/qmbp/mn/enrich/. Source code for the standalone version can be downloaded from ftp://ftp.ncbi.nlm.nih.gov/pub/qmbpmn/SaddleSum/.


Subject(s)
Computational Biology/methods; Vocabulary, Controlled; Algorithms; Data Interpretation, Statistical; Databases, Factual; Terminology as Topic
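
The Lugannani-Rice formula named in the SaddleSum abstract above can be illustrated with a minimal sketch. This is not the SaddleSum implementation. It assumes that a term's score is the sum of m weights drawn independently from the empirical distribution of all entity weights, and it approximates the upper-tail probability as 1 - Phi(w) + phi(w) * (1/u - 1/w), where t solves K'(t) = score/m, w = sqrt(2m(t*score/m - K(t))), u = t*sqrt(m*K''(t)), and K is the cumulant generating function of one weight. The function name `lugannani_rice_tail` and its arguments are hypothetical.

```python
import math


def lugannani_rice_tail(weights, m, score):
    """Approximate P(S >= score), S being the sum of m i.i.d. draws from `weights`.

    Hypothetical helper for illustration; valid only in the upper tail,
    i.e. m * mean(weights) < score < m * max(weights).
    """
    n = len(weights)
    x = score / m  # target mean per draw
    if not (sum(weights) / n < x < max(weights)):
        raise ValueError("score must lie strictly between m*mean and m*max")

    def cgf(t):
        # Cumulant generating function K(t) of the empirical distribution,
        # with its first two derivatives; the shift avoids overflow in exp.
        shift = max(t * w for w in weights)
        e = [math.exp(t * w - shift) for w in weights]
        z = sum(e)
        k1 = sum(w * ei for w, ei in zip(weights, e)) / z
        k2 = sum(w * w * ei for w, ei in zip(weights, e)) / z - k1 * k1
        return shift + math.log(z / n), k1, k2

    # K'(t) is increasing, so bracket the saddlepoint K'(t) = x and bisect.
    lo, hi = 0.0, 1.0
    while cgf(hi)[1] < x:
        hi *= 2.0
    for _ in range(100):
        mid = 0.5 * (lo + hi)
        if cgf(mid)[1] < x:
            lo = mid
        else:
            hi = mid
    t = 0.5 * (lo + hi)

    k, _, k2 = cgf(t)
    w = math.sqrt(2.0 * m * (t * x - k))
    u = t * math.sqrt(m * k2)
    upper_normal = 0.5 * math.erfc(w / math.sqrt(2.0))  # 1 - Phi(w)
    density = math.exp(-0.5 * w * w) / math.sqrt(2.0 * math.pi)  # phi(w)
    return upper_normal + density * (1.0 / u - 1.0 / w)
```

A call such as `lugannani_rice_tail(all_weights, m=5, score=observed_sum)` returns an approximate P-value for a term with five entities. The sketch becomes numerically unstable when the score is very close to m times the mean weight, where w and u both approach zero, and it applies no continuity correction for lattice-valued weights.
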
20.
PLoS Comput Biol ; 6(7): e1000852, 2010 Jul 15.
Article in English | MEDLINE | ID: mdl-20657661

ABSTRACT

Most pairwise and multiple sequence alignment programs seek alignments with optimal scores. Central to defining such scores is selecting a set of substitution scores for aligned amino acids or nucleotides. For local pairwise alignment, substitution scores are implicitly of log-odds form. We now extend the log-odds formalism to multiple alignments, using Bayesian methods to construct "BILD" ("Bayesian Integral Log-odds") substitution scores from prior distributions describing columns of related letters. This approach has been used previously only to define scores for aligning individual sequences to sequence profiles, but it has much broader applicability. We describe how to calculate BILD scores efficiently, and illustrate their uses in Gibbs sampling optimization procedures, gapped alignment, and the construction of hidden Markov model profiles. BILD scores enable automated selection of optimal motif and domain model widths, and can inform both the decision of whether to include a sequence in a multiple alignment and the selection of insertion and deletion locations. Other applications include the classification of related sequences into subfamilies, and the definition of profile-profile alignment scores. Although a fully realized multiple alignment program must rely upon more than substitution scores, many existing multiple alignment programs can be modified to employ BILD scores. We illustrate how simple BILD-score-based strategies can enhance the recognition of DNA binding domains, including the Api-AP2 domain in Toxoplasma gondii and Plasmodium falciparum.


Subject(s)
Computational Biology/methods; Models, Statistical; Pattern Recognition, Automated/methods; Sequence Alignment/methods; Algorithms; Amino Acid Sequence; Base Sequence; Bayes Theorem; Consensus Sequence; DNA-Binding Proteins/chemistry; DNA-Binding Proteins/genetics; Databases, Genetic; Plasmodium; Protein Structure, Tertiary; Toxoplasma
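
The log-odds idea in the BILD abstract above can be sketched for a single alignment column. This is an illustration only, not the authors' code. It assumes the prior over letter frequencies is a single Dirichlet distribution, so the integral over the prior has a closed form in gamma functions, and it compares that marginal probability with the probability of the same letters under independent background frequencies. The function name `bild_score`, the prior parameters `alpha`, and the `background` frequencies are hypothetical.

```python
import math
from collections import Counter


def bild_score(column, alpha, background):
    """Log-odds score, in bits, of one alignment column (illustrative sketch).

    column     -- string of aligned letters, e.g. "AAAC"
    alpha      -- dict of Dirichlet prior parameters per letter (assumed)
    background -- dict of background letter probabilities (assumed)
    """
    counts = Counter(column)
    n = len(column)
    alpha_total = sum(alpha.values())

    # log of the integral of prod(q[letter]) over the Dirichlet prior:
    # Gamma(a*) / Gamma(a* + n) * prod_j Gamma(a_j + c_j) / Gamma(a_j)
    log_related = math.lgamma(alpha_total) - math.lgamma(alpha_total + n)
    for letter, c in counts.items():
        log_related += math.lgamma(alpha[letter] + c) - math.lgamma(alpha[letter])

    # log probability of the same letters if they were unrelated
    log_unrelated = sum(math.log(background[letter]) for letter in column)

    return (log_related - log_unrelated) / math.log(2.0)
```

With a uniform DNA prior such as `alpha = {c: 0.25 for c in "ACGT"}` and matching background frequencies, a conserved column like `"AAAA"` scores positive and a column with four different letters like `"ACGT"` scores negative, which is the behavior a substitution score for multiple alignment needs.
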