1.
Nucleic Acids Res ; 43(Database issue): D1145-51, 2015 Jan.
Article in English | MEDLINE | ID: mdl-25404128

ABSTRACT

MOPED (Multi-Omics Profiling Expression Database; http://moped.proteinspire.org) has transitioned from solely a protein expression database to a multi-omics resource for human and model organisms. Through a web-based interface, MOPED presents consistently processed data for gene, protein and pathway expression. To improve data quality, consistency and use, MOPED includes metadata detailing experimental design and analysis methods. The multi-omics data are integrated through direct links between genes and proteins and further connected to pathways and experiments. MOPED now contains over 5 million records, information for approximately 75,000 genes and 50,000 proteins from four organisms (human, mouse, worm, yeast). These records correspond to 670 unique combinations of experiment, condition, localization and tissue. MOPED includes the following new features: pathway expression, Pathway Details pages, experimental metadata checklists, experiment summary statistics and more advanced searching tools. Advanced searching enables querying for genes, proteins, experiments, pathways and keywords of interest. The system is enhanced with visualizations for comparing across different data types. In the future MOPED will expand the number of organisms, increase integration with pathways and provide connections to disease.


Subjects
Databases, Genetic , Gene Expression Profiling , Proteomics , Animals , Humans , Internet , Mice , Proteins/genetics , Proteins/metabolism
2.
J Proteome Res ; 13(1): 107-13, 2014 Jan 03.
Article in English | MEDLINE | ID: mdl-24350770

ABSTRACT

The Model Organism Protein Expression Database (MOPED, http://moped.proteinspire.org) is an expanding proteomics resource to enable biological and biomedical discoveries. MOPED aggregates simple, standardized and consistently processed summaries of protein expression and metadata from proteomics (mass spectrometry) experiments from human and model organisms (mouse, worm, and yeast). The latest version of MOPED adds new estimates of protein abundance and concentration as well as relative (differential) expression data. MOPED provides a new updated query interface that allows users to explore information by organism, tissue, localization, condition, experiment, or keyword. MOPED supports the Human Proteome Project's efforts to generate chromosome- and diseases-specific proteomes by providing links from proteins to chromosome and disease information as well as many complementary resources. MOPED supports a new omics metadata checklist to harmonize data integration, analysis, and use. MOPED's development is driven by the user community, which spans 90 countries and guides future development that will transform MOPED into a multiomics resource. MOPED encourages users to submit data in a simple format. They can use the metadata checklist to generate a data publication for this submission. As a result, MOPED will provide even greater insights into complex biological processes and systems and enable deeper and more comprehensive biological and biomedical discoveries.


Subjects
Databases, Protein , Proteomics , Animals , Humans , User-Computer Interface
3.
Nucleic Acids Res ; 40(Database issue): D1093-9, 2012 Jan.
Article in English | MEDLINE | ID: mdl-22139914

ABSTRACT

Large numbers of mass spectrometry proteomics studies are being conducted to understand all types of biological processes. The size and complexity of proteomics data hinder efforts to easily share, integrate, query and compare the studies. The Model Organism Protein Expression Database (MOPED, http://moped.proteinspire.org) is a new and expanding proteomics resource that enables rapid browsing of protein expression information from publicly available studies on humans and model organisms. MOPED is designed to simplify the comparison and sharing of proteomics data for the greater research community. MOPED uniquely provides protein level expression data, meta-analysis capabilities and quantitative data from standardized analysis. Data can be queried for specific proteins, browsed based on organism, tissue, localization and condition and sorted by false discovery rate and expression. MOPED empowers users to visualize their own expression data and compare it with existing studies. Further, MOPED links to various protein and pathway databases, including GeneCards, Entrez, UniProt, KEGG and Reactome. The current version of MOPED contains over 43,000 proteins with at least one spectral match and more than 11 million high certainty spectra.


Subjects
Databases, Protein , Proteins/metabolism , Animals , Humans , Mass Spectrometry , Mice , Models, Animal , Proteomics , User-Computer Interface
4.
OMICS ; 11(4): 351-65, 2007.
Article in English | MEDLINE | ID: mdl-18092908

ABSTRACT

Determining the error rate for peptide and protein identification accurately and reliably is necessary to enable evaluation and cross-comparisons of high throughput proteomics experiments. Currently, peptide identification is based either on preset scoring thresholds or on probabilistic models trained on datasets that are often dissimilar to experimental results. The false discovery rates (FDR) and peptide identification probabilities for these preset thresholds or models often vary greatly across different experimental treatments, organisms, or instruments used in specific experiments. To overcome these difficulties, randomized databases have been used to estimate the FDR. However, the cumulative FDR may include low probability identifications when there are a large number of peptide identifications and exclude high probability identifications when there are few. To overcome this logical inconsistency, this study expands the use of randomized databases to generate experiment-specific estimates of peptide identification probabilities. These experiment-specific probabilities are generated by logistic and Loess regression models of the peptide scores obtained from original and reshuffled database matches. These experiment-specific probabilities are shown to closely approximate "true" probabilities based on known standard protein mixtures across different experiments. Probabilities generated by the earlier Peptide_Prophet and more recent LIPS models are shown to differ significantly from this study's experiment-specific probabilities, especially for unknown samples. The experiment-specific probabilities reliably estimate the accuracy of peptide identifications and overcome potential logical inconsistencies of the cumulative FDR. This estimation method is demonstrated using a Sequest database search, LIPS model, and a reshuffled database. However, this approach is generally applicable to any search algorithm, peptide scoring, and statistical model when using a randomized database.
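
The contrast this abstract draws between a cumulative FDR and score-specific probabilities can be illustrated with a minimal Python sketch. It is not the study's method: the study fits logistic and Loess regression models, whereas this sketch uses plain counting within a score range. The function names are hypothetical, and the sketch assumes the reshuffled (decoy) database is the same size as the original, so that decoy match counts estimate the number of false matches against the original database.

```python
def cumulative_fdr(target_scores, decoy_scores, threshold):
    """Estimate the FDR among all original-database matches scoring >= threshold."""
    n_target = sum(s >= threshold for s in target_scores)
    n_decoy = sum(s >= threshold for s in decoy_scores)
    if n_target == 0:
        return 0.0
    # Assumption: each decoy match stands in for one false target match.
    return min(1.0, n_decoy / n_target)


def binned_probability(target_scores, decoy_scores, lo, hi):
    """Estimate P(correct) for original-database matches with lo <= score < hi."""
    n_target = sum(lo <= s < hi for s in target_scores)
    n_decoy = sum(lo <= s < hi for s in decoy_scores)
    if n_target == 0:
        return None  # no matches in this score range, nothing to estimate
    return max(0.0, 1.0 - n_decoy / n_target)
```

With made-up scores, a threshold can yield a low cumulative FDR while the matches just above it still have a much lower estimated probability of being correct, which is the inconsistency the abstract describes.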


Subjects
Databases, Protein , Peptides/chemistry , Algorithms , Models, Biological , Probability , Random Allocation , Regression Analysis , Software
5.
Proteomes ; 5(1), 2017 Feb 03.
Article in English | MEDLINE | ID: mdl-28248256

ABSTRACT

Medulloblastoma (MB) is the most common malignant pediatric brain tumor. Patient survival has remained largely the same for the past 20 years, with therapies causing significant health, cognitive, behavioral and developmental complications for those who survive the tumor. In this study, we profiled the total transcriptome and proteome of two established MB cell lines, Daoy and UW228, using high-throughput RNA sequencing (RNA-Seq) and label-free nano-LC-MS/MS-based quantitative proteomics, coupled with advanced pathway analysis. While Daoy has been suggested to belong to the sonic hedgehog (SHH) subtype, the exact UW228 subtype is not yet clearly established. Thus, a goal of this study was to identify protein markers and pathways that would help elucidate their subtype classification. A number of differentially expressed genes and proteins, including a number of adhesion, cytoskeletal and signaling molecules, were observed between the two cell lines. While several cancer-associated genes/proteins exhibited similar expression across the two cell lines, upregulation of a number of signature proteins and enrichment of key components of SHH and WNT signaling pathways were uniquely observed in Daoy and UW228, respectively. The novel information on differentially expressed genes/proteins and enriched pathways provides insights into the biology of MB, which could help elucidate their subtype classification.

6.
Big Data ; 4(1): 60-6, 2016 Mar.
Article in English | MEDLINE | ID: mdl-27441585

ABSTRACT

This case study evaluates and tracks vitality of a city (Seattle), based on a data-driven approach, using strategic, robust, and sustainable metrics. This case study was collaboratively conducted by the Downtown Seattle Association (DSA) and CDO Analytics teams. The DSA is a nonprofit organization focused on making the city of Seattle and its Downtown a healthy and vibrant place to Live, Work, Shop, and Play. DSA primarily operates through public policy advocacy, community and business development, and marketing. In 2010, the organization turned to CDO Analytics ( cdoanalytics.org ) to develop a process that can guide and strategically focus DSA efforts and resources for maximal benefit to the city of Seattle and its Downtown. CDO Analytics was asked to develop clear, easily understood, and robust metrics for a baseline evaluation of the health of the city, as well as for ongoing monitoring and comparisons of the vitality, sustainability, and growth. The DSA and CDO Analytics teams strategized on how to effectively assess and track the vitality of Seattle and its Downtown. The two teams filtered a variety of data sources, and evaluated the veracity of multiple diverse metrics. This iterative process resulted in the development of a small number of strategic, simple, reliable, and sustainable metrics across four pillars of activity: Live, Work, Shop, and Play. Data during the 5 years before 2010 were used for the development of the metrics and model and its training, and data during the 5 years from 2010 and on were used for testing and validation. This work enabled DSA to routinely track these strategic metrics, use them to monitor the vitality of Downtown Seattle, prioritize improvements, and identify new value-added programs. As a result, the four-pillar approach became an integral part of the data-driven decision-making and execution of the Seattle community's improvement activities. 
The approach described in this case study is actionable, robust, inexpensive, and easy to adopt and sustain. It can be applied to cities, districts, counties, regions, states, or countries, enabling cross-comparisons and improvements of vitality, sustainability, and growth.


Subjects
City Planning/methods , Organizational Case Studies , Humans , Machine Learning , Washington
7.
OMICS ; 9(3): 233-50, 2005.
Article in English | MEDLINE | ID: mdl-16209638

ABSTRACT

High-throughput protein analysis by tandem mass spectrometry produces anywhere from thousands to millions of spectra that are being used for peptide and protein identifications. Though each spectrum corresponds only to one charged peptide (ion) state, repetitive database searches of multiple charge states are typically conducted since the resolution of many common mass spectrometers is not sufficient to determine the charge state. The resulting database searches are both error-prone and time-consuming. We describe a straightforward, accurate approach to charge state estimation (CHASTE). CHASTE relies on fragment ion peak distributions, and by using reliable logistic regression models, combines different measurements to improve its accuracy. CHASTE's performance has been validated on data sets comprising known peptide dissociation spectra, obtained by replicate analyses of our earlier developed protein standard mixture using ion trap mass spectrometers at different laboratories. CHASTE was able to reduce the number of needed database searches by at least 60% and the number of redundant searches by at least 90% virtually without any informational loss. This greatly alleviates one of the major bottlenecks in high throughput peptide and protein identifications. Thresholds and parameter estimates can be tailored to specific analysis situations, pipelines, and instrumentations. CHASTE was implemented in Java with GUI-based and command-line-based interfaces.


Subjects
Mass Spectrometry , Proteomics/methods , Computer Graphics , Databases, Protein , Peptides/analysis , Predictive Value of Tests , Proteins/analysis , Reproducibility of Results , Software , User-Computer Interface
8.
OMICS ; 19(12): 754-6, 2015 Dec.
Article in English | MEDLINE | ID: mdl-26575978

ABSTRACT

Gene/disease associations are a critical part of exploring disease causes and ultimately cures, yet the publications that might provide such information are too numerous to be manually reviewed. We present a software utility, MOPED-Digger, that enables focused human assessment of literature by applying natural language processing (NLP) to search for customized lists of genes and diseases in titles and abstracts from biomedical publications. The results are ranked lists of gene/disease co-appearances and the publications that support them. Analysis of 18,159,237 PubMed title/abstracts yielded 1,796,799 gene/disease co-appearances that can be used to focus attention on the most promising publications for a possible gene/disease association. An integrated score is provided to enable assessment of broadly presented published evidence to capture more tenuous connections. MOPED-Digger is written in Java and uses Apache Lucene 5.0 library. The utility runs as a command-line program with a variety of user-options and is freely available for download from the MOPED 3.0 website (moped.proteinspire.org).
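
The idea of ranking gene/disease co-appearances across titles and abstracts can be sketched in a few lines of Python. This is only an illustration under stated assumptions: MOPED-Digger itself is a Java program built on Apache Lucene, and its integrated score is not reproduced here. The function names are hypothetical, matching is plain case-insensitive whole-word search, and a pair is counted once per text in which both terms appear.

```python
import re
from collections import Counter
from itertools import product


def find_terms(text, terms):
    """Return the terms that occur in text as whole words (case-insensitive)."""
    found = set()
    for term in terms:
        pattern = r"\b" + re.escape(term) + r"\b"
        if re.search(pattern, text, flags=re.IGNORECASE):
            found.add(term)
    return found


def rank_coappearances(texts, genes, diseases):
    """Count texts in which a gene and a disease co-appear, most frequent first."""
    counts = Counter()
    for text in texts:
        genes_found = sorted(find_terms(text, genes))
        diseases_found = sorted(find_terms(text, diseases))
        for pair in product(genes_found, diseases_found):
            counts[pair] += 1  # one count per text, however often the terms repeat
    return counts.most_common()
```

Case-insensitive matching is a simplification; short gene symbols that double as ordinary words would need stricter handling in practice.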


Subjects
Computational Biology/methods , Genetic Association Studies/methods , Genetic Predisposition to Disease , Software , Humans
9.
OMICS ; 19(4): 197-208, 2015 Apr.
Article in English | MEDLINE | ID: mdl-25831060

ABSTRACT

Complex diseases are caused by a combination of genetic and environmental factors, creating a difficult challenge for diagnosis and defining subtypes. This review article describes how distinct disease subtypes can be identified through integration and analysis of clinical and multi-omics data. A broad shift toward molecular subtyping of disease using genetic and omics data has yielded successful results in cancer and other complex diseases. To determine molecular subtypes, patients are first classified by applying clustering methods to different types of omics data, then these results are integrated with clinical data to characterize distinct disease subtypes. An example of this molecular-data-first approach is in research on Autism Spectrum Disorder (ASD), a spectrum of social communication disorders marked by tremendous etiological and phenotypic heterogeneity. In the case of ASD, omics data such as exome sequences and gene and protein expression data are combined with clinical data such as psychometric testing and imaging to enable subtype identification. Novel ASD subtypes have been proposed, such as CHD8, using this molecular subtyping approach. Broader use of molecular subtyping in complex disease research is impeded by data heterogeneity, diversity of standards, and ineffective analysis tools. The future of molecular subtyping for ASD and other complex diseases calls for an integrated resource to identify disease mechanisms, classify new patients, and inform effective treatment options. This in turn will empower and accelerate precision medicine and personalized healthcare.


Subjects
Autism Spectrum Disorder/genetics , Genomics , Precision Medicine , Autism Spectrum Disorder/classification , Autism Spectrum Disorder/therapy , Cluster Analysis , Humans , Molecular Typing
10.
OMICS ; 8(3): 255-65, 2004.
Article in English | MEDLINE | ID: mdl-15669717

ABSTRACT

Current techniques in tandem mass spectrometric analyses of cellular protein contents often produce thousands to tens of thousands of spectra per experiment. This study introduces a new algorithm, named SPEQUAL, which is aimed at automated tandem mass spectral quality assessment. The quality of a given spectrum can be evaluated from three basic components: (i) charge state differentiation, (ii) total signal intensity, and (iii) signal-to-noise estimates. The differentiation between single and multiple precursor charge states (i) provides a binary score for a given spectrum. Components (ii) and (iii) provide partial scores which are subsequently summarized and multiplied by the first score. SPEQUAL was applied to over 10,000 data files derived from almost 3,000 tandem mass spectra, and the results (final cumulative scores) were manually verified. SPEQUAL's performance was determined to have high sensitivity and specificity and low error rates for both spectral quality estimates in general and precursor charge state differentiation in particular. Each of the partial scores is controlled by adjustable thresholds to fine-tune SPEQUAL's performance for different analysis pipelines and instrumentation. This spectral quality assessment tool is intended to act in an advisory role to the researcher, assisting in filtration of thousands of spectra typically produced by high throughput tandem mass spectrometric proteome analyses. Lastly, SPEQUAL was implemented as Java GUI-based and command-line-based interfaces freely available for both academic and industrial researchers.
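
The way the three components combine, a binary charge-state score multiplied by the sum of two partial scores, can be sketched as follows. This is a hypothetical illustration, not SPEQUAL's implementation: the abstract does not say how the partial scores are computed, so the capped value-to-threshold ratio, the function names, and the threshold values are all assumptions.

```python
def partial_score(value, threshold):
    """Assumed partial score: the ratio of value to threshold, capped at 1.0."""
    if threshold <= 0:
        raise ValueError("threshold must be positive")
    return min(1.0, max(0.0, value / threshold))


def quality_score(charge_state_resolved, total_intensity, signal_to_noise,
                  intensity_threshold, snr_threshold):
    """Multiply the binary charge-state score by the summed partial scores."""
    binary = 1.0 if charge_state_resolved else 0.0       # component (i)
    summed = (partial_score(total_intensity, intensity_threshold)   # component (ii)
              + partial_score(signal_to_noise, snr_threshold))      # component (iii)
    return binary * summed
```

Because the binary score multiplies the sum, a spectrum whose charge state cannot be differentiated scores zero regardless of its intensity or signal-to-noise values.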


Subjects
Mass Spectrometry/methods , Proteomics , Quality Control , Mass Spectrometry/standards
11.
OMICS ; 8(4): 357-69, 2004.
Article in English | MEDLINE | ID: mdl-15703482

ABSTRACT

This study addresses the issue of peptide identification resulting from tandem mass spectrometry proteomics analysis followed by database search. This work shows that the Logistic Identification of Peptides (LIP) Index achieves high sensitivity and specificity for peptide classification relative to a manually verified "gold" standard and also accurately estimates the probability of a correct peptide match. The LIP Index is a weighted average of SEQUEST output variables based on logistic regression models and is a transparent, easy to use, inclusive, extendable, and statistically sound approach to classify correct peptide identifications. Modifications, such as normalizing cross-correlations (Xcorr) for peptide length, adjusting for charge state, and the number of tryptic termini, significantly improve the fit of the logistic regression models, as well as increase sensitivity and specificity. The LIP Index also incorporates earlier developed statistical models on spectral quality assessment and peptide identification, which further improves sensitivity and specificity.


Subjects
Computational Biology/methods , Mass Spectrometry/methods , Peptides/chemistry , Software , Algorithms , Bacterial Proteins/chemistry , Chromatography, Liquid , Databases as Topic , Databases, Protein , Logistic Models , Models, Statistical , Models, Theoretical , Probability , Proteins/chemistry , Proteomics , ROC Curve , Sensitivity and Specificity , Trypsin/pharmacology
12.
Concurr Comput ; 26(13): 2112-2121, 2014 Sep 10.
Article in English | MEDLINE | ID: mdl-25313296

ABSTRACT

Functional annotation of newly sequenced genomes is one of the major challenges in modern biology. With modern sequencing technologies, the protein sequence universe is rapidly expanding. Newly sequenced bacterial genomes alone contain over 7.5 million proteins. The rate of data generation has far surpassed that of protein annotation. The volume of protein data makes manual curation infeasible, whereas a high compute cost limits the utility of existing automated approaches. In this work, we present an improved and optimized automated workflow to enable large-scale protein annotation. The workflow uses high performance computing architectures and a low complexity classification algorithm to assign proteins into existing clusters of orthologous groups of proteins. On the basis of the Position-Specific Iterative Basic Local Alignment Search Tool, the algorithm ensures at least 80% specificity and sensitivity of the resulting classifications. The workflow utilizes highly scalable parallel applications for classification and sequence alignment. Using Extreme Science and Engineering Discovery Environment supercomputers, the workflow processed 1,200,000 newly sequenced bacterial proteins. With the rapid expansion of the protein sequence universe, the proposed workflow will enable scientists to annotate big genome data.

13.
OMICS ; 18(6): 335-43, 2014 Jun.
Article in English | MEDLINE | ID: mdl-24910945

ABSTRACT

Multi-omics data-driven scientific discovery crucially rests on high-throughput technologies and data sharing. Currently, data are scattered across single omics repositories, stored in varying raw and processed formats, and are often accompanied by limited or no metadata. The Multi-Omics Profiling Expression Database (MOPED, http://moped.proteinspire.org ) version 2.5 is a freely accessible multi-omics expression database. Continual improvement and expansion of MOPED is driven by feedback from the Life Sciences Community. In order to meet the emergent need for an integrated multi-omics data resource, MOPED 2.5 now includes gene relative expression data in addition to protein absolute and relative expression data from over 250 large-scale experiments. To facilitate accurate integration of experiments and increase reproducibility, MOPED provides extensive metadata through the Data-Enabled Life Sciences Alliance (DELSA Global, http://delsaglobal.org ) metadata checklist. MOPED 2.5 has greatly increased the number of proteomics absolute and relative expression records to over 500,000, in addition to adding more than four million transcriptomics relative expression records. MOPED has an intuitive user interface with tabs for querying different types of omics expression data and new tools for data visualization. Summary information including expression data, pathway mappings, and direct connection between proteins and genes can be viewed on Protein and Gene Details pages. These connections in MOPED provide a context for multi-omics expression data exploration. Researchers are encouraged to submit omics data which will be consistently processed into expression summaries. MOPED as a multi-omics data resource is a pivotal public database, interdisciplinary knowledge resource, and platform for multi-omics understanding.


Subjects
Databases, Genetic , Gene Expression Profiling/methods , Software , Animals , Humans , Information Dissemination , Proteomics/methods
14.
OMICS ; 18(1): 10-4, 2014 Jan.
Article in English | MEDLINE | ID: mdl-24456465

ABSTRACT

Biological processes are fundamentally driven by complex interactions between biomolecules. Integrated high-throughput omics studies enable multifaceted views of cells, organisms, or their communities. With the advent of new post-genomics technologies, omics studies are becoming increasingly prevalent; yet the full impact of these studies can only be realized through data harmonization, sharing, meta-analysis, and integrated research. These essential steps require consistent generation, capture, and distribution of metadata. To ensure transparency, facilitate data harmonization, and maximize reproducibility and usability of life sciences studies, we propose a simple common omics metadata checklist. The proposed checklist is built on the rich ontologies and standards already in use by the life sciences community. The checklist will serve as a common denominator to guide experimental design, capture important parameters, and be used as a standard format for stand-alone data publications. The omics metadata checklist and data publications will create efficient linkages between omics data and knowledge-based life sciences innovation and, importantly, allow for appropriate attribution to data generators and infrastructure science builders in the post-genomics era. We ask that the life sciences community test the proposed omics metadata checklist and data publications and provide feedback for their use and improvement.


Subjects
Information Dissemination/ethics , Metagenomics/statistics & numerical data , Research Design/standards , Data Mining , Humans , Metagenomics/economics , Metagenomics/trends , Publishing , Reproducibility of Results
15.
Big Data ; 1(1): 42-50, 2013 Mar.
Article in English | MEDLINE | ID: mdl-27447037

ABSTRACT

The life sciences have entered into the realm of big data and data-enabled science, where data can either empower or overwhelm. These data bring the challenges of the 5 Vs of big data: volume, veracity, velocity, variety, and value. Both independently and through our involvement with DELSA Global (Data-Enabled Life Sciences Alliance, DELSAglobal.org), the Kolker Lab ( kolkerlab.org ) is creating partnerships that identify data challenges and solve community needs. We specialize in solutions to complex biological data challenges, as exemplified by the community resource of MOPED (Model Organism Protein Expression Database, MOPED.proteinspire.org ) and the analysis pipeline of SPIRE (Systematic Protein Investigative Research Environment, PROTEINSPIRE.org ). Our collaborative work extends into the computationally intensive tasks of analysis and visualization of millions of protein sequences through innovative implementations of sequence alignment algorithms and creation of the Protein Sequence Universe tool (PSU). Pushing into the future together with our collaborators, our lab is pursuing integration of multi-omics data and exploration of biological pathways, as well as assigning function to proteins and porting solutions to the cloud. Big data have come to the life sciences; discovering the knowledge in the data will bring breakthroughs and benefits.

16.
Big Data ; 1(4): 237-44, 2013 Dec.
Article in English | MEDLINE | ID: mdl-27447256

ABSTRACT

Children with special healthcare needs (CSHCN) require health and related services that exceed those required by most hospitalized children. A small but growing and important subset of the CSHCN group includes medically complex children (MCCs). MCCs typically have comorbidities and disproportionately consume healthcare resources. To enable strategic planning for the needs of MCCs, simple screens to identify potential MCCs rapidly in a hospital setting are needed. We assessed whether the number of medications used and the class of those medications correlated with MCC status. Retrospective analysis of medication data from the inpatients at Seattle Children's Hospital found that the numbers of inpatient and outpatient medications significantly correlated with MCC status. Numerous variables based on counts of medications, use of individual medications, and use of combinations of medications were considered, resulting in a simple model based on three different counts of medications: outpatient and inpatient drug classes and individual inpatient drug names. The combined model was used to rank the patient population for medical complexity. As a result, simple, objective admission screens for predicting the complexity of patients based on the number and type of medications were implemented.

17.
Big Data ; 1(4): 196-201, 2013 Dec.
Article in English | MEDLINE | ID: mdl-27447251

ABSTRACT

Biological processes are fundamentally driven by complex interactions between biomolecules. Integrated high-throughput omics studies enable multifaceted views of cells, organisms, or their communities. With the advent of new post-genomics technologies, omics studies are becoming increasingly prevalent; yet the full impact of these studies can only be realized through data harmonization, sharing, meta-analysis, and integrated research. These essential steps require consistent generation, capture, and distribution of metadata. To ensure transparency, facilitate data harmonization, and maximize reproducibility and usability of life sciences studies, we propose a simple common omics metadata checklist. The proposed checklist is built on the rich ontologies and standards already in use by the life sciences community. The checklist will serve as a common denominator to guide experimental design, capture important parameters, and be used as a standard format for stand-alone data publications. The omics metadata checklist and data publications will create efficient linkages between omics data and knowledge-based life sciences innovation and, importantly, allow for appropriate attribution to data generators and infrastructure science builders in the post-genomics era. We ask that the life sciences community test the proposed omics metadata checklist and data publications and provide feedback for their use and improvement.

18.
J Proteomics ; 75(1): 122-6, 2011 Dec 10.
Article in English | MEDLINE | ID: mdl-21609792

ABSTRACT

The SPIRE (Systematic Protein Investigative Research Environment) provides web-based experiment-specific mass spectrometry (MS) proteomics analysis (https://www.proteinspire.org). Its emphasis is on usability and integration of the best analytic tools. SPIRE provides an easy-to-use web interface and generates results in both interactive and simple data formats. In contrast to run-based approaches, SPIRE conducts the analysis based on the experimental design. It employs novel methods to generate false discovery rates and local false discovery rates (FDR, LFDR) and integrates the best and complementary open-source search and data analysis methods. The SPIRE approach of integrating X!Tandem, OMSSA and SpectraST can produce an increase in protein IDs (52-88%) over current combinations of scoring and single search engines while also providing accurate multi-faceted error estimation. One of SPIRE's primary assets is combining the results with data on protein function, pathways and protein expression from model organisms. We demonstrate some of SPIRE's capabilities by analyzing mitochondrial proteins from the wild type and 3 mutants of C. elegans. SPIRE also connects results to publicly available proteomics data through its Model Organism Protein Expression Database (MOPED). SPIRE can also provide analysis and annotation for user supplied protein ID and expression data.


Subjects
Databases, Protein , Models, Biological , Proteomics/methods , Systems Biology/methods , Animals , Caenorhabditis elegans/metabolism , Caenorhabditis elegans Proteins/analysis , Caenorhabditis elegans Proteins/chemistry , Caenorhabditis elegans Proteins/genetics , Mass Spectrometry/methods , Mitochondria/metabolism , Mitochondrial Proteins/analysis , Mitochondrial Proteins/chemistry , Mitochondrial Proteins/genetics , User-Computer Interface
19.
OMICS ; 15(7-8): 513-21, 2011.
Article in English | MEDLINE | ID: mdl-21809957

ABSTRACT

To address the monumental challenge of assigning function to millions of sequenced proteins, we completed the first-of-a-kind all-versus-all sequence alignments using BLAST for 9.9 million proteins in the UniRef100 database. Microsoft Windows Azure produced over 3 billion filtered records in 6 days using 475 eight-core virtual machines. Protein classification into functional groups was then performed using Hive and custom jars implemented on top of Apache Hadoop utilizing the MapReduce paradigm. First, using the Clusters of Orthologous Genes (COG) database, a length normalized bit score (LNBS) was determined to be the best similarity measure for classification of proteins. LNBS achieved sensitivity and specificity of 98% each. Second, out of 5.1 million bacterial proteins, about two-thirds were assigned to significantly extended COG groups, encompassing 30 times more assigned proteins. Third, the remaining proteins were classified into protein functional groups using an innovative implementation of a single-linkage algorithm on an in-house Hadoop compute cluster. This implementation significantly reduces the run time for nonindexed queries and optimizes efficient clustering on a large scale. The performance was also verified on Amazon Elastic MapReduce. This clustering assigned nearly 2 million proteins to approximately half a million different functional groups. A similar approach was applied to classify 2.8 million eukaryotic sequences resulting in over 1 million proteins being assigned to existing KOG groups and the remainder clustered into 100,000 functional groups.
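
Single-linkage clustering of the kind mentioned in this abstract can be sketched on a small scale with a union-find structure. This is an in-memory illustration only, not the Hadoop/MapReduce implementation the abstract describes. The function name, the input layout, and the threshold are hypothetical, and the similarity scores are assumed to be precomputed (for example, a length normalized bit score, whose formula the abstract does not give).

```python
def single_linkage_clusters(items, scored_pairs, threshold):
    """Group items so that any pair with score >= threshold shares a cluster."""
    parent = {item: item for item in items}

    def find(x):
        # Follow parent links to the root, halving the path along the way.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b, score in scored_pairs:
        if score >= threshold:
            root_a, root_b = find(a), find(b)
            if root_a != root_b:
                parent[root_b] = root_a  # merge the two clusters

    clusters = {}
    for item in items:
        clusters.setdefault(find(item), []).append(item)
    return sorted(sorted(members) for members in clusters.values())
```

Under single linkage, one sufficiently similar pair is enough to merge two groups, so chains of pairwise similarity end up in the same cluster even when their endpoints are dissimilar.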


Subjects
Proteins/classification , Databases, Protein , Proteins/chemistry , Proteins/metabolism
20.
OMICS ; 15(4): 203-7, 2011 Apr.
Article in English | MEDLINE | ID: mdl-21476841

ABSTRACT

This article is a summary of the technology issues and challenges of data-intensive science and cloud computing as discussed in the Data-Intensive Science (DIS) workshop in Seattle, September 19-20, 2010.


Subjects
Biological Science Disciplines/methods , Technology/methods