Results 1 - 20 of 50
1.
J Proteome Res ; 17(6): 2131-2143, 2018 06 01.
Article in English | MEDLINE | ID: mdl-29671324

ABSTRACT

Traumatic brain injury (TBI) can occur across wide segments of the population, presenting in a heterogeneous manner that makes diagnosis inconsistent and management challenging. Biomarkers offer the potential to objectively identify injury status, severity, and phenotype by measuring the relative concentrations of endogenous molecules in readily accessible biofluids. Through a data-driven discovery approach, novel biomarker candidates for TBI were identified in the serum lipidome of adult male Sprague-Dawley rats in the first week following moderate controlled cortical impact (CCI). Serum samples were analyzed in positive and negative modes by ultraperformance liquid chromatography-mass spectrometry (UPLC-MS). A predictive panel for the classification of injured and uninjured serum samples, consisting of 26 dysregulated species belonging to a variety of lipid classes, was developed with a cross-validated accuracy of 85.3% using omniClassifier software to optimize feature selection. Polyunsaturated fatty acids (PUFAs) and PUFA-containing diacylglycerols were found to be upregulated in sera from injured rats, while changes in sphingolipids and other membrane phospholipids were also observed, many of which map to known secondary injury pathways. Overall, the identified biomarker panel offers viable molecular candidates representing lipids that may readily cross the blood-brain barrier (BBB) and aid in the understanding of TBI pathophysiology.
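
As a rough illustration of the kind of cross-validated panel classification described above, the sketch below selects a fixed-size feature panel inside each cross-validation fold using scikit-learn. It is a generic stand-in, not the omniClassifier software used in the study, and the lipid intensities, sample counts, and panel size are synthetic placeholders.

```python
# Hedged sketch: cross-validated selection and evaluation of a lipid biomarker
# panel with scikit-learn. Generic stand-in, not the omniClassifier software;
# the lipid intensities and labels below are synthetic placeholders.
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.svm import LinearSVC
from sklearn.model_selection import cross_val_score, StratifiedKFold

rng = np.random.default_rng(0)
X = rng.normal(size=(40, 500))       # 40 serum samples x 500 lipid features (synthetic)
y = np.array([0] * 20 + [1] * 20)    # 0 = uninjured, 1 = injured (CCI)

# Feature selection lives inside the pipeline so it is re-fit within each CV fold,
# avoiding the optimistic bias of choosing the panel on the full dataset.
panel_model = Pipeline([
    ("scale", StandardScaler()),
    ("select", SelectKBest(f_classif, k=26)),   # 26-feature panel, as in the abstract
    ("clf", LinearSVC(C=1.0, dual=False)),
])
acc = cross_val_score(panel_model, X, y, cv=StratifiedKFold(5, shuffle=True, random_state=0))
print(f"cross-validated accuracy: {acc.mean():.3f}")
```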


Subjects
Biomarkers/blood; Brain Injuries, Traumatic/metabolism; Lipid Metabolism; Metabolomics/methods; Animals; Brain Injuries, Traumatic/blood; Brain Injuries, Traumatic/diagnosis; Chromatography, Liquid; Male; Rats; Rats, Sprague-Dawley; Software; Tandem Mass Spectrometry
2.
Brief Bioinform ; 13(4): 430-45, 2012 Jul.
Article in English | MEDLINE | ID: mdl-22833495

ABSTRACT

Recent advances in high-throughput biotechnologies have led to rapidly growing research interest in reverse engineering of biomolecular systems (REBMS). 'Data-driven' approaches, i.e., data mining, can be used to extract patterns from large volumes of biochemical data at molecular-level resolution, while 'design-driven' approaches, i.e., systems modeling, can be used to simulate emergent system properties. Consequently, both data- and design-driven approaches applied to -omic data may lead to novel insights in reverse engineering biological systems that could not be obtained using low-throughput platforms. However, several challenges remain in this fast-growing field: (i) integrating heterogeneous biochemical data for data mining, (ii) combining top-down and bottom-up approaches for systems modeling, and (iii) validating system models experimentally. In addition to reviewing progress made by the community and opportunities encountered in addressing these challenges, we explore the emerging field of synthetic biology, which is an exciting approach to validating and analyzing theoretical system models directly through experimental synthesis, i.e., analysis-by-synthesis. The ultimate goal is to address the present and future challenges in reverse engineering biomolecular systems using an integrated workflow of data mining, systems modeling, and synthetic biology.


Subjects
Data Mining/methods; Systems Biology; Bioengineering/methods; Biotechnology
3.
BMC Bioinformatics ; 14 Suppl 11: S8, 2013.
Article in English | MEDLINE | ID: mdl-24564364

ABSTRACT

BACKGROUND: Genome annotation is a crucial component of RNA-seq data analysis. Much effort has been devoted to producing an accurate and rational annotation of the human genome. An annotated genome provides a comprehensive catalogue of genomic functional elements. Currently, at least six human genome annotations are publicly available, including AceView Genes, Ensembl Genes, H-InvDB Genes, RefSeq Genes, UCSC Known Genes, and Vega Genes. Characteristics of these annotations differ because of variations in annotation strategies and information sources. When performing RNA-seq data analysis, researchers need to choose a genome annotation. However, the effect of genome annotation choice on downstream RNA-seq expression estimates is still unclear. This study (1) investigates the effect of different genome annotations on RNA-seq quantification and (2) provides guidelines for choosing a genome annotation based on research focus. RESULTS: We define the complexity of human genome annotations in terms of the number of genes, isoforms, and exons. This definition facilitates an investigation of potential relationships between complexity and variations in RNA-seq quantification. We apply several evaluation metrics to demonstrate the impact of genome annotation choice on RNA-seq expression estimates. In the mapping stage, the least complex genome annotation, RefSeq Genes, appears to have the highest percentage of uniquely mapped short sequence reads. In the quantification stage, RefSeq Genes results in the most stable expression estimates in terms of the average coefficient of variation over all genes. Stable expression estimates in the quantification stage translate to accurate statistics for detecting differentially expressed genes. We observe that RefSeq Genes produces the most accurate fold-change measures with respect to a ground truth of RT-qPCR gene expression estimates. CONCLUSIONS: Based on the observed variations in the mapping, quantification, and differential expression calling stages, we demonstrate that the selection of human genome annotation results in different gene expression estimates. When conducting research that emphasizes reproducible and robust gene expression estimates, a less complex genome annotation may be preferred. However, simpler genome annotations may limit opportunities for identifying or characterizing novel transcriptional or regulatory mechanisms. When conducting research that aims to be more exploratory, a more complex genome annotation may be preferred.
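
The stability comparison described above rests on the per-gene coefficient of variation across replicates. The sketch below shows one way such a mean CV could be computed and compared between two annotations; the expression matrices are synthetic placeholders, not the study's data.

```python
# Hedged sketch: comparing annotation stability via the average per-gene
# coefficient of variation (CV) across technical replicates. The matrices
# below are synthetic placeholders with arbitrary parameters.
import numpy as np

def mean_cv(expr):
    """expr: genes x replicates matrix of expression estimates (e.g., FPKM)."""
    mu = expr.mean(axis=1)
    sd = expr.std(axis=1, ddof=1)
    keep = mu > 0                      # ignore unexpressed genes
    return float(np.mean(sd[keep] / mu[keep]))

rng = np.random.default_rng(1)
refseq_expr = rng.gamma(shape=5.0, scale=10.0, size=(20000, 4))   # fewer, well-curated genes
aceview_expr = rng.gamma(shape=2.0, scale=25.0, size=(50000, 4))  # more complex annotation

print("RefSeq-like mean CV:", round(mean_cv(refseq_expr), 3))
print("AceView-like mean CV:", round(mean_cv(aceview_expr), 3))
```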


Subjects
Genome, Human; High-Throughput Nucleotide Sequencing/methods; RNA/genetics; Sequence Analysis, RNA/methods; Exons; Genomics/methods; Humans; Protein Isoforms/genetics
4.
BMC Med Imaging ; 13: 9, 2013 Mar 13.
Article in English | MEDLINE | ID: mdl-23497380

ABSTRACT

BACKGROUND: Automatic cancer diagnostic systems based on histological image classification are important for improving therapeutic decisions. Previous studies propose textural and morphological features for such systems. These features capture patterns in histological images that are useful for both cancer grading and subtyping. However, because many of these features lack a clear biological interpretation, pathologists may be reluctant to adopt these features for clinical diagnosis. METHODS: We examine the utility of biologically interpretable shape-based features for classification of histological renal tumor images. Using Fourier shape descriptors, we extract shape-based features that capture the distribution of stain-enhanced cellular and tissue structures in each image and evaluate these features using a multi-class prediction model. We compare the predictive performance of the shape-based diagnostic model to that of traditional models, i.e., using textural, morphological and topological features. RESULTS: The shape-based model, with an average accuracy of 77%, outperforms or complements traditional models. We identify the most informative shapes for each renal tumor subtype from the top-selected features. Results suggest that these shapes are not only accurate diagnostic features, but also correlate with known biological characteristics of renal tumors. CONCLUSIONS: Shape-based analysis of histological renal tumor images accurately classifies disease subtypes and reveals biologically insightful discriminatory features. This method for shape-based analysis can be extended to other histological datasets to aid pathologists in diagnostic and therapeutic decisions.
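
The sketch below illustrates the general idea of Fourier shape descriptors for a single segmented boundary: the contour is treated as a complex signal and the normalized magnitudes of its Fourier coefficients summarize the shape. It is an assumption-laden stand-in for the paper's feature pipeline, using a synthetic circular contour.

```python
# Hedged sketch: Fourier shape descriptors for one segmented structure boundary.
# Illustrates the general technique only; the circular contour is synthetic.
import numpy as np

def fourier_descriptors(contour_xy, n_coeffs=16):
    """contour_xy: (N, 2) array of ordered boundary points."""
    z = contour_xy[:, 0] + 1j * contour_xy[:, 1]   # boundary as a complex signal
    mags = np.abs(np.fft.fft(z))
    # Drop the DC term (translation) and normalize by the first harmonic (scale);
    # using magnitudes discards phase, giving rotation/start-point invariance.
    return mags[2:n_coeffs + 2] / mags[1]

theta = np.linspace(0, 2 * np.pi, 200, endpoint=False)
circle = np.stack([np.cos(theta), np.sin(theta)], axis=1)
print(fourier_descriptors(circle)[:5])   # near zero: a circle has little shape detail
```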


Subjects
Algorithms; Artificial Intelligence; Biopsy/methods; Image Enhancement/methods; Image Interpretation, Computer-Assisted/methods; Neoplasms/pathology; Pattern Recognition, Automated/methods; Humans; Reproducibility of Results; Sensitivity and Specificity
5.
Nanomedicine ; 9(6): 732-6, 2013 Aug.
Article in English | MEDLINE | ID: mdl-23751374

ABSTRACT

Kinases have become one of the most important groups of drug targets. To identify additional kinases with potential for cancer therapy, we developed an integrative approach for the large-scale screening of functional genes capable of regulating the main traits of cancer metastasis. We first employed a self-assembled cell microarray to screen functional genes that regulate cancer cell migration using a human genome kinase siRNA library, and identified 81 genes capable of significantly regulating cancer cell migration. Following up with invasion assays and bioinformatics analysis, we discovered that 16 genes differentially expressed in cancer samples can regulate both cell migration and invasion, 10 of which are well known to play critical roles in cancer development. The remaining 6 genes were experimentally validated to regulate cell proliferation, apoptosis, and anoikis in addition to cell motility. Together, these findings provide new insight into the therapeutic use of human kinases. FROM THE CLINICAL EDITOR: This team of authors has utilized a self-assembled cell microarray to screen genes that regulate cancer cell migration using a human genome siRNA library of kinases. They validated previously known genes and identified novel ones that may serve as therapeutic targets.


Subjects
Neoplasm Metastasis; Neoplasms/enzymology; Phosphotransferases/isolation & purification; Apoptosis/genetics; Cell Movement/genetics; Cell Proliferation; Computational Biology; Genome, Human; HeLa Cells; Humans; Neoplasm Invasiveness/genetics; Neoplasms/pathology; Phosphotransferases/genetics; Phosphotransferases/metabolism; RNA, Small Interfering; Tissue Array Analysis
6.
BMC Bioinformatics ; 13 Suppl 3: S7, 2012 Mar 21.
Article in English | MEDLINE | ID: mdl-22536905

ABSTRACT

BACKGROUND: Selecting an appropriate classifier for a particular biological application poses a difficult problem for researchers and practitioners alike. In particular, choosing a classifier depends heavily on the features selected. For high-throughput biomedical datasets, feature selection is often a preprocessing step that gives an unfair advantage to the classifiers built with the same modeling assumptions. In this paper, we seek classifiers that are suitable to a particular problem independent of feature selection. We propose a novel measure, called "win percentage", for assessing the suitability of machine classifiers to a particular problem. We define win percentage as the probability a classifier will perform better than its peers on a finite random sample of feature sets, giving each classifier equal opportunity to find suitable features. RESULTS: First, we illustrate the difficulty in evaluating classifiers after feature selection. We show that several classifiers can each perform statistically significantly better than their peers given the right feature set among the top 0.001% of all feature sets. We illustrate the utility of win percentage using synthetic data, and evaluate six classifiers in analyzing eight microarray datasets representing three diseases: breast cancer, multiple myeloma, and neuroblastoma. After initially using all Gaussian gene-pairs, we show that precise estimates of win percentage (within 1%) can be achieved using a smaller random sample of all feature pairs. We show that for these data no single classifier can be considered the best without knowing the feature set. Instead, win percentage captures the non-zero probability that each classifier will outperform its peers based on an empirical estimate of performance. CONCLUSIONS: Fundamentally, we illustrate that the selection of the most suitable classifier (i.e., one that is more likely to perform better than its peers) not only depends on the dataset and application but also on the thoroughness of feature selection. In particular, win percentage provides a single measurement that could assist users in eliminating or selecting classifiers for their particular application.
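
The definition of win percentage lends itself to a simple Monte Carlo estimate: repeatedly draw random feature sets, score each classifier on each set, and count wins. The sketch below follows that definition in spirit; the synthetic data, the three classifiers, and the trial count are illustrative assumptions, not the paper's protocol.

```python
# Hedged sketch: Monte Carlo estimate of "win percentage" -- the fraction of
# randomly sampled feature sets on which a classifier outperforms its peers.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB

X, y = make_classification(n_samples=80, n_features=200, n_informative=10, random_state=0)
classifiers = {
    "logreg": LogisticRegression(max_iter=1000),
    "knn": KNeighborsClassifier(3),
    "nb": GaussianNB(),
}
wins = {name: 0 for name in classifiers}
rng = np.random.default_rng(0)

n_trials = 50
for _ in range(n_trials):
    feats = rng.choice(X.shape[1], size=2, replace=False)   # a random feature pair
    scores = {name: cross_val_score(clf, X[:, feats], y, cv=5).mean()
              for name, clf in classifiers.items()}
    wins[max(scores, key=scores.get)] += 1                  # credit the best performer

for name, count in wins.items():
    print(f"{name}: win percentage ~ {100 * count / n_trials:.0f}%")
```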


Subjects
Algorithms; Oligonucleotide Array Sequence Analysis; Breast Neoplasms/diagnosis; Breast Neoplasms/genetics; Humans; Monte Carlo Method; Multiple Myeloma/diagnosis; Multiple Myeloma/genetics; Neuroblastoma/diagnosis; Neuroblastoma/genetics; Normal Distribution
7.
ScientificWorldJournal ; 2012: 989637, 2012.
Article in English | MEDLINE | ID: mdl-23365541

ABSTRACT

Combining multiple microarray datasets increases sample size and leads to improved reproducibility in identification of informative genes and subsequent clinical prediction. Although microarrays have increased the rate of genomic data collection, sample size is still a major issue when identifying informative genetic biomarkers. Because of this, feature selection methods often suffer from false discoveries, resulting in poorly performing predictive models. We develop a simple meta-analysis-based feature selection method that captures the knowledge in each individual dataset and combines the results using a simple rank average. In a comprehensive study that measures robustness in terms of clinical application (i.e., breast, renal, and pancreatic cancer), microarray platform heterogeneity, and classifier (i.e., logistic regression, diagonal LDA, and linear SVM), we compare the rank average meta-analysis method to five other meta-analysis methods. Results indicate that rank average meta-analysis consistently performs well compared to five other meta-analysis methods.
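
The rank-average idea can be sketched in a few lines: rank genes within each dataset by a univariate statistic, then average the ranks across datasets. The example below uses an F-statistic on synthetic datasets; the statistic and data are assumptions for illustration, not the study's exact setup.

```python
# Hedged sketch: rank-average meta-analysis feature selection. Each dataset
# ranks genes independently; per-dataset ranks are averaged into a consensus.
import numpy as np
from sklearn.feature_selection import f_classif
from scipy.stats import rankdata

def rank_average(datasets):
    """datasets: list of (X, y) pairs sharing the same gene order."""
    all_ranks = []
    for X, y in datasets:
        f_scores, _ = f_classif(X, y)
        all_ranks.append(rankdata(-f_scores))   # rank 1 = most informative in this dataset
    return np.mean(all_ranks, axis=0)           # lower average rank = better consensus

rng = np.random.default_rng(0)
datasets = []
for _ in range(3):                              # e.g., three studies of the same disease
    X = rng.normal(size=(30, 1000))
    y = np.array([0] * 15 + [1] * 15)
    X[y == 1, :5] += 1.0                        # genes 0-4 carry signal in every study
    datasets.append((X, y))

consensus = rank_average(datasets)
print("top consensus genes:", np.argsort(consensus)[:5])
```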


Subjects
Algorithms; Computational Biology/methods; Gene Expression Profiling/methods; Meta-Analysis as Topic; Oligonucleotide Array Sequence Analysis/methods; Breast Neoplasms/genetics; Female; Gene Expression Regulation, Neoplastic; Humans; Kidney Neoplasms/genetics; Pancreatic Neoplasms/genetics; Receptors, Estrogen/genetics; Reproducibility of Results
8.
BMC Bioinformatics ; 12: 383, 2011 Sep 29.
Article in English | MEDLINE | ID: mdl-21957981

ABSTRACT

BACKGROUND: In previous work, we reported the development of caCORRECT, a novel microarray quality control system built to identify and correct spatial artifacts commonly found on Affymetrix arrays. We have made recent improvements to caCORRECT, including the development of a model-based data-replacement strategy and integration with typical microarray workflows via caCORRECT's web portal and caBIG grid services. In this report, we demonstrate that caCORRECT improves the reproducibility and reliability of experimental results across several common Affymetrix microarray platforms. caCORRECT represents an advance over state-of-the-art quality control methods such as Harshlighting, and acts to improve gene expression calculation techniques such as PLIER, RMA and MAS5.0, because it incorporates spatial information into outlier detection as well as outlier information into probe normalization. The ability of caCORRECT to recover accurate gene expression values from low-quality probe intensity data is assessed using a combination of real and synthetic artifacts with PCR follow-up confirmation and the affycomp spike-in data. The caCORRECT tool can be accessed at the website: http://cacorrect.bme.gatech.edu. RESULTS: We demonstrate that (1) caCORRECT's artifact-aware normalization avoids the undesirable global data warping that happens when any damaged chips are processed without caCORRECT; (2) when used upstream of RMA, PLIER, or MAS5.0, the data imputation of caCORRECT generally improves the accuracy of microarray gene expression in the presence of artifacts more than using Harshlighting or not using any quality control; (3) biomarkers selected from artifactual microarray data which have undergone the quality control procedures of caCORRECT are more likely to be reliable, as shown by both spike-in and PCR validation experiments. Finally, we present a case study of the use of caCORRECT to reliably identify biomarkers for renal cell carcinoma, yielding two diagnostic biomarkers with potential clinical utility, PRKAB1 and NNMT. CONCLUSIONS: caCORRECT is shown to improve the accuracy of gene expression and the reproducibility of experimental results in clinical application. This study suggests that caCORRECT will be useful for cleaning up possible artifacts in new as well as archived microarray data.
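
Purely as an illustration of spatially aware outlier detection (the general idea, not the caCORRECT algorithm), the sketch below flags probes whose log-intensity deviates strongly from a local spatial median on a synthetic chip image.

```python
# Hedged sketch: spatially aware outlier flagging on a probe-intensity chip image,
# in the spirit of (but far simpler than) caCORRECT. The chip image is synthetic;
# this is NOT the caCORRECT algorithm.
import numpy as np
from scipy.ndimage import median_filter

rng = np.random.default_rng(0)
chip = rng.normal(loc=8.0, scale=0.3, size=(200, 200))   # log2 probe intensities
chip[50:70, 80:110] += 2.5                                # a bright spatial blemish

local_bg = median_filter(chip, size=15)     # smooth spatial background
residual = chip - local_bg
flagged = np.abs(residual) > 4 * residual.std()

print("fraction of probes flagged:", round(float(flagged.mean()), 4))
```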


Subjects
Gene Expression Profiling/methods; Oligonucleotide Array Sequence Analysis/methods; Artifacts; Carcinoma, Renal Cell/genetics; Follow-Up Studies; Humans; Oligonucleotide Array Sequence Analysis/standards; Quality Control; Reproducibility of Results
9.
Sci Rep ; 10(1): 17925, 2020 10 21.
Article in English | MEDLINE | ID: mdl-33087762

ABSTRACT

To use next-generation sequencing technology such as RNA-seq for medical and health applications, choosing proper analysis methods for biomarker identification remains a critical challenge for most users. The US Food and Drug Administration (FDA) has led the Sequencing Quality Control (SEQC) project to conduct a comprehensive investigation of 278 representative RNA-seq data analysis pipelines consisting of 13 sequence mapping, three quantification, and seven normalization methods. In this article, we focused on the impact of the joint effects of RNA-seq pipelines on gene expression estimation as well as the downstream prediction of disease outcomes. First, we developed and applied three metrics (i.e., accuracy, precision, and reliability) to quantitatively evaluate each pipeline's performance on gene expression estimation. We then investigated the correlation between the proposed metrics and the downstream prediction performance using two real-world cancer datasets (i.e., the SEQC neuroblastoma dataset and the NIH/NCI TCGA lung adenocarcinoma dataset). We found that RNA-seq pipeline components jointly and significantly impacted the accuracy of gene expression estimation, and this impact extended to the downstream prediction of cancer outcomes. Specifically, RNA-seq pipelines that produced more accurate, precise, and reliable gene expression estimates tended to perform better in the prediction of disease outcome. Finally, we provide scenarios as guidelines for users to apply these three metrics to select sensible RNA-seq pipelines, improving the accuracy, precision, and reliability of gene expression estimation and, in turn, the downstream gene expression-based prediction of disease outcome.


Subjects
Gene Expression; High-Throughput Nucleotide Sequencing/methods; Neoplasms/genetics; Data Analysis; Datasets as Topic; Humans; Microarray Analysis; Predictive Value of Tests; Prognosis; Quality Control
10.
Trends Biotechnol ; 27(6): 350-8, 2009 Jun.
Article in English | MEDLINE | ID: mdl-19409634

ABSTRACT

Recent advances in biomarker discovery, biocomputing and nanotechnology have raised new opportunities in the emerging fields of personalized medicine (in which disease detection, diagnosis and therapy are tailored to each individual's molecular profile) and predictive medicine (in which genetic and molecular information is used to predict disease development, progression and clinical outcome). Here, we discuss advanced biocomputing tools for cancer biomarker discovery and multiplexed nanoparticle probes for cancer biomarker profiling, in addition to the prospects for and challenges involved in correlating biomolecular signatures with clinical outcome. This bio-nano-info convergence holds great promise for molecular diagnosis and individualized therapy of cancer and other human diseases.


Subjects
Biomarkers, Tumor; Computational Biology; Nanotechnology/methods; Neoplasms/diagnosis; Neoplasms/therapy; Antineoplastic Protocols; Carcinoma, Renal Cell/diagnosis; Carcinoma, Renal Cell/drug therapy; Carcinoma, Renal Cell/therapy; Humans; Kidney Neoplasms/diagnosis; Kidney Neoplasms/drug therapy; Kidney Neoplasms/therapy; Knowledge Bases; Neoplasms/drug therapy
11.
Prog Brain Res ; 158: 83-108, 2006.
Article in English | MEDLINE | ID: mdl-17027692

ABSTRACT

The goal of this chapter is to introduce some of the available computational methods for expression analysis. Genomic and proteomic experimental techniques are briefly discussed to help the reader better understand these methods and results in the context of their biological significance. Furthermore, a case study is presented that illustrates the use of these analytical methods to extract significant biomarkers from high-throughput microarray data. Genomic and proteomic data analysis is essential for understanding the underlying factors that are involved in human disease. Currently, such experimental data are generally obtained by high-throughput microarray or mass spectrometry technologies, among others. The sheer amount of raw data obtained using these methods warrants specialized computational methods for data analysis. Biomarker discovery for neurological diagnosis and prognosis is one such example. By extracting significant genomic and proteomic biomarkers in controlled experiments, we come closer to understanding how biological mechanisms contribute to neurodegenerative diseases such as Alzheimer's and how drug treatments interact with the nervous system. In the biomarker discovery process, there are several computational methods that must be carefully considered to accurately analyze genomic or proteomic data. These methods include quality control, clustering, classification, feature ranking, and validation. Data quality control and normalization methods reduce technical variability and ensure that discovered biomarkers are statistically significant. Preprocessing steps must be carefully selected since they may adversely affect the results of the following expression analysis steps, which generally fall into two categories: unsupervised and supervised. Unsupervised or clustering methods can be used to group similar genomic or proteomic profiles and therefore can elucidate relationships within sample groups. These methods can also assign biomarkers to sub-groups based on their expression profiles across patient samples. Although clustering is useful for exploratory analysis, it is limited by its inability to incorporate expert knowledge. On the other hand, classification and feature ranking are supervised, knowledge-based machine learning methods that estimate the distribution of biological expression data and, in doing so, can extract important information about these experiments. Classification is closely coupled with feature ranking, which is essentially a data reduction method that uses classification error estimation or other statistical tests to score features. Biomarkers can subsequently be extracted by eliminating insignificantly ranked features. These analytical methods may be applied equally to genomic and proteomic data. However, because of both biological differences between the data sources and technical differences between the experimental methods used to obtain these data, it is important to have a firm understanding of the data sources and experimental methods. At the same time, regardless of the data quality, it is inevitable that some discovered biomarkers are false positives. Thus, it is important to validate discovered biomarkers. The validation process may be slow; yet, the overall biomarker discovery process is significantly accelerated by the initial feature ranking and data reduction steps. Information obtained from the validation process may also be used to refine data analysis procedures for future iterations. Biomarker validation may be performed in a number of ways: bench-side in traditional labs, through web-based electronic resources such as gene ontology and literature databases, and in clinical trials.


Subjects
Computational Biology/methods; Genomics/methods; Neurosciences/methods; Proteomics/methods; Animals; Gene Expression; Gene Expression Profiling/methods; Humans
12.
Article in English | MEDLINE | ID: mdl-32655981

ABSTRACT

Cancer survival prediction is an active area of research that can help prevent unnecessary therapies and improve patients' quality of life. Gene expression profiling is widely used in cancer studies to discover informative biomarkers that aid in the prediction of different clinical endpoints. We use multiple modalities of data derived from RNA deep sequencing (RNA-seq) to predict the survival of cancer patients. Despite the wealth of information available in the expression profiles of cancer tumors, fulfilling this objective remains a major challenge, for the most part due to the paucity of data samples compared to the high dimension of the expression profiles. As such, analysis of transcriptomic data modalities calls for state-of-the-art big-data analytics techniques that can maximally use all the available data to discover the relevant information hidden within a significant amount of noise. In this paper, we propose a pipeline that predicts cancer patients' survival by exploiting the structure of the input (manifold learning) and by leveraging unlabeled samples using Laplacian support vector machines, a graph-based semi-supervised learning (GSSL) paradigm. We show that, under certain circumstances, no single modality per se will result in the best accuracy, and that by fusing different models together via a stacked generalization strategy, we may boost the accuracy synergistically. We apply our approach to two cancer datasets and present promising results. We maintain that a similar pipeline can be used for predictive tasks where labeled samples are expensive to acquire.

13.
Article in English | MEDLINE | ID: mdl-27493999

ABSTRACT

The Big Data era in Biomedical research has resulted in large-cohort data repositories such as The Cancer Genome Atlas (TCGA). These repositories routinely contain hundreds of matched patient samples for genomic, proteomic, imaging, and clinical data modalities, enabling holistic and multi-modal integrative analysis of human disease. Using TCGA renal and ovarian cancer data, we conducted a novel investigation of multi-modal data integration by combining histopathological image and RNA-seq data. We compared the performances of two integrative prediction methods: majority vote and stacked generalization. Results indicate that integration of multiple data modalities improves prediction of cancer grade and outcome. Specifically, stacked generalization, a method that integrates multiple data modalities to produce a single prediction result, outperforms both single-data-modality prediction and majority vote. Moreover, stacked generalization reveals the contribution of each data modality (and specific features within each data modality) to the final prediction result and may provide biological insights to explain prediction performance.
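
A minimal sketch of stacked generalization over two modalities follows: per-modality base models produce out-of-fold probability estimates, and a meta-learner is trained on those estimates. The modalities, models, and data below are illustrative placeholders, not the TCGA setup used in the paper.

```python
# Hedged sketch: stacked generalization over two data modalities. Base models are
# trained per modality; a meta-learner combines their out-of-fold predictions.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_predict, train_test_split

rng = np.random.default_rng(0)
n = 200
X_img = rng.normal(size=(n, 30))     # histopathology image features (placeholder)
X_rna = rng.normal(size=(n, 100))    # RNA-seq expression features (placeholder)
y = rng.integers(0, 2, size=n)
X_rna[y == 1, :3] += 1.0             # inject a weak signal in one modality

idx_tr, idx_te = train_test_split(np.arange(n), test_size=0.3, random_state=0, stratify=y)

base_img = RandomForestClassifier(n_estimators=200, random_state=0)
base_rna = LogisticRegression(max_iter=1000)

# Out-of-fold base predictions on the training set become meta-features.
meta_tr = np.column_stack([
    cross_val_predict(base_img, X_img[idx_tr], y[idx_tr], cv=5, method="predict_proba")[:, 1],
    cross_val_predict(base_rna, X_rna[idx_tr], y[idx_tr], cv=5, method="predict_proba")[:, 1],
])
meta_model = LogisticRegression().fit(meta_tr, y[idx_tr])

# Refit base models on all training data, then stack their test-set predictions.
base_img.fit(X_img[idx_tr], y[idx_tr])
base_rna.fit(X_rna[idx_tr], y[idx_tr])
meta_te = np.column_stack([
    base_img.predict_proba(X_img[idx_te])[:, 1],
    base_rna.predict_proba(X_rna[idx_te])[:, 1],
])
print("stacked accuracy:", meta_model.score(meta_te, y[idx_te]))
```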

14.
Article in English | MEDLINE | ID: mdl-26737772

ABSTRACT

We compare methods for filtering low-expression genes from RNA-seq data and investigate the effect of filtering on detection of differentially expressed genes (DEGs). Although RNA-seq technology has improved the dynamic range of gene expression quantification, low-expression genes may be indistinguishable from sampling noise. The presence of noisy, low-expression genes can decrease the sensitivity of detecting DEGs. Thus, identification and filtering of these low-expression genes may improve DEG detection sensitivity. Using the SEQC benchmark dataset, we investigate the effect of different filtering methods on DEG detection sensitivity. Moreover, we investigate the effect of RNA-seq pipelines on optimal filtering thresholds. Results indicate that the filtering threshold that maximizes the total number of DEGs closely corresponds to the threshold that maximizes DEG detection sensitivity. Transcriptome reference annotation, expression quantification method, and DEG detection method are statistically significant RNA-seq pipeline factors that affect the optimal filtering threshold.
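
A minimal sketch of the filtering-threshold sweep follows, assuming a simple counts-per-million (CPM) filter and a crude per-gene test on simulated counts; it does not reproduce the SEQC data or the DEG pipelines evaluated in the paper.

```python
# Hedged sketch: filtering low-expression genes at a counts-per-million (CPM)
# threshold before a crude differential expression test, then counting detected
# DEGs at several thresholds. Counts are simulated placeholders.
import numpy as np
from scipy.stats import mannwhitneyu

rng = np.random.default_rng(0)
counts = rng.negative_binomial(n=5, p=0.3, size=(5000, 8))   # genes x samples
groups = np.array([0, 0, 0, 0, 1, 1, 1, 1])
counts[:200, groups == 1] *= 3                               # 200 simulated true DEGs
cpm = counts / counts.sum(axis=0, keepdims=True) * 1e6

for thr in (0.5, 1, 2, 5, 10):
    keep = (cpm > thr).sum(axis=1) >= 4        # above threshold in at least half the samples
    pvals = np.array([mannwhitneyu(g[groups == 0], g[groups == 1]).pvalue
                      for g in counts[keep]])
    n_deg = int((pvals < 0.05).sum())          # uncorrected, purely for illustration
    print(f"CPM > {thr}: {int(keep.sum())} genes kept, {n_deg} nominal DEGs")
```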


Subjects
RNA/analysis; Sequence Analysis, RNA; Transcriptome; Brain/metabolism; Humans; RNA/chemistry; Real-Time Polymerase Chain Reaction
15.
Article in English | MEDLINE | ID: mdl-26736237

ABSTRACT

Prediction of survival for cancer patients is an open area of research. However, many of these studies focus on datasets with a large number of patients. We present a novel method that is specifically designed to address the challenge of data scarcity, which is often the case for cancer datasets. Our method is able to use unlabeled data to improve classification by adopting a semi-supervised training approach to learn an ensemble classifier. The results of applying our method to three cancer datasets show the promise of semi-supervised learning for prediction of cancer survival.


Subjects
Algorithms; Databases, Factual; Neoplasms/mortality; Female; Humans; Kidney Neoplasms/mortality; Ovarian Neoplasms/mortality; Pancreatic Neoplasms/mortality; Prognosis
16.
ACM BCB ; 2015: 462-471, 2015 Sep.
Article in English | MEDLINE | ID: mdl-27583310

ABSTRACT

While numerous RNA-seq data analysis pipelines are available, research has shown that the choice of pipeline influences the results of differentially expressed gene detection and gene expression estimation. Gene expression estimation is a key step in RNA-seq data analysis, since the accuracy of gene expression estimates profoundly affects the subsequent analysis. Generally, gene expression estimation involves sequence alignment and quantification, and accurate gene expression estimation requires accurate alignment. However, the impact of aligners on gene expression estimation remains unclear. We address this need by constructing nine pipelines consisting of nine spliced aligners and one quantifier. We then use simulated data to investigate the impact of aligners on gene expression estimation. To evaluate alignment, we introduce three alignment performance metrics, (1) the percentage of reads aligned, (2) the percentage of reads aligned with zero mismatch (ZeroMismatchPercentage), and (3) the percentage of reads aligned with at most one mismatch (ZeroOneMismatchPercentage). We then evaluate the impact of alignment performance on gene expression estimation using three metrics, (1) gene detection accuracy, (2) the number of genes falsely quantified (FalseExpNum), and (3) the number of genes with falsely estimated fold changes (FalseFcNum). We found that among various pipelines, FalseExpNum and FalseFcNum are correlated. Moreover, FalseExpNum is linearly correlated with the percentage of reads aligned and ZeroMismatchPercentage, and FalseFcNum is linearly correlated with ZeroMismatchPercentage. Because of this correlation, the percentage of reads aligned and ZeroMismatchPercentage may be used to assess the performance of gene expression estimation for all RNA-seq datasets.
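
For readers who want to reproduce metrics of this kind, the sketch below computes the percentage of reads aligned, ZeroMismatchPercentage, and ZeroOneMismatchPercentage from a BAM file with pysam. The file path is a placeholder, and mismatch counts are read from the aligner-reported NM tag, which assumes the aligner writes that tag.

```python
# Hedged sketch: the three alignment metrics named in the abstract, computed from
# a BAM file with pysam. "aligned.bam" is a placeholder path; mismatches come from
# the NM tag, which is assumed to be present for mapped reads.
import pysam

total = aligned = zero_mm = zero_one_mm = 0
with pysam.AlignmentFile("aligned.bam", "rb") as bam:
    for read in bam.fetch(until_eof=True):
        if read.is_secondary or read.is_supplementary:
            continue                          # count each read once
        total += 1
        if read.is_unmapped:
            continue
        aligned += 1
        nm = read.get_tag("NM") if read.has_tag("NM") else None
        if nm == 0:
            zero_mm += 1
        if nm is not None and nm <= 1:
            zero_one_mm += 1

print(f"reads aligned:             {100 * aligned / total:.2f}%")
print(f"ZeroMismatchPercentage:    {100 * zero_mm / total:.2f}%")
print(f"ZeroOneMismatchPercentage: {100 * zero_one_mm / total:.2f}%")
```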

17.
Article in English | MEDLINE | ID: mdl-26736365

ABSTRACT

Histopathological whole-slide images (WSIs) have emerged as an objective and quantitative means for image-based disease diagnosis. However, WSIs may contain acquisition artifacts that affect downstream image feature extraction and quantitative disease diagnosis. We develop a method for detecting blur artifacts in WSIs using distributions of local blur metrics. As features, these distributions enable accurate classification of WSI regions as sharp or blurry. We evaluate our method using over 1000 portions of an endomyocardial biopsy (EMB) WSI. Results indicate that local blur metrics accurately detect blurry image regions.
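
A minimal sketch of tile-level blur scoring follows, using the variance of the Laplacian as a generic local blur metric; it stands in for, rather than reproduces, the paper's blur-metric distributions, and the image path and threshold are placeholders.

```python
# Hedged sketch: tile a WSI region and score each tile with a simple local blur
# metric (variance of the Laplacian); low-variance tiles are flagged as blurry.
# "region.png" and the threshold are placeholders, not the paper's settings.
import cv2
import numpy as np

img = cv2.imread("region.png", cv2.IMREAD_GRAYSCALE)
tile = 256
scores = []
for r in range(0, img.shape[0] - tile + 1, tile):
    for c in range(0, img.shape[1] - tile + 1, tile):
        patch = img[r:r + tile, c:c + tile]
        scores.append(cv2.Laplacian(patch, cv2.CV_64F).var())

scores = np.array(scores)
blurry = scores < 100.0    # threshold would be tuned on labeled sharp/blurry tiles
print(f"{blurry.mean():.1%} of tiles flagged as blurry")
```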


Subjects
Heart; Artifacts; Biopsy; Humans
18.
Genome Biol ; 16: 133, 2015 Jun 25.
Article in English | MEDLINE | ID: mdl-26109056

ABSTRACT

BACKGROUND: Gene expression profiling is being widely applied in cancer research to identify biomarkers for clinical endpoint prediction. Since RNA-seq provides a powerful tool for transcriptome-based applications beyond the limitations of microarrays, we sought to systematically evaluate the performance of RNA-seq-based and microarray-based classifiers in this MAQC-III/SEQC study for clinical endpoint prediction, using neuroblastoma as a model. RESULTS: We generate gene expression profiles from 498 primary neuroblastomas using both RNA-seq and 44k microarrays. Characterization of the neuroblastoma transcriptome by RNA-seq reveals that more than 48,000 genes and 200,000 transcripts are expressed in this malignancy. We also find that RNA-seq provides much more detailed information on specific transcript expression patterns in clinico-genetic neuroblastoma subgroups than microarrays. To systematically compare the power of RNA-seq and microarray-based models in predicting clinical endpoints, we divide the cohort randomly into training and validation sets and develop 360 predictive models on six clinical endpoints of varying predictability. Evaluation of factors potentially affecting model performance reveals that prediction accuracies are most strongly influenced by the nature of the clinical endpoint, whereas technological platforms (RNA-seq vs. microarrays), RNA-seq data analysis pipelines, and feature levels (gene vs. transcript vs. exon-junction level) do not significantly affect performance. CONCLUSIONS: We demonstrate that RNA-seq outperforms microarrays in determining the transcriptomic characteristics of cancer, while RNA-seq and microarray-based models perform similarly in clinical endpoint prediction. Our findings may be valuable in guiding future studies on the development of gene expression-based predictive models and their implementation in clinical practice.


Subjects
Gene Expression Profiling; Neuroblastoma/genetics; Oligonucleotide Array Sequence Analysis; Sequence Analysis, RNA; Adolescent; Adult; Child; Child, Preschool; Endpoint Determination; Female; Humans; Infant; Infant, Newborn; Male; Models, Genetic; Neuroblastoma/classification; Neuroblastoma/diagnosis; Tumor Cells, Cultured; Young Adult
19.
Article in English | MEDLINE | ID: mdl-25571173

ABSTRACT

RNA-seq enables quantification of the human transcriptome. Estimation of gene expression is a fundamental issue in the analysis of RNA-seq data. However, there is an inherent ambiguity in distinguishing between genes with very low expression and experimental or transcriptional noise. We conducted an exploratory investigation of some factors that may affect gene expression calls. We observed that the distribution of reads that map to exonic, intronic, and intergenic regions are distinct. These distributions may provide useful insights into the behavior of gene expression noise. Moreover, we observed that these distributions are qualitatively similar between two sequence mapping algorithms. Finally, we examined the relationship between gene length and gene expression calls, and observed that they are correlated. This preliminary investigation is important for RNA-seq gene expression analysis because it may lead to more effective algorithms for distinguishing between true gene expression and experimental or transcriptional noise.


Subjects
Gene Expression Profiling; Sequence Analysis, RNA/methods; DNA, Intergenic/genetics; Exons/genetics; Gene Expression Regulation; Humans; Introns/genetics; Transcriptome/genetics
20.
ACM BCB ; 2014: 514-523, 2014 Sep.
Article in English | MEDLINE | ID: mdl-27532062

ABSTRACT

Robust prediction models are important for numerous science, engineering, and biomedical applications. However, best-practice procedures for optimizing prediction models can be computationally complex, especially when choosing models from among hundreds or thousands of parameter choices. Computational complexity has further increased with the growth of data in these fields, concurrent with the era of "Big Data". Grid computing is a potential solution to the computational challenges of Big Data. Desktop grid computing, which uses idle CPU cycles of commodity desktop machines, coupled with commercial cloud computing resources, can enable research labs to gain easier and more cost-effective access to vast computing resources. We have developed omniClassifier, a multi-purpose prediction modeling application that provides researchers with a tool for conducting machine learning research within the guidelines of recommended best practices. omniClassifier is implemented as a desktop grid computing system using the Berkeley Open Infrastructure for Network Computing (BOINC) middleware. In addition to describing implementation details, we use various gene expression datasets to demonstrate the potential scalability of omniClassifier for efficient and robust Big Data prediction modeling. A prototype of omniClassifier can be accessed at http://omniclassifier.bme.gatech.edu/.
