ABSTRACT
In bioinformatics, machine learning methods have been used to predict features embedded in the sequences. In contrast to what is generally assumed, machine learning approaches can also provide new insights into the underlying biology. Here, we demonstrate this by presenting TargetP 2.0, a novel state-of-the-art method to identify N-terminal sorting signals, which direct proteins to the secretory pathway, mitochondria, and chloroplasts or other plastids. By examining the strongest signals from the attention layer in the network, we find that the second residue in the protein, that is, the one following the initial methionine, has a strong influence on the classification. We observe that two-thirds of chloroplast and thylakoid transit peptides have an alanine in position 2, compared with 20% in other plant proteins. We also note that in fungi and single-celled eukaryotes, fewer than 30% of the targeting peptides have an amino acid that allows the removal of the N-terminal methionine, compared with 60% for proteins without a targeting peptide. The importance of this feature for predictions has not been highlighted before.
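The position-2 statistics described above can be illustrated with a minimal Python sketch. It is not TargetP 2.0 code: the function name, the toy sequences, and the set of residues taken to permit removal of the initiator methionine (the small residues A, G, S, T, C, P, V commonly associated with methionine aminopeptidase activity) are assumptions made for illustration.

```python
# Residues at position 2 commonly taken to permit removal of the initiator
# methionine (small side chains). This set is an assumption of the sketch;
# the abstract does not list the residues it used.
MET_REMOVAL_RESIDUES = frozenset("AGSTCPV")


def position2_stats(sequences):
    """Return (fraction with Ala at position 2, fraction permitting Met removal).

    Only sequences that start with methionine and have at least two residues
    are counted.
    """
    usable = [s.upper() for s in sequences if len(s) >= 2 and s[0].upper() == "M"]
    if not usable:
        return 0.0, 0.0
    ala = sum(1 for s in usable if s[1] == "A")
    removable = sum(1 for s in usable if s[1] in MET_REMOVAL_RESIDUES)
    return ala / len(usable), removable / len(usable)


# Hypothetical toy sequences, not real transit peptides.
toy = ["MASSLV", "MAAT", "MKLL", "MSTP"]
ala_frac, removable_frac = position2_stats(toy)
```

Run on two sets of real sequences (for example, proteins with and without targeting peptides), the same function would give the kind of comparison reported in the abstract.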
Subjects
Computational Biology/methods, Peptides/analysis, Peptides/genetics, Amino Acid Sequence, Chloroplasts/genetics, Chloroplasts/metabolism, Deep Learning, Fungi/genetics, Fungi/metabolism, Methionine/metabolism, Protein Sorting Signals, Thylakoids/genetics, Thylakoids/metabolism
ABSTRACT
RNA sequencing has become widely used in gene expression profiling experiments. Prior to any RNA sequencing experiment, the quality of the RNA must be measured to assess whether it can be used for further downstream analysis. The RNA integrity number (RIN) is a scale used to measure the quality of RNA that runs from 1 (completely degraded) to 10 (intact). Ideally, samples with high RIN (> 8) are used in RNA sequencing experiments. RNA, however, is a fragile molecule that is susceptible to degradation, and obtaining high-quality RNA is often hard, or even impossible, when extracting RNA from certain clinical tissues. Thus, occasionally, working with low-quality RNA is the only option the researcher has. Here we investigate the effects of RIN on RNA sequencing and suggest a computational method to handle data from samples with low-quality RNA, which also enables reanalysis of published datasets. Using RNA from a human cell line, we generated and sequenced samples with varying RINs and illustrate what effect the RIN has on the basic procedure of RNA sequencing: both quality aspects and differential expression. We show that the RIN has systematic effects on gene coverage, false positives in differential expression and the quantification of duplicate reads. We introduce 3' tag counting (3TC) as a computational approach to reliably estimate differential expression for samples with low RIN. We show that using the 3TC method in differential expression analysis significantly reduces false positives when comparing samples with different RIN, while retaining reasonable sensitivity.
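The abstract does not spell out how 3TC is computed. One plausible reading of the name is that only reads falling within a fixed window at the 3' end of each transcript are counted, so that 5' coverage lost to degradation does not distort the counts. The Python sketch below illustrates that reading only; the function name, the transcript-coordinate input, and the default window size are hypothetical and are not taken from the paper.

```python
def three_prime_tag_count(read_starts, transcript_length, window=1000):
    """Count reads whose start lies in the last `window` bases of a transcript.

    `read_starts` are 0-based positions in transcript coordinates (5' to 3').
    The window size is an illustrative assumption, not a published value.
    """
    if transcript_length <= 0 or window <= 0:
        raise ValueError("transcript_length and window must be positive")
    window_start = max(0, transcript_length - window)
    return sum(1 for pos in read_starts if window_start <= pos < transcript_length)


# Hypothetical reads on a 2000-base transcript.
reads = [10, 500, 1500, 1900, 1999]
tag_count = three_prime_tag_count(reads, transcript_length=2000, window=1000)
```

Under this assumption, the resulting per-gene counts would be passed to an ordinary differential expression tool in place of whole-gene counts.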
Subjects
RNA Stability, RNA/chemistry, RNA/genetics, RNA Sequence Analysis/methods, Tumor Cell Line, Humans, Transcriptome
ABSTRACT
Melanoma of the eye is a rare and distinct subtype of melanoma that is only rarely familial. However, cases of uveal melanoma (UM) have been found in families with mixed cancer syndromes. Here, we describe a comprehensive search for inherited genetic variation in a family with multiple cases of UM but no aggregation of other cancer diagnoses. The proband is a woman diagnosed with UM at 16 years of age who developed liver metastases within 6 months. We also identified two older paternal relatives of the proband who had died from UM. We performed exome sequencing of germline DNA from members of the affected family. Exome-wide analysis identified a novel loss-of-function mutation in the BAP1 gene, previously suggested as a tumor suppressor. The mutation segregated with the UM phenotype in this family, and we detected a loss of the wild-type allele in the UM tumor of the proband, strongly supporting a causative association with UM. Screening for BAP1 germline mutations in families predisposed to UM may be used to identify individuals at increased risk of disease. Such individuals may then be enrolled in preventive programs and regular screenings to facilitate early detection and thereby improve prognosis.
Subjects
Germ-Line Mutation, Melanoma/genetics, Tumor Suppressor Proteins/genetics, Ubiquitin Thiolesterase/genetics, Uveal Neoplasms/genetics, Adolescent, DNA Mutational Analysis, Family Health, Female, Genetic Predisposition to Disease/genetics, Humans, Male, Melanoma/pathology, Pedigree, Risk Factors, Uveal Neoplasms/pathology
ABSTRACT
Macrophages play a critical role in innate immunity, and the expression of early response genes orchestrates much of the initial response of the immune system. Macrophages undergo extensive transcriptional reprogramming in response to inflammatory stimuli such as lipopolysaccharide (LPS). To identify gene transcription regulation patterns involved in early innate immune responses, we used two genome-wide approaches: gene expression profiling and chromatin immunoprecipitation-sequencing (ChIP-seq) analysis. We examined the effect of 2 h of LPS stimulation on early gene expression and its relation to chromatin remodeling (H3 acetylation; H3Ac) and promoter binding of Sp1 and RNA polymerase II phosphorylated at serine 5 (S5P RNAPII), which is a marker for transcriptional initiation. Our results indicate novel and alternative gene regulatory mechanisms for certain proinflammatory genes. We identified two groups of up-regulated inflammatory genes with respect to chromatin modification and promoter features. One group, including highly up-regulated genes such as tumor necrosis factor (TNF), was characterized by H3Ac, high CpG content and lack of TATA boxes. The second group, containing inflammatory mediators (interleukins and CCL chemokines), was up-regulated upon LPS stimulation despite lacking H3Ac in their annotated promoters, which were low in CpG content but did contain TATA boxes. Genome-wide analysis showed that few H3Ac peaks were unique to either +/-LPS condition. However, within these, an unpacking/expansion of already existing H3Ac peaks was observed upon LPS stimulation. In contrast, a significant proportion of S5P RNAPII peaks (approx. 40%) was unique to either condition. Furthermore, the data indicated a large portion of previously unannotated TSSs, particularly in LPS-stimulated macrophages, where only 28% of unique S5P RNAPII peaks overlap annotated promoters.
The regulation of the inflammatory response appears to occur in a very specific manner at the chromatin level for specific genes and this study highlights the level of fine-tuning that occurs in the immune response.
Subjects
Chromatin/chemistry, Cytokines/metabolism, Gene Expression Profiling, Macrophages/metabolism, Cell Differentiation, Chromatin Immunoprecipitation, CpG Islands, Genome-Wide Association Study, Histones/chemistry, Humans, Immune System, Innate Immunity, Inflammation/genetics, Macrophages/cytology, Biological Models, Monocytes/cytology, Multigene Family, Oligonucleotide Array Sequence Analysis, Genetic Promoter Regions, Protein Binding, Messenger RNA/metabolism, Serine/chemistry
ABSTRACT
BACKGROUND: An interesting field of research in genomics and proteomics is to compare the overlap between the transcriptome and the proteome. Recently, the tools to analyze gene and protein expression on a whole-genome scale have been improved, including the availability of new-generation sequencing instruments and high-throughput antibody-based methods to analyze the presence and localization of proteins. In this study, we used massive transcriptome sequencing (RNA-seq) to investigate the transcriptome of a human osteosarcoma cell line and compared the expression levels with in situ protein data obtained from antibody-based immunohistochemistry (IHC) and immunofluorescence microscopy (IF). RESULTS: A large-scale analysis based on 2749 genes was performed, corresponding to approximately 13% of the protein coding genes in the human genome. We found both RNA and protein for a large fraction of the analyzed genes, with 60% of the analyzed human genes detected by all three methods. Only 34 genes (1.2%) were not detected at the transcript or protein level by any method. Our data suggest that the majority of the human genes are expressed at detectable transcript or protein levels in this cell line. Since the reliability of antibodies depends on possible cross-reactivity, we compared the RNA and protein data using antibodies with different reliability scores based on various criteria, including Western blot analysis. Gene products detected in all three platforms generally have good antibody validation scores, while those detected only by antibodies, but not by RNA sequencing, more often correspond to low-scoring antibodies. CONCLUSION: This suggests that some antibodies are staining the cells in an unspecific manner, and that assessment of transcript presence by RNA-seq can provide guidance for validation of the corresponding antibodies.
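The three-way comparison in RESULTS amounts to set arithmetic over the analyzed genes. The following Python sketch shows one way such a summary could be computed; the function name, the dictionary keys, and the toy gene identifiers are hypothetical, and real detection calls would depend on thresholds that the abstract does not state.

```python
def detection_summary(analyzed, rna_seq, ihc, immunofluorescence):
    """Summarize detection overlap among RNA-seq, IHC and IF for analyzed genes."""
    analyzed = set(analyzed)
    if not analyzed:
        raise ValueError("no genes were analyzed")
    rna = analyzed & set(rna_seq)
    antibody = analyzed & (set(ihc) | set(immunofluorescence))
    all_three = rna & set(ihc) & set(immunofluorescence)
    undetected = analyzed - rna - antibody
    return {
        "all_three": len(all_three) / len(analyzed),
        "undetected": len(undetected) / len(analyzed),
        # Genes seen by an antibody method but not by RNA-seq: candidates for
        # unspecific staining, following the abstract's reasoning.
        "antibody_only": sorted(antibody - rna),
    }


# Hypothetical toy data.
genes = {"G1", "G2", "G3", "G4", "G5"}
summary = detection_summary(
    genes,
    rna_seq={"G1", "G2", "G3"},
    ihc={"G1", "G2", "G4"},
    immunofluorescence={"G1", "G3", "G4"},
)
```

In this toy input, `G4` is detected by both antibody methods but not by RNA-seq, which is the pattern the abstract associates with low-scoring antibodies.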
Subjects
Neoplasm Proteins/metabolism, Osteosarcoma/genetics, Osteosarcoma/metabolism, Messenger RNA/genetics, Western Blotting, Tumor Cell Line, Neoplastic Gene Expression Regulation, Neoplasm Genes/genetics, Humans, Immunohistochemistry, Neoplasm Proteins/genetics, Messenger RNA/metabolism, Neoplasm RNA/genetics, Neoplasm RNA/metabolism, Reproducibility of Results, Genetic Transcription
ABSTRACT
Several recent studies have indicated that transcription is pervasive in regions outside of protein coding genes and that short antisense transcripts can originate from the promoter and terminator regions of genes. Here we investigate transcription of fragments longer than 200 nucleotides, focusing on antisense transcription for known protein coding genes and intergenic transcription. We find that roughly 12% to 16% of all reads that originate from promoter and terminator regions, respectively, map antisense to the gene in question. Furthermore, we detect a high number of novel transcriptionally active regions (TARs) that are generally expressed at a lower level than protein coding genes. We find that the correlation between RNA-seq data and microarray data is dependent on the gene length, with longer genes showing a better correlation. We detect high antisense transcriptional activity from promoter, terminator and intron regions of protein-coding genes and identify a vast number of previously unidentified TARs, including putative novel EGFR transcripts. This shows that in-depth analysis of the transcriptome using RNA-seq is a valuable tool for understanding complex transcriptional events. Furthermore, the development of new algorithms for estimation of gene expression from RNA-seq data is necessary to minimize length bias.
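The antisense proportions quoted above are fractions of reads in a region that map to the strand opposite the annotated gene. A minimal Python sketch of that calculation follows. It assumes strand-specific data in which each read's strand reflects the strand of the originating transcript; the function name and the toy input are hypothetical.

```python
def antisense_fraction(read_strands, gene_strand):
    """Fraction of reads mapping to the strand opposite the gene.

    `read_strands` is an iterable of '+' or '-' for reads in one region
    (for example a promoter or terminator); other values are ignored.
    """
    if gene_strand not in ("+", "-"):
        raise ValueError("gene_strand must be '+' or '-'")
    reads = [s for s in read_strands if s in ("+", "-")]
    if not reads:
        return 0.0
    antisense = sum(1 for s in reads if s != gene_strand)
    return antisense / len(reads)


# Hypothetical reads over the promoter region of a plus-strand gene.
promoter_reads = ["+", "+", "-", "+", "+", "+", "-", "+"]
fraction = antisense_fraction(promoter_reads, "+")
```

Pooling reads from the same region type across all genes before computing the fraction would give a genome-wide figure comparable in kind to the percentages in the abstract.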
Subjects
Antisense Oligonucleotides/genetics, Genetic Transcription, Tumor Cell Line, ErbB Receptors/genetics, Neoplastic Gene Expression Regulation, Human Genome, Humans, Introns, Genetic Models, Nucleotides/chemistry, Oligonucleotide Array Sequence Analysis, Antisense Oligonucleotides/chemistry
ABSTRACT
Carefully curated proteomes of the inner envelope membrane, the thylakoid membrane, and the thylakoid lumen of chloroplasts from Arabidopsis were assembled based on published, well-documented localizations. These curated proteomes were evaluated for distribution of physical-chemical parameters, with the goal of extracting parameters for improved subcellular prediction and subsequent identification of additional (low abundant) components of each membrane system. The assembly of rigorously curated subcellular proteomes is in itself also important as a parts list for plant and systems biology. Transmembrane and subcellular prediction strategies were evaluated using the curated data sets. The three curated proteomes differ strongly in average isoelectric point and protein size, as well as transmembrane distribution. Removal of the cleavable, N-terminal transit peptide sequences greatly affected isoelectric point and size distribution. Unexpectedly, the Cys content was much lower for the thylakoid proteomes than for the inner envelope. This likely relates to the role of the thylakoid membrane in light-driven electron transport and helps to avoid unwanted oxidation-reduction reactions. A rule of thumb for discriminating between the predicted integral inner envelope membrane and integral thylakoid membrane proteins is suggested. Using a combination of predictors and experimentally derived parameters, four plastid subproteomes were predicted from the fully annotated Arabidopsis genome. These predicted subproteomes were analyzed for their properties and compared to the curated proteomes. The sensitivity and accuracy of the prediction strategies are discussed. Data can be extracted from the new plastid proteome database (http://ppdb.tc.cornell.edu).