ABSTRACT
Accurate prediction of atomic-level protein structure is important for annotating the biological functions of protein molecules and for designing new compounds to regulate those functions. Template-based modeling (TBM), which aims to construct structural models by copying and refining the structural frameworks of other known proteins, remains the most accurate method for protein structure prediction. Due to the difficulty in recognizing distant-homology templates, however, the accuracy of TBM decreases rapidly when the evolutionary relationship between the query and template vanishes. In this study, we propose a new method, CEthreader, which first predicts residue-residue contacts by coupling evolutionary precision matrices with deep residual convolutional neural networks. The predicted contact maps are then integrated with sequence profile alignments to recognize structural templates from the PDB. The method was tested on two independent benchmark sets consisting collectively of 1,153 non-homologous protein targets, where CEthreader detected 176% or 36% more correct templates with a TM-score >0.5 than the best state-of-the-art profile- or contact-based threading methods, respectively, for the Hard targets that lacked homologous templates. Moreover, CEthreader was able to identify 114% or 20% more correct templates with the same Fold as the query, after excluding structures from the same SCOPe Superfamily, than the best profile- or contact-based threading methods. Detailed analyses show that the major advantage of CEthreader lies in the efficient coupling of contact maps with profile alignments, which helps recognize the global fold of protein structures when the homologous relationship between the query and template is weak. These results demonstrate an efficient new strategy to combine ab initio contact map prediction with profile alignments to significantly improve the accuracy of template-based structure prediction, especially for distant-homology proteins.
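The "TM-score >0.5" criterion used above to call a template correct can be illustrated with a minimal sketch. It evaluates the standard TM-score sum for one fixed superposition, given the distances between aligned residue pairs; the real score is the maximum over superpositions, which this sketch does not search for. The function names, the 0.5 Å floor on d0 for short chains, and the input format are assumptions made for illustration and are not taken from CEthreader.

```python
def d0_scale(target_length):
    """Length-dependent distance scale d0 (angstroms) of the TM-score.

    Assumption: d0 is floored at 0.5 for short chains, a convention
    adopted here so the formula stays well defined for small lengths.
    """
    if target_length <= 15:
        return 0.5
    d0 = 1.24 * (target_length - 15) ** (1.0 / 3.0) - 1.8
    return max(d0, 0.5)


def tm_score(distances, target_length):
    """TM-score of one fixed superposition.

    distances: distances (angstroms) between aligned residue pairs.
    target_length: number of residues in the query (normalization length).
    """
    d0 = d0_scale(target_length)
    total = sum(1.0 / (1.0 + (d / d0) ** 2) for d in distances)
    return total / target_length


def is_correct_template(distances, target_length, cutoff=0.5):
    """Apply the TM-score > 0.5 criterion quoted in the abstract."""
    return tm_score(distances, target_length) > cutoff
```

Because the sum is divided by the query length rather than the number of aligned pairs, a template that aligns well to only a small part of the query still scores low.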
Subjects
Nerve Net/physiology, Protein Sequence Analysis/methods, Protein Structural Homology, Algorithms, Amino Acid Sequence, Computational Biology/methods, Protein Databases, Biological Models, Protein Conformation, Proteins/chemistry, Sequence Alignment, Software
ABSTRACT
Many computational methods have been proposed to predict essential proteins from protein-protein interaction (PPI) networks, but improving prediction accuracy remains challenging. In this study, we propose a new method, esPOS (essential proteins Predictor using Order Statistics), to predict essential proteins from PPI networks. First, we refine the networks using gene expression and subcellular localization information. Second, we design new features that combine predicted protein secondary structure with the PPI network, and we show that these features are useful for predicting essential proteins. Third, we optimize these features with a greedy method and combine the optimized features using an order statistic method. Our method achieves a prediction accuracy of 0.76-0.79 on two network datasets. The proposed method is available at https://sourceforge.net/projects/espos/.
Subjects
Algorithms, Computational Biology/methods, Protein Interaction Maps, Statistics as Topic, Protein Databases, Predictive Value of Tests
ABSTRACT
In this study, we sequenced the first full-length insect transcriptome, that of Erthesina fullo Thunberg, on the PacBio platform. We constructed the first quantitative transcription map of animal mitochondrial genomes and built a straightforward and concise methodology to investigate mitochondrial gene transcription, RNA processing, mRNA maturation and several other related topics. Most of the results were consistent with previous studies, while, to the best of our knowledge, some findings are reported here for the first time. The new findings include the high levels of mitochondrial gene expression, the 3' polyadenylation and possible 5' m(7)G caps of rRNAs, the isoform diversity of 12S rRNA, and the polycistronic transcripts and natural antisense transcripts of mitochondrial genes, among others. These findings could challenge and enrich fundamental concepts of mitochondrial gene transcription and RNA processing, particularly of the rRNA primary (sequence) structure. The methodology constructed in this study can also be used to study gene expression or RNA processing of nuclear genomes.
Subjects
Gene Expression Profiling, Insect Genes, Mitochondrial Genes, Transcriptome, Animals, Computational Biology/methods, Gene Expression Regulation, Gene Order, Mitochondrial Genome, High-Throughput Nucleotide Sequencing, Insects/genetics, RNA Isoforms, RNA Precursors/genetics, Post-Transcriptional RNA Processing, Antisense RNA, Messenger RNA/genetics, Ribosomal RNA/genetics, Genetic Transcription
ABSTRACT
MOTIVATION: Off-target interactions of the popular immunosuppressant cyclosporine A (CSA) with several proteins besides its molecular target, cyclophilin A, are implicated in the activation of signaling pathways that lead to numerous side effects of this drug. RESULTS: Using the structural human proteome and a novel algorithm for inverse ligand binding prediction, ILbind, we determined a comprehensive set of 100+ putative partners of CSA. We empirically show that the predictive quality of ILbind is better than that of other available predictors for this compound. We linked the putative target proteins, which include many new partners of CSA, with cellular functions, canonical pathways and toxicities that are typical for patients who take this drug. We used complementary approaches (molecular docking, molecular dynamics, surface plasmon resonance binding analysis and enzymatic assays) to validate and characterize three novel CSA targets: calpain 2, caspase 3 and p38 MAP kinase 14. The three targets are involved in the apoptotic pathways, are interconnected and are implicated in nephrotoxicity.
Subjects
Cyclosporine/chemistry, Immunosuppressive Agents/chemistry, Proteomics/methods, Algorithms, Calpain/chemistry, Calpain/metabolism, Caspase 3/chemistry, Caspase 3/metabolism, Cyclosporine/metabolism, Humans, Immunosuppressive Agents/metabolism, Mitogen-Activated Protein Kinase 14/chemistry, Mitogen-Activated Protein Kinase 14/metabolism, Molecular Docking Simulation, Proteome/chemistry, Signal Transduction, Surface Plasmon Resonance
ABSTRACT
Proteins fold through either a two-state (TS) process, with no visible intermediates, or a multi-state (MS) process, via at least one intermediate. We analyze sequence-derived factors that determine folding types by introducing a novel sequence-based folding type predictor called FOKIT. This method implements a logistic regression model with six input features that combine information on amino acid composition, predicted secondary structure and predicted solvent accessibility. FOKIT provides predictions with an average Matthews correlation coefficient (MCC) between 0.58 and 0.91, measured using out-of-sample tests on four benchmark datasets. These results are shown to be competitive with or better than the results of four modern predictors. We also show that FOKIT outperforms these methods when predicting chains that share low similarity with the chains used to build the model, which is an important advantage given the limited number of annotated chains. We demonstrate that inclusion of solvent accessibility helps discriminate between the folding kinetic types and that three of the features constitute statistically significant markers that differentiate TS and MS folders. We found that increased content of exposed Trp and buried Leu is indicative of MS folding, which implies that the exposure or burial of certain hydrophobic residues may play an important role in the formation of folding intermediates. Our conclusions are supported by two case studies.
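The Matthews correlation coefficient reported above can be computed directly from a binary confusion matrix. The sketch below is a generic illustration, not code from FOKIT; treating multi-state (MS) folders as the positive class and returning 0.0 when the denominator vanishes are both assumptions made here.

```python
import math


def matthews_corrcoef(tp, tn, fp, fn):
    """Matthews correlation coefficient from confusion-matrix counts.

    Assumption for this illustration: MS folders are the positive class,
    TS folders the negative class. Returns 0.0 when any marginal count is
    zero, a common convention for the undefined case.
    """
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denominator == 0:
        return 0.0
    return (tp * tn - fp * fn) / denominator
```

The coefficient ranges from -1 (every prediction wrong) through 0 (no better than chance) to 1 (perfect prediction), and unlike plain accuracy it stays informative when one folding type is much rarer than the other.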
Subjects
Proteins/analysis, Protein Sequence Analysis, Protein Databases, Kinetics, Logistic Models, Protein Folding, Protein Secondary Structure, Solvents/chemistry
ABSTRACT
Chromatin immunoprecipitation coupled with next-generation sequencing (ChIP-seq) is becoming a key technology for studying transcriptional regulation in the context of functional genomics. Because ChIP-seq experiments generate an overwhelming amount of data, ChIP-seq data processing poses many new challenges for bioinformatics. Since the development of data processing methods lags well behind that of ChIP-seq experimental techniques, a review of ChIP-seq data processing is urgently needed to help the growing number of researchers entering the field build or improve algorithms. This paper provides a brief overview of ChIP-seq data processing, highlighting the main problems and methods in detail, so that scientists can quickly gain a thorough understanding.
Subjects
Chromatin Immunoprecipitation/methods, Electronic Data Processing/methods, DNA Sequence Analysis/methods, Algorithms, Animals, Computational Biology/methods, Humans
ABSTRACT
Background: Coronavirus disease 2019 (COVID-19) is caused by severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2). Although unprecedented efforts are underway to develop therapeutic strategies against this disease, scientists have acquired only limited knowledge of the structures and functions of the CoV replication and transcription complex (RTC). Ascertaining all the RTC components and their arrangement is an indispensable step toward the eventual determination of its global structure, leading to a complete understanding of all of its functions at the molecular level. Results: The main results include: 1) hairpins containing the canonical and non-canonical NSP15 cleavage motifs are canonical and non-canonical transcription regulatory sequence (TRS) hairpins; 2) TRS hairpins can be used to identify recombination regions in CoV genomes; 3) RNA methylation participates in the determination of the local RNA structures in CoVs by affecting the formation of base pairing; and 4) the eventual determination of the CoV RTC global structure needs to consider METTL3 in the experimental design. Conclusions: In the present study, we proposed the theoretical arrangement of NSP12-15 and METTL3 in the global RTC structure and constructed a model to explain how the RTC functions in the jumping transcription of CoVs. As the most important finding, TRS hairpins were reported for the first time to interpret NSP15 cleavage, RNA methylation of CoVs and their association at the molecular level. Our findings enrich fundamental knowledge in the field of gene expression and its regulation, providing a crucial basis for future studies.
ABSTRACT
In the present study, we performed precise annotation of the Drosophila melanogaster, D. simulans, D. grimshawi and Bactrocera oleae mitochondrial (mt) genomes using pan RNA-seq analysis. Using PacBio cDNA-seq data from D. simulans, we precisely annotated the Transcription Initiation Sites (TISs) of the mt Heavy and Light strands in Drosophila mt genomes and reported that the polyA(+) and polyA(-) motifs in the control regions (CRs) are associated with TISs. The discovery of the conserved polyA(+) and polyA(-) motifs provides insights into the many polyA and polyT sequences in the CRs of insect mt genomes, helping to reveal mt transcription and its regulation in invertebrates. Notably, we propose that: (1) polyA/polyT motifs in CRs function as signals to initiate mtDNA transcription; (2) the duplication, recombination or mutation of these polyA/polyT sequences formed the AT-rich regions during evolution; and (3) since the CRs of many invertebrate species still contain many polyA/polyT sequences, there is a high probability that several TISs and transcription termination sites (TTSs) exist in invertebrate mt genomes.
Subjects
Mitochondrial Genome, Animals, Mitochondrial DNA/genetics, Drosophila/genetics, Drosophila melanogaster/genetics, Insect Genome
ABSTRACT
Background: Currently, methylotrophic yeasts (e.g., Pichia pastoris, Ogataea polymorpha, and Candida boidinii) are subjects of intense genomics studies in basic research and industrial applications. In the genus Ogataea, most research is focused on three basic O. polymorpha strains: CBS4732, NCYC495, and DL-1. However, the relationship between CBS4732, NCYC495, and DL-1 remains unclear, as the genomic differences between them have not been exactly determined in the absence of high-quality complete genomes. As a nutritionally deficient mutant derived from CBS4732, the O. polymorpha strain CBS4732 ura3Δ (named HU-11) is being used for high-yield production of several important proteins or peptides. HU-11 has the same reference genome as CBS4732 (noted as HU-11/CBS4732), because the only genomic difference between them is a 5-bp insertion. Results: In the present study, we have assembled the full-length genome of O. polymorpha HU-11/CBS4732 using high-depth PacBio and Illumina data. Long terminal repeat retrotransposons (LTR-rts), rDNA, 5' and 3' telomeric, subtelomeric, low complexity and other repeat regions were exactly determined to improve the genome quality. In brief, the main findings include complete rDNAs, complete LTR-rts, three large duplicated segments in subtelomeric regions and three structural variations between the HU-11/CBS4732 and NCYC495 genomes. These findings are very important for the assembly of full-length genomes of yeast and the correction of assembly errors in the published genomes of Ogataea spp. HU-11/CBS4732 is so phylogenetically close to NCYC495 that the syntenic regions cover nearly 100% of their genomes. Moreover, HU-11/CBS4732 and NCYC495 share a nucleotide identity of 99.5% across their whole genomes. CBS4732 and NCYC495 can be regarded as the same strain in basic research and industrial applications. Conclusion: The present study preliminarily revealed the relationship between CBS4732, NCYC495, and DL-1. Our findings provide new opportunities for in-depth understanding of genome evolution in methylotrophic yeasts and lay the foundations for the industrial applications of O. polymorpha CBS4732, NCYC495, DL-1, and their derivative strains. The full-length genome of O. polymorpha HU-11/CBS4732 should be included in the NCBI RefSeq database for future studies of Ogataea spp.
ABSTRACT
To further elucidate the mechanisms underlying protein β-sheet formation, we studied the rules by which β-strands align to form β-sheet structures, using statistical and machine learning approaches. First, statistical analysis was performed on the number of β-strands lying between the two members of each β-strand pair in protein sequences. The results showed a propensity for near-neighbor pairing (also called "first come, first pair") among β-strand pairs. Second, on the same dataset, pairwise cross-combinations of real β-strand pairs and four kinds of pairs containing pseudo-β-strands were classified with a support vector machine (SVM). A novel feature extraction approach based on the average amino acid pairing encoding matrix (APEM) was designed for the classification. The classification results indicated that a β-strand segment was able to distinguish β-strands from α-helix and coil segments. However, the results also showed that a β-strand had no strongly conserved ability to select its real partner from among all the alternative β-strand partners, which is consistent with the results of the statistical analysis. Thus, the "first come, first pair" propensity and the non-conserved ability to select the real partner are possibly important factors affecting how β-strands align to form β-sheet structures.
Subjects
Chemical Models, Protein Secondary Structure, Proteins/chemistry, Algorithms, Artificial Intelligence, Protein Databases, Protein Folding, Sequence Alignment
ABSTRACT
In December 2019, the world awoke to a new betacoronavirus strain named severe acute respiratory syndrome coronavirus-2 (SARS-CoV-2). Betacoronavirus consists of the A, B, C and D subgroups. Both SARS-CoV and SARS-CoV-2 belong to betacoronavirus subgroup B. In the present study, we divided betacoronavirus subgroup B into the SARS1 and SARS2 classes by six key insertions and deletions (InDels) in betacoronavirus genomes, and identified a recently detected betacoronavirus strain, RmYN02, as a recombinant strain across the SARS1 and SARS2 classes, which has the potential to generate a new strain with a risk similar to that of SARS-CoV and SARS-CoV-2. By analyzing genomic features of betacoronavirus, we concluded: (1) the jumping transcription and recombination of CoVs share the same molecular mechanism, which inevitably causes CoV outbreaks; (2) recombination, receptor binding abilities, junction furin cleavage sites (FCSs), first hairpins and ORF8s are the main factors contributing to the extraordinary transmission, virulence and host adaptability of betacoronavirus; and (3) the strong recombination ability of CoVs integrated the other main factors to generate multiple recombinant strains, two of which evolved into SARS-CoV and SARS-CoV-2, resulting in the SARS and COVID-19 pandemics. As the most important genomic features of SARS-CoV and SARS-CoV-2, an enhanced ORF8 and a novel junction FCS, respectively, are indispensable clues for future studies of their origin and evolution. The WIV1 strain without the enhanced ORF8 and the RaTG13 strain without the junction FCS "RRAR" may contribute to, but are not the immediate ancestors of, SARS-CoV and SARS-CoV-2, respectively.
ABSTRACT
BACKGROUND: Coronavirus disease 2019 (COVID-19) is caused by severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2). Although a preliminary understanding of the replication and transcription of SARS-CoV-2 has recently emerged, their regulation remains unknown. RESULTS: By comprehensive analysis of genome sequence and protein structure data, we propose a negative feedback model to explain the regulation of CoV replication and transcription, providing a molecular basis for the "leader-to-body fusion" model. The key step leading to the proposal of our model was the identification of the transcription regulatory sequence (TRS) motifs as the cleavage sites of nsp15, a nidoviral RNA uridylate-specific endoribonuclease (NendoU). According to this model, nsp15 regulates the synthesis of subgenomic RNAs (sgRNAs) and genomic RNAs (gRNAs) by cleaving TRSs. The expression level of nsp15 controls the relative proportions of sgRNAs and gRNAs, which in turn change the expression level of nsp15 to reach equilibrium between CoV replication and transcription. CONCLUSION: The replication and transcription of CoVs are regulated by a negative feedback mechanism that influences the persistence of CoVs in hosts. Our findings enrich fundamental knowledge in the field of gene expression and its regulation, and provide new clues for future studies. One important clue is that nsp15 may be an important and ideal target for the development of drugs (e.g., uridine derivatives) against CoVs.
ABSTRACT
Protein folding rates vary by several orders of magnitude and depend on the topology of the fold and on the size and composition of the sequence. Although recent works show that the rates can be predicted from the sequence, allowing for high-throughput annotations, they consider only the sequence and its predicted secondary structure. We propose a novel sequence-based predictor, PFR-AF, which utilizes solvent accessibility and residue flexibility predicted from the sequence to improve predictions and provide insights into the folding process. The predictor includes three linear regressions for proteins with two-state, multistate, and unknown (mixed-state) folding kinetics. PFR-AF on average outperforms current methods when tested on three datasets. The proposed approach provides high-quality predictions in the absence of similarity between the predicted and the training sequences. PFR-AF's predictions are characterized by high correlation (between 0.71 and 0.95, depending on the dataset) and the lowest mean absolute errors (between 0.75 and 0.9) with respect to the experimental rates, as measured using out-of-sample tests. Our models reveal that for the two-state chains inclusion of solvent-exposed Ala may accelerate folding, while increased content of Ile may reduce the folding speed. We also demonstrate that increased flexibility of coils facilitates faster folding and that proteins with a larger content of solvent-exposed strands may fold at a slower pace. The increased flexibility of the solvent-exposed residues is shown to slow folding, which also holds, with a lower correlation, for buried residues. Two case studies are included to support our findings.
Subjects
Computational Biology/methods, Protein Folding, Proteins, Protein Databases, Kinetics, Linear Models, Protein Conformation, Proteins/chemistry, Proteins/metabolism, Protein Sequence Analysis, Solvents
ABSTRACT
In principle, structural information for protein sequences with no detectable homology to a protein of known structure could be obtained by predicting the arrangement of their secondary structural elements. Although some ab initio methods for protein structure prediction have been reported, the long-range interactions required to accurately predict tertiary structures of beta-sheet containing proteins are still difficult to simulate. To remedy this problem and facilitate de novo prediction of beta-sheet containing protein structures, we developed a support vector machine (SVM) approach that classifies the parallel and antiparallel orientation of beta-strands using information on interstrand amino acid pairing preferences. Based on second-order statistics of the relative frequencies of each possible interstrand amino acid pair, we defined an average amino acid pairing encoding matrix (APEM) for encoding beta-strands as input to the prediction model. As a result, a prediction accuracy of 86.89% and a Matthews correlation coefficient of 0.71 were achieved through 7-fold cross-validation on a non-redundant protein dataset from PISCES. Although several issues remain to be studied, the method presented here indicates, to some extent, the important contribution of amino acid pairs to beta-strand orientation, and it could be combined with other algorithms to achieve a full 'identification' of beta-strands.
Subjects
Amino Acids/chemistry, Theoretical Models, Proteins/chemistry
ABSTRACT
We investigate the relationship between flexibility, expressed with the B-factor, and relative solvent accessibility (RSA) in the context of the local (with respect to the sequence) neighborhood and related concepts such as residue depth. We observe that the flexibility of a given residue is strongly influenced by the solvent accessibility of the adjacent neighbors. The mean normalized B-factor of the exposed residues with two buried neighbors is smaller than that of the buried residues with two exposed neighbors. Inclusion of the RSA of the neighboring residues (local RSA) significantly increases correlation with the B-factor. Correlation between the local RSA and the B-factor is shown to be stronger than the correlation that considers local distance- or volume-based residue depth. We also found that the correlation coefficients between B-factor and RSA for the 20 amino acids, called the flexibility-exposure correlation index, are strongly correlated with the stability scale that characterizes the average contribution of each amino acid to folding stability. Our results reveal that the predicted RSA could be used to distinguish between disordered and ordered residues and that the inclusion of local predicted RSA values helps provide a better contrast between these two types of residues. Prediction models developed based on local actual RSA and local predicted RSA show similar or better results in the context of B-factor and disorder predictions when compared with several existing approaches. We validate our models using three case studies, which show that this work provides useful clues for deciphering the structure-flexibility-function relation.
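The idea of a "local RSA" can be sketched as a sliding-window average over sequence neighbors, followed by a Pearson correlation with the B-factor. This is only an illustrative reading of the abstract: the unweighted mean, the window size (`half_window`), and the function names are assumptions, and the study may define local RSA differently.

```python
import math


def local_rsa(rsa, half_window=1):
    """Average RSA over each residue and its sequence neighbors.

    Assumption: an unweighted mean over positions i-half_window..i+half_window,
    truncated at the chain ends.
    """
    n = len(rsa)
    averaged = []
    for i in range(n):
        window = rsa[max(0, i - half_window):min(n, i + half_window + 1)]
        averaged.append(sum(window) / len(window))
    return averaged


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)
```

Under these assumptions, comparing `pearson(rsa, b_factor)` with `pearson(local_rsa(rsa), b_factor)` on per-residue values from one chain is the kind of check the abstract describes when it says that including neighboring residues increases the correlation.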
Subjects
Proteins/chemistry, Solvents/chemistry, Binding Sites, Computational Biology, Humans, Linear Models, Protein Secondary Structure
ABSTRACT
It is widely considered inappropriate to treat beta-pairs in isolation, since other secondary structural elements (such as helices and coils), protein topology and protein tertiary structure would limit beta-strand pairing. However, to understand the underlying mechanisms of beta-sheet formation, studies ought to be performed separately on more concrete aspects. In this study, we focus on the parallel or antiparallel orientation of beta-strands. First, statistical analysis was performed on the relative frequencies of the interstrand amino acid pairs within parallel and antiparallel beta-strands. Features were then extracted from the statistical results by singular value decomposition. By using a support vector machine to distinguish the features extracted from the two types of beta-strands, high accuracy was achieved (up to 99.4%). This suggests that the interstrand amino acid pairs play a significant role in determining the parallel or antiparallel orientation of beta-strands. These results may provide useful information for developing other algorithms to examine beta-strand folding pathways, and could eventually contribute to protein structure prediction.
Subjects
Amino Acids/chemistry, Chemical Models, Protein Secondary Structure, Amino Acid Sequence, Statistical Data Interpretation
ABSTRACT
MOTIVATION: Prediction of catalytic residues provides useful information for research on enzyme function. Most of the existing prediction methods are based on structural information, which limits their use. We propose a sequence-based catalytic residue predictor that provides predictions with quality comparable to modern structure-based methods and that exceeds the quality of state-of-the-art sequence-based methods. RESULTS: Our method (CRpred) uses sequence-based features and the sequence-derived PSI-BLAST profile. We used feature selection to reduce the dimensionality of the input (and explain the input) to the support vector machine (SVM) classifier that provides predictions. Tests on eight datasets and side-by-side comparison with six modern structure- and sequence-based predictors show that CRpred provides predictions with quality comparable to current structure-based methods and better than sequence-based methods. The proposed method obtains 15-19% precision and a 48-58% TP (true positive) rate, depending on the dataset used. CRpred also provides confidence values that allow selecting a subset of predictions with higher precision. The improved quality is due to newly designed features and careful parameterization of the SVM. The features incorporate amino acids characterized by the highest and the lowest propensities to constitute catalytic residues, Gly that provides flexibility for catalytic sites, and sequence motifs characteristic of certain catalytic reactions. Our features indicate that catalytic residues are on average more conserved than the general population of residues and that highly conserved amino acids characterized by high catalytic propensity are likely to form catalytic sites. We also show that local (with respect to the sequence) hydrophobicity contributes to the prediction.
Subjects
Amino Acids/chemistry, Catalytic Domain, Protein Sequence Analysis/methods, Algorithms, Amino Acid Sequence, Catalysis, Computational Biology/methods, Protein Databases, Hydrophobic and Hydrophilic Interactions, Protein Folding
ABSTRACT
Disease gene prediction is a challenging task that has a variety of applications such as early diagnosis and drug development. Existing machine learning methods suffer from the imbalanced sample issue because the number of known disease genes (positive samples) is much smaller than that of unknown genes, which are typically considered to be negative samples. In addition, most methods have not utilized clinical data from patients with a specific disease to predict disease genes. In this study, we propose a disease gene prediction algorithm (called dgSeq) that combines the protein-protein interaction (PPI) network, clinical RNA-Seq data, and Online Mendelian Inheritance in Man (OMIM) data. Our dgSeq constructs differential networks based on rewiring information calculated from clinical RNA-Seq data. To select balanced sets of non-disease genes (negative samples), a disease-gene network is also constructed from OMIM data. After features are extracted from the PPI networks and differential networks, logistic regression classifiers are trained. Our dgSeq obtains AUC values of 0.88, 0.83, and 0.80 for identifying breast cancer genes, thyroid cancer genes, and Alzheimer's disease genes, respectively, which indicates its superiority to three other competing methods. Both gene set enrichment analysis and predicted results demonstrate that dgSeq can effectively predict new disease genes.
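The AUC values quoted above have a simple probabilistic reading: the AUC is the chance that a randomly chosen disease gene receives a higher score than a randomly chosen non-disease gene, with ties counted as one half. The sketch below computes it by direct pairwise comparison. It is a generic illustration rather than dgSeq code, and the function name and score lists are hypothetical.

```python
def roc_auc(positive_scores, negative_scores):
    """Area under the ROC curve via pairwise comparison of scores.

    positive_scores: classifier scores of known disease genes (hypothetical input).
    negative_scores: classifier scores of selected non-disease genes.
    Runs in O(len(positive_scores) * len(negative_scores)) time.
    """
    if not positive_scores or not negative_scores:
        raise ValueError("both classes need at least one score")
    wins = 0.0
    for p in positive_scores:
        for n in negative_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties contribute half a win
    return wins / (len(positive_scores) * len(negative_scores))
```

Because it depends only on the ranking of positives against negatives, this value does not change when the balance between the two classes shifts, which is one reason AUC is a common choice for problems with the sample imbalance described in the abstract.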
Subjects
Computational Biology/methods, Neoplasms, Protein Interaction Maps/genetics, RNA/genetics, Genetic Databases, Humans, Neoplasms/classification, Neoplasms/genetics, Neoplasms/metabolism, RNA/metabolism, ROC Curve, RNA Sequence Analysis/methods
ABSTRACT
Data normalization is a crucial step in gene expression analysis, as it ensures the validity of downstream analyses. Although many metrics have been designed to evaluate existing normalization methods, different metrics, or the same metric applied to different datasets, yield inconsistent results, particularly for single-cell RNA sequencing (scRNA-seq) data. In the worst cases, a method evaluated as the best by one metric is evaluated as the poorest by another metric, or a method evaluated as the best using one dataset is evaluated as the poorest using another dataset. This raises an open question: what principles should guide the evaluation of normalization methods? In this study, we propose the principle that a normalization method evaluated as the best by one metric should also be evaluated as the best by another metric (the consistency of metrics), and a method evaluated as the best using scRNA-seq data should also be evaluated as the best using bulk RNA-seq data or microarray data (the consistency of datasets). We then designed a new metric named Area Under normalized CV threshold Curve (AUCVC) and applied it, together with another metric, mSCC, to evaluate 14 commonly used normalization methods using both scRNA-seq data and bulk RNA-seq data, satisfying the consistency of metrics and the consistency of datasets. Our findings pave the way for future studies on the normalization of gene expression data and its evaluation. The raw gene expression data, normalization methods, and evaluation metrics used in this study have been included in an R package named NormExpression. NormExpression provides a framework and a fast and simple way for researchers to select the best method for normalizing their gene expression data, based on the evaluation of different methods (particularly some data-driven methods or their own methods) under the principle of the consistency of metrics and the consistency of datasets.
ABSTRACT
In this study, we used pan RNA-seq analysis to reveal the ubiquitous existence of both 5' and 3' end small RNAs (5' and 3' sRNAs). 5' and 3' sRNAs alone can be used to annotate nuclear non-coding and mitochondrial genes at 1-bp resolution and to identify new steady RNAs, which are usually transcribed from functional genes. We thus provide a simple and cost-effective way to annotate nuclear non-coding and mitochondrial genes and to identify new steady RNAs, particularly long non-coding RNAs (lncRNAs). Using 5' and 3' sRNAs, we corrected the annotation of the human mitochondrial genome and report, for the first time, a novel ncRNA named non-coding mitochondrial RNA 1 (ncMT1). We also found that most human tRNA genes have downstream lncRNA genes, such as lncTRS-TGA1-1, and corrected misunderstandings about them in previous studies. Using 5', 3', and intronic sRNAs, we report for the first time that enzymatic double-stranded RNA (dsRNA) cleavage and RNA interference (RNAi) might be involved in the RNA degradation and gene expression regulation of U1 snRNA in humans. This offers a different perspective on the regulation of gene expression in U1 snRNA. We also provide a novel view on cancer and virus-induced diseases, which may lead to the identification of diagnostic or therapeutic targets in the ribonuclease III (RNase III) family and its related pathways. Our findings pave the way toward a rediscovery of dsRNA cleavage and RNAi, challenging classical theories.