ABSTRACT
Since 1992, PredictProtein (https://predictprotein.org) has been a one-stop online resource for protein sequence analysis; its main site is hosted at the Luxembourg Centre for Systems Biomedicine (LCSB) and was queried by over 3,000 users per month in 2020. PredictProtein was the first Internet server for protein predictions. It pioneered combining evolutionary information and machine learning. Given a protein sequence as input, the server outputs multiple sequence alignments, predictions of protein structure in 1D and 2D (secondary structure, solvent accessibility, transmembrane segments, disordered regions, protein flexibility, and disulfide bridges) and predictions of protein function (functional effects of sequence variation or point mutations, Gene Ontology (GO) terms, subcellular localization, and protein-, RNA-, and DNA-binding). PredictProtein's infrastructure has moved to the LCSB, increasing throughput; the use of MMseqs2 sequence search reduced runtime five-fold (apparently without lowering the performance of the prediction methods); user interface elements improved usability; and new prediction methods were added. PredictProtein recently added predictions from deep learning embeddings (GO and secondary structure) and a method for predicting proteins and residues that bind DNA, RNA, or other proteins. PredictProtein.org aspires to provide reliable predictions to computational and experimental biologists alike. All scripts and methods are freely available for offline execution in high-throughput settings.
Subjects
Protein Conformation; Software; Binding Sites; Coronavirus Nucleocapsid Proteins/chemistry; DNA-Binding Proteins/chemistry; Phosphoproteins/chemistry; Protein Structure, Secondary; Proteins/chemistry; Proteins/physiology; RNA-Binding Proteins/chemistry; Sequence Alignment; Sequence Analysis, Protein
ABSTRACT
BACKGROUND: Despite the immense importance of transmembrane proteins (TMPs) for molecular biology and medicine, experimental 3D structures for TMPs remain about 4-5 times underrepresented compared to non-TMPs. Today's top methods such as AlphaFold2 accurately predict 3D structures for many TMPs, but annotating transmembrane regions remains a limiting step for proteome-wide predictions. RESULTS: Here, we present TMbed, a novel method that takes embeddings from protein Language Models (pLMs, here ProtT5) as input to predict one of four classes for each residue: transmembrane helix (TMH), transmembrane strand (TMB), signal peptide, or other. TMbed completes predictions for entire proteomes within hours on a single consumer-grade desktop machine, at performance levels similar to or better than those of methods that use evolutionary information from multiple sequence alignments (MSAs) of protein families. On the per-protein level, TMbed correctly identified 94 ± 8% of the beta barrel TMPs (53 of 57) and 98 ± 1% of the alpha helical TMPs (557 of 571) in a non-redundant data set, at false positive rates well below 1% (erred on 30 of 5654 non-membrane proteins). On the per-segment level, TMbed correctly placed, on average, 9 of 10 transmembrane segments within five residues of the experimental observation. Our method can handle sequences of up to 4200 residues on standard graphics cards used in desktop PCs (e.g., NVIDIA GeForce RTX 3060). CONCLUSIONS: Based on embeddings from pLMs and two novel filters (Gaussian and Viterbi), TMbed predicts alpha helical and beta barrel TMPs at least as accurately as any other method but at lower false positive rates. Given the few false positives and its outstanding speed, TMbed might be ideal to sieve through millions of 3D structures soon to be predicted, e.g., by AlphaFold2.
Subjects
Language; Membrane Proteins; Databases, Protein; Membrane Proteins/chemistry; Protein Conformation, alpha-Helical
ABSTRACT
The emergence of SARS-CoV-2 variants stressed the demand for tools to interpret the effect of single amino acid variants (SAVs) on protein function. While Deep Mutational Scanning (DMS) sets continue to expand our understanding of the mutational landscape of single proteins, the results continue to challenge analyses. Protein Language Models (pLMs) use the latest deep learning (DL) algorithms to leverage growing databases of protein sequences. These methods learn to predict missing or masked amino acids from the context of entire sequence regions. Here, we used pLM representations (embeddings) to predict sequence conservation and SAV effects without multiple sequence alignments (MSAs). Embeddings alone predicted residue conservation almost as accurately from single sequences as ConSeq using MSAs (two-state Matthews Correlation Coefficient (MCC) of 0.596 ± 0.006 for ProtT5 embeddings vs. 0.608 ± 0.006 for ConSeq). Inputting the conservation prediction along with BLOSUM62 substitution scores and pLM mask reconstruction probabilities into a simplistic logistic regression (LR) ensemble for Variant Effect Score Prediction without Alignments (VESPA) predicted SAV effect magnitude without any optimization on DMS data. Comparing predictions for a standard set of 39 DMS experiments to other methods (incl. ESM-1v, DeepSequence, and GEMME) revealed our approach as competitive with the state-of-the-art (SOTA) methods using MSA input. No method outperformed all others, neither consistently nor statistically significantly, independently of the performance measure applied (Spearman and Pearson correlation). Finally, we investigated binary effect predictions on DMS experiments for four human proteins. Overall, embedding-based methods have become competitive with methods relying on MSAs for SAV effect prediction at a fraction of the costs in computing/energy. Our method predicted SAV effects for the entire human proteome (~20k proteins) within 40 min on one Nvidia Quadro RTX 8000. All methods and data sets are freely available for local and online execution through bioembeddings.com, https://github.com/Rostlab/VESPA, and PredictProtein.
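The two-state MCC quoted in this abstract can be computed from a binary confusion matrix. The following is a minimal sketch of that standard formula only; the function name and the example counts are hypothetical and are not taken from the study.

```python
import math

def matthews_corrcoef(tp, tn, fp, fn):
    """Two-state Matthews Correlation Coefficient from confusion-matrix counts."""
    numerator = tp * tn - fp * fn
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # By convention, return 0.0 when any marginal sum is zero.
    return numerator / denominator if denominator else 0.0

# Hypothetical counts for illustration only (not from the paper).
example_mcc = matthews_corrcoef(tp=6, tn=3, fp=1, fn=2)
print(round(example_mcc, 3))
```

An MCC of 1 corresponds to perfect agreement, 0 to chance level, and -1 to complete disagreement, which is why it suits two-state comparisons with unbalanced classes such as conserved vs. non-conserved residues.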
Subjects
COVID-19; SARS-CoV-2; Algorithms; Amino Acids; COVID-19/genetics; Humans; Language; Proteome; SARS-CoV-2/genetics
ABSTRACT
NLSdb is a database collecting nuclear export signals (NES) and nuclear localization signals (NLS) along with experimentally annotated nuclear and non-nuclear proteins. NES and NLS are short sequence motifs related to protein transport out of and into the nucleus. The updated NLSdb now contains 2253 NLS and introduces 398 NES. The potential sets of novel NES and NLS have been generated by a simple 'in silico mutagenesis' protocol. We started with motifs annotated by experiments. In step 1, we increased specificity such that no known non-nuclear protein matched the refined motif. In step 2, we increased the sensitivity trying to match several different families with a motif. We then iterated over steps 1 and 2. The final set of 2253 NLS motifs matched 35% of 8421 experimentally verified nuclear proteins (up from 21% for the previous version) and none of 18 278 non-nuclear proteins. We updated the web interface providing multiple options to search protein sequences for NES and NLS motifs, and to evaluate your own signal sequences. NLSdb can be accessed via Rostlab services at: https://rostlab.org/services/nlsdb/.
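The specificity step described in this abstract (keep a refined motif only if it matches no known non-nuclear protein) can be sketched as a simple filter. This is an illustrative sketch, not the NLSdb protocol itself: motifs are assumed to be expressible as regular expressions, and all sequences, candidate motifs, and function names below are hypothetical.

```python
import re

def motif_stats(motif, nuclear, non_nuclear):
    """Count nuclear and non-nuclear sequences matched by a regex motif."""
    pattern = re.compile(motif)
    hits = sum(1 for seq in nuclear if pattern.search(seq))
    false_hits = sum(1 for seq in non_nuclear if pattern.search(seq))
    return hits, false_hits

def select_motifs(candidates, nuclear, non_nuclear):
    """Keep motifs that match at least one nuclear and no non-nuclear protein."""
    kept = []
    for motif in candidates:
        hits, false_hits = motif_stats(motif, nuclear, non_nuclear)
        if hits > 0 and false_hits == 0:
            kept.append((motif, hits))
    # Rank the surviving motifs by how many nuclear proteins they cover.
    return sorted(kept, key=lambda item: -item[1])

# Hypothetical toy data for illustration only.
nuclear = ["MAPKKKRKVEDP", "MSKRPAATKKAGQAKKKK", "MGSRKRRAL"]
non_nuclear = ["MKTLLLAVAVKKRS", "MAGGKRKLT"]
candidates = ["KKKRK", "KR.A", "K[KR]", "RKRR"]

selected = select_motifs(candidates, nuclear, non_nuclear)
print(selected)
```

In this toy run the overly general candidate `K[KR]` is discarded because it also matches non-nuclear sequences, while the more specific candidates survive and are ranked by coverage.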
Subjects
Active Transport, Cell Nucleus/genetics; Databases, Genetic; Molecular Sequence Annotation; Nuclear Export Signals/genetics; Nuclear Localization Signals/chemistry; User-Computer Interface; Amino Acid Sequence; Animals; Arabidopsis/genetics; Arabidopsis/metabolism; Caenorhabditis elegans/genetics; Caenorhabditis elegans/metabolism; Cell Nucleus/metabolism; Datasets as Topic; Drosophila melanogaster/genetics; Drosophila melanogaster/metabolism; Eukaryotic Cells/metabolism; Humans; Internet; Mice; Nuclear Localization Signals/genetics; Nuclear Localization Signals/metabolism; Oryza/genetics; Oryza/metabolism; Rats; Saccharomyces cerevisiae/genetics; Saccharomyces cerevisiae/metabolism; Schizosaccharomyces/genetics; Schizosaccharomyces/metabolism
ABSTRACT
Motivation: Many applications monitor predictions of a whole range of features for biological datasets, e.g. the fraction of secreted proteins in the human proteome. Results and error estimates are typically derived from publications. Results: Here, we present a simple, alternative approximation that uses performance estimates of methods to error-correct the predicted distributions. This approximation uses the confusion matrix (TP true positives, TN true negatives, FP false positives and FN false negatives) describing the performance of the prediction tool for correction. As proof-of-principle, the correction was applied to a two-class (membrane/not) and to a seven-class (localization) prediction. Availability and implementation: Datasets and a simple JavaScript tool are freely available to all users at http://www.rostlab.org/services/distributions. Supplementary information: Supplementary data are available at Bioinformatics online.
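One common way to use a confusion matrix for such a correction in the two-class case is to invert the relation observed = TPR × true + FPR × (1 − true). The sketch below illustrates that generic estimator under the assumption that the predictor's error rates carry over to the new dataset; it is not necessarily the exact approximation used by the authors, and the function name and numbers are hypothetical.

```python
def corrected_positive_fraction(observed_fraction, tp, fn, fp, tn):
    """Estimate the true positive-class fraction from the predicted fraction.

    tp, fn, fp, tn are the confusion-matrix counts measured when the
    prediction tool was benchmarked.
    """
    tpr = tp / (tp + fn)  # sensitivity of the predictor
    fpr = fp / (fp + tn)  # false positive rate of the predictor
    if tpr <= fpr:
        raise ValueError("predictor must be better than random (TPR > FPR)")
    # Invert: observed = tpr * true + fpr * (1 - true)
    estimate = (observed_fraction - fpr) / (tpr - fpr)
    # Clip to the valid range of a fraction.
    return min(1.0, max(0.0, estimate))

# Hypothetical benchmark counts and observed fraction, for illustration only.
corrected = corrected_positive_fraction(0.25, tp=90, fn=10, fp=50, tn=950)
print(round(corrected, 4))
```

With these made-up numbers, a predicted membrane fraction of 25% shrinks after correction because part of it is explained by false positives. A multi-class version would solve the analogous linear system built from the row-normalized confusion matrix.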
Subjects
Software; Humans; Proteome/analysis
ABSTRACT
BACKGROUND AND PURPOSE: Current prognostic models for soft tissue sarcoma (STS) patients are solely based on staging information. Treatment-related data have not been included to date. Including such information, however, could help to improve these models. MATERIALS AND METHODS: A single-center retrospective cohort of 136 STS patients treated with radiotherapy (RT) was analyzed for patients' characteristics, staging information, and treatment-related data. Therapeutic imaging studies and pathology reports of neoadjuvantly treated patients were analyzed for signs of response. Random forest machine learning-based models were used to predict patients' death and disease progression at 2 years. Pre-treatment and treatment models were compared. RESULTS: The prognostic models achieved high performance. Using treatment features improved the overall performance for all three classification types: prediction of death, and of local and systemic progression (area under the receiver operating characteristic curve (AUC) of 0.87, 0.88, and 0.84, respectively). Overall, RT-related features, such as the planning target volume and total dose, had preeminent importance for prognostic performance. Therapy response features were selected for prediction of disease progression. CONCLUSIONS: A machine learning-based prognostic model combining known prognostic factors with treatment- and response-related information showed high accuracy for individualized risk assessment. This model could be used for adjustments of follow-up procedures.
Subjects
Machine Learning; Proportional Hazards Models; Sarcoma/pathology; Sarcoma/radiotherapy; Adult; Aged; Aged, 80 and over; Cohort Studies; Disease Progression; Female; Humans; Male; Middle Aged; Neoadjuvant Therapy; Neoplasm Staging; Prognosis; Retrospective Studies; Risk Assessment; Sarcoma/mortality; Survival Rate
ABSTRACT
Transmembrane proteins (TMPs) are important drug targets because they are essential for signaling, regulation, and transport. Despite important breakthroughs, experimental structure determination remains challenging for TMPs. Various methods have bridged the gap by predicting transmembrane helices (TMHs), but room for improvement remains. Here, we present TMSEG, a novel method identifying TMPs and accurately predicting their TMHs and their topology. The method combines machine learning with empirical filters. Testing it on a non-redundant dataset of 41 TMPs and 285 soluble proteins, and applying strict performance measures, TMSEG outperformed the state-of-the-art in our hands. TMSEG correctly distinguished helical TMPs from other proteins with a sensitivity of 98 ± 2% and a false positive rate as low as 3 ± 1%. Individual TMHs were predicted with a precision of 87 ± 3% and recall of 84 ± 3%. Furthermore, in 63 ± 6% of helical TMPs the placement of all TMHs and their inside/outside topology was correctly predicted. There are two main features that distinguish TMSEG from other methods. First, the errors in finding all helical TMPs in an organism are significantly reduced. For example, in human this leads to 200 and 1600 fewer misclassifications compared to the second and third best method available, and 4400 fewer mistakes than by a simple hydrophobicity-based method. Second, TMSEG provides an add-on improvement for any existing method to benefit from. Proteins 2016; 84:1706-1716. © 2016 Wiley Periodicals, Inc.
Subjects
Machine Learning; Membrane Proteins/chemistry; Databases, Protein; Datasets as Topic; Humans; Hydrophobic and Hydrophilic Interactions; Protein Conformation, alpha-Helical; Sensitivity and Specificity
ABSTRACT
PredictProtein is a meta-service for sequence analysis that has been predicting structural and functional features of proteins since 1992. Queried with a protein sequence it returns: multiple sequence alignments, predicted aspects of structure (secondary structure, solvent accessibility, transmembrane helices (TMSEG) and strands, coiled-coil regions, disulfide bonds and disordered regions) and function. The service incorporates analysis methods for the identification of functional regions (ConSurf), homology-based inference of Gene Ontology terms (metastudent), comprehensive subcellular localization prediction (LocTree3), protein-protein binding sites (ISIS2), protein-polynucleotide binding sites (SomeNA) and predictions of the effect of point mutations (non-synonymous SNPs) on protein function (SNAP2). Our goal has always been to develop a system optimized to meet the demands of experimentalists not highly experienced in bioinformatics. To this end, the PredictProtein results are presented as both text and a series of intuitive, interactive and visually appealing figures. The web server and sources are available at http://ppopen.rostlab.org.
Subjects
Protein Conformation; Software; Amino Acid Substitution; Binding Sites; Gene Ontology; Internet; Intrinsically Disordered Proteins/chemistry; Membrane Proteins/chemistry; Mutation; Protein Interaction Mapping; Proteins/analysis; Proteins/genetics; Proteins/metabolism; Sequence Alignment; Sequence Analysis, Protein; Sequence Homology, Amino Acid
ABSTRACT
The prediction of protein sub-cellular localization is an important step toward elucidating protein function. For each query protein sequence, LocTree2 applies machine learning (profile kernel SVM) to predict the native sub-cellular localization in 18 classes for eukaryotes, in six for bacteria and in three for archaea. The method outputs a score that reflects the reliability of each prediction. LocTree2 has performed on par with or better than any other state-of-the-art method. Here, we report the availability of LocTree3 as a public web server. The server includes the machine learning-based LocTree2 and improves over it through the addition of homology-based inference. Assessed on sequence-unique data, LocTree3 reached an 18-state accuracy Q18=80±3% for eukaryotes and a six-state accuracy Q6=89±4% for bacteria. The server accepts submissions ranging from single protein sequences to entire proteomes. Response time of the unloaded server is about 90 s for a 300-residue eukaryotic protein and a few hours for an entire eukaryotic proteome not considering the generation of the alignments. For over 1000 entirely sequenced organisms, the predictions are directly available as downloads. The web server is available at http://www.rostlab.org/services/loctree3.
Subjects
Proteins/analysis; Software; Archaeal Proteins/analysis; Artificial Intelligence; Bacterial Proteins/analysis; Internet; Sequence Homology, Amino Acid
ABSTRACT
Experimental structure determination continues to be challenging for membrane proteins. Computational prediction methods are therefore needed and widely used to supplement experimental data. Here, we re-examined the state of the art in transmembrane helix prediction based on a nonredundant dataset with 190 high-resolution structures. Analyzing 12 widely-used and well-known methods using a stringent performance measure, we largely confirmed the expected high level of performance. On the other hand, all methods performed worse for proteins that could not have been used for development. A few results stood out: First, all methods predicted proteins in eukaryotes better than those in bacteria. Second, methods worked less well for proteins with many transmembrane helices. Third, most methods correctly discriminated between soluble and transmembrane proteins. However, several older methods often mistook signal peptides for transmembrane helices. Some newer methods have overcome this shortcoming. In our hands, PolyPhobius and MEMSAT-SVM outperformed other methods.
Subjects
Computational Biology/methods; Membrane Proteins/chemistry; Models, Statistical; Databases, Protein; Humans; Protein Structure, Secondary
ABSTRACT
The availability of accurate and fast artificial intelligence (AI) solutions predicting aspects of proteins is revolutionizing experimental and computational molecular biology. The webserver LambdaPP aspires to supersede PredictProtein, the first internet server making AI protein predictions available in 1992. Given a protein sequence as input, LambdaPP provides easily accessible visualizations of protein 3D structure, along with predictions at the protein level (GeneOntology, subcellular location), and the residue level (binding to metal ions, small molecules, and nucleotides; conservation; intrinsic disorder; secondary structure; alpha-helical and beta-barrel transmembrane segments; signal-peptides; variant effect) in seconds. The structure prediction provided by LambdaPP (leveraging ColabFold and computed in minutes) is based on MMseqs2 multiple sequence alignments. All other feature prediction methods are based on the pLM ProtT5. Queried by a protein sequence, LambdaPP computes protein and residue predictions almost instantly for various phenotypes, including 3D structure and aspects of protein function. LambdaPP is freely available for everyone to use under embed.predictprotein.org; the interactive results for the case study can be found under https://embed.predictprotein.org/o/Q9NZC2. The frontend of LambdaPP can be found on GitHub (github.com/sacdallago/embed.predictprotein.org), and can be freely used and distributed under the academic free use license (AFL-2). For high-throughput applications, all methods can be executed locally via the bio-embeddings (bioembeddings.com) python package, or docker image at ghcr.io/bioembeddings/bio_embeddings, which also includes the backend of LambdaPP.
Subjects
Artificial Intelligence; Proteins; Proteins/chemistry; Amino Acid Sequence; Protein Structure, Secondary; Sequence Alignment; Software
ABSTRACT
The intricate details of how proteins bind to proteins, DNA, and RNA are crucial for the understanding of almost all biological processes. Disease-causing sequence variants often affect binding residues. Here, we described a new, comprehensive system of in silico methods that take only protein sequence as input to predict binding of protein to DNA, RNA, and other proteins. Firstly, we needed to develop several new methods to predict whether or not proteins bind (per-protein prediction). Secondly, we developed independent methods that predict which residues bind (per-residue). Not requiring three-dimensional information, the system can predict the actual binding residue. For the protein-level predictions, the system combined homology-based inference with machine learning, and motif-based profile-kernel approaches with word-based (ProtVec) solutions. This achieved an overall non-exclusive three-state accuracy of 77% ± 1% (± one standard error), corresponding to a 1.8-fold improvement over random (best classification for protein-protein with F1 = 91 ± 0.8%). Standard neural networks for per-residue binding predictions appeared best for DNA-binding (Q2 = 81 ± 0.9%), followed by RNA-binding (Q2 = 80 ± 1%), and worst for protein-protein binding (Q2 = 69 ± 0.8%). The new method, dubbed ProNA2020, is available as code through GitHub (https://github.com/Rostlab/ProNA2020.git) and through PredictProtein (www.predictprotein.org).
Subjects
Computational Biology/methods; DNA/metabolism; Neural Networks, Computer; Proteins/metabolism; RNA/metabolism; Sequence Analysis, Protein/methods; Software; Animals; Binding Sites; DNA/chemistry; Eukaryota/metabolism; Humans; Machine Learning; Nucleic Acid Conformation; Prokaryotic Cells/metabolism; Protein Binding; Protein Conformation; Proteins/chemistry; RNA/chemistry
ABSTRACT
BACKGROUND: For Glioblastoma (GBM), various prognostic nomograms have been proposed. This study aims to evaluate machine learning models to predict patients' overall survival (OS) and progression-free survival (PFS) on the basis of clinical, pathological, semantic MRI-based, and FET-PET/CT-derived information. Finally, the value of adding treatment features was evaluated. METHODS: One hundred and eighty-nine patients were retrospectively analyzed. We assessed clinical, pathological, and treatment information. The VASARI set of semantic imaging features was determined on MRIs. Metabolic information was retained from preoperative FET-PET/CT images. We generated multiple random survival forest prediction models on a patient training set and performed internal validation. Single feature class models were created including "clinical," "pathological," "MRI-based," and "FET-PET/CT-based" models, as well as combinations. Treatment features were combined with all other features. RESULTS: Of all single feature class models, the MRI-based model had the highest prediction performance on the validation set for OS (C-index: 0.61 [95% confidence interval: 0.51-0.72]) and PFS (C-index: 0.61 [0.50-0.72]). The combination of all features did increase performance above all single feature class models up to C-indices of 0.70 (0.59-0.84) and 0.68 (0.57-0.78) for OS and PFS, respectively. Adding treatment information further increased prognostic performance up to C-indices of 0.73 (0.62-0.84) and 0.71 (0.60-0.81) on the validation set for OS and PFS, respectively, allowing significant stratification of patient groups for OS. CONCLUSIONS: MRI-based features were the most relevant feature class for prognostic assessment. Combining clinical, pathological, and imaging information increased predictive power for OS and PFS. A further increase was achieved by adding treatment features.
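The C-indices reported in this abstract measure how often a model ranks patients correctly by risk. The following is a minimal sketch of Harrell's concordance index for right-censored data, assuming that a higher score means higher predicted risk; the function name and the toy cohort are hypothetical and unrelated to the study data.

```python
def concordance_index(times, events, risks):
    """Harrell's C-index for right-censored survival data.

    times:  observed follow-up times
    events: 1 if the event (e.g. death) was observed, 0 if censored
    risks:  predicted risk scores (higher = earlier expected event)
    """
    concordant = 0.0
    comparable = 0
    n = len(times)
    for i in range(n):
        if not events[i]:
            continue  # a censored patient cannot anchor a comparable pair
        for j in range(n):
            if times[i] < times[j]:
                comparable += 1
                if risks[i] > risks[j]:
                    concordant += 1.0
                elif risks[i] == risks[j]:
                    concordant += 0.5  # ties in risk count as half
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable

# Hypothetical toy cohort for illustration only.
times = [5, 10, 15, 20]
events = [1, 1, 0, 1]
risks = [0.9, 0.6, 0.7, 0.2]
c_index = concordance_index(times, events, risks)
print(c_index)
```

A value of 0.5 corresponds to random ranking and 1.0 to perfect ranking, which puts the reported values of about 0.6 to 0.7 in context.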
Subjects
Brain Neoplasms/classification; Glioblastoma/classification; Machine Learning; Models, Theoretical; Adult; Aged; Aged, 80 and over; Brain Neoplasms/diagnostic imaging; Brain Neoplasms/pathology; Brain Neoplasms/radiotherapy; Chemotherapy, Adjuvant; Female; Glioblastoma/diagnostic imaging; Glioblastoma/pathology; Glioblastoma/radiotherapy; Humans; Magnetic Resonance Imaging; Male; Middle Aged; Multimodal Imaging; Positron Emission Tomography Computed Tomography; Prognosis; Survival Analysis; Young Adult
ABSTRACT
PURPOSE: In soft tissue sarcoma (STS) patients, systemic progression and survival remain comparably low despite low local recurrence rates. In this work, we investigated whether quantitative imaging features ("radiomics") of radiotherapy planning CT-scans carry a prognostic value for pre-therapeutic risk assessment. METHODS: CT-scans, tumor grade, and clinical information were collected from three independent retrospective cohorts of 83 (TUM), 87 (UW) and 51 (McGill) STS patients, respectively. After manual segmentation and preprocessing, 1358 radiomic features were extracted. Feature reduction and machine learning modeling for the prediction of grading, overall survival (OS), and distant (DPFS) and local (LPFS) progression-free survival were performed, followed by external validation. RESULTS: Radiomic models were able to differentiate grade 3 from non-grade 3 STS (area under the receiver operating characteristic curve (AUC): 0.64). The radiomic models were able to predict OS (C-index: 0.73), DPFS (C-index: 0.68) and LPFS (C-index: 0.77) in the validation cohort. A combined clinical-radiomics model showed the best prediction for OS (C-index: 0.76). The radiomic scores were significantly associated in univariate and multivariate Cox regression and allowed for significant risk stratification for all three endpoints. CONCLUSION: This is the first report demonstrating a prognostic potential and tumor grading differentiation by CT-based radiomics.
Subjects
Sarcoma/radiotherapy; Tomography, X-Ray Computed/methods; Adult; Female; Humans; Male; Middle Aged; Neoadjuvant Therapy; Neoplasm Grading; Prognosis; Radiometry; Retrospective Studies; Sarcoma/diagnostic imaging; Sarcoma/mortality; Sarcoma/pathology
ABSTRACT
PURPOSE: Noting the fast-growing translation of artificial intelligence (AI) technologies to medical image analysis, this paper emphasizes the future role of the medical physicist in this evolving field. Specific challenges are addressed when implementing big data concepts with high-throughput image data processing like radiomics and machine learning in a radiooncology environment to support clinical decisions. METHODS: Based on the experience of our interdisciplinary radiomics working group, techniques for processing minable data, extracting radiomics features and associating this information with clinical, physical and biological data for the development of prediction models are described. A special emphasis was placed on the potential clinical significance of such an approach. RESULTS: Clinical studies demonstrate the role of radiomics analysis as an additional independent source of information with the potential to influence the radiooncology practice, i.e. to predict patient prognosis, treatment response and underlying genetic changes. Extending the radiomics approach to integrate imaging, clinical, genetic and dosimetric data ('panomics') challenges the medical physicist as a member of the radiooncology team. CONCLUSIONS: The new field of big data processing in radiooncology offers opportunities to support clinical decisions, to improve the prediction of treatment outcome and to stimulate fundamental research on the radiation response of both tumor and normal tissue. The integration of physical data (e.g. treatment planning, dosimetric, image guidance data) demands an involvement of the medical physicist in the radiomics approach of radiooncology. To cope with this challenge, national and international organizations for medical physics should organize more training opportunities in artificial intelligence technologies in radiooncology.
Subjects
Artificial Intelligence; Diagnostic Imaging; Image Processing, Computer-Assisted/methods; Neoplasms/diagnostic imaging; Physics; Humans
ABSTRACT
In the past, short protein-coding genes were often disregarded by genome annotation pipelines. Transcriptome sequencing (RNAseq) signals outside of annotated genes have usually been interpreted to indicate either ncRNA or pervasive transcription. Therefore, in addition to the transcriptome, the translatome (RIBOseq) of the enteric pathogen Escherichia coli O157:H7 strain Sakai was determined at two optimal growth conditions and a severe stress condition combining low temperature and high osmotic pressure. All intergenic open reading frames potentially encoding a protein of ≥ 30 amino acids were investigated with regard to coverage by transcription and translation signals and their translatability expressed by the ribosomal coverage value. This led to discovery of 465 unique, putative novel genes not yet annotated in this E. coli strain, which are evenly distributed over both DNA strands of the genome. For 255 of the novel genes, annotated homologs in other bacteria were found, and a machine-learning algorithm, trained on small protein-coding E. coli genes, predicted that 89% of these translated open reading frames represent bona fide genes. The remaining 210 putative novel genes without annotated homologs were compared to the 255 novel genes with homologs and to 250 short annotated genes of this E. coli strain. All three groups turned out to be similar with respect to their translatability distribution, fractions of differentially regulated genes, secondary structure composition, and the distribution of evolutionary constraint, suggesting that both novel groups represent legitimate genes. However, the machine-learning algorithm only recognized a small fraction of the 210 genes without annotated homologs. It is possible that these genes represent a novel group of genes, which have unusual features dissimilar to the genes of the machine-learning algorithm training set.