ABSTRACT
De novo variants represent a significant cause of neurodevelopmental delay and intellectual disability. A genetic basis can be identified in only half of individuals who have neurodevelopmental disorders (NDDs); this indicates that additional causes need to be elucidated. We compared the frequency of de novo variants in patient-parent trios with (n = 2,030) versus without (n = 2,755) NDDs. We identified de novo variants in TAOK1 (thousand and one [TAO] amino acid kinase 1), which encodes the serine/threonine-protein kinase TAO1, in three individuals with NDDs but not in persons who did not have NDDs. Through further screening and the use of GeneMatcher, five additional individuals with NDDs were found to have de novo variants. All eight variants were absent from gnomAD (Genome Aggregation Database). The variant carriers shared a non-specific phenotype of developmental delay, and six individuals had additional muscular hypotonia. We established a fibroblast line of one mutation carrier, and we demonstrated that reduced mRNA levels of TAOK1 could be increased upon cycloheximide treatment. These results indicate nonsense-mediated mRNA decay. Further, there was neither detectable phosphorylated TAO1 kinase nor phosphorylated tau in these cells, and mitochondrial morphology was altered. Knockdown of the ortholog gene Tao1 (Tao, CG14217) in Drosophila resulted in delayed early development. The majority of the Tao1-knockdown flies did not survive beyond the third instar larval stage. When compared to control flies, Tao1 knockdown flies revealed changed morphology of the ventral nerve cord and the neuromuscular junctions as well as a decreased number of endings (boutons). Furthermore, mitochondria in mutant flies showed altered distribution and decreased size in axons of motor neurons. Thus, we provide compelling evidence that de novo variants in TAOK1 cause NDDs.
Subjects
Drosophila melanogaster/growth & development , Exome/genetics , Mutation , Neurodevelopmental Disorders/etiology , Protein Serine-Threonine Kinases/genetics , Animals , Child , Drosophila melanogaster/genetics , Drosophila melanogaster/metabolism , Female , Heterozygote , Humans , Male , Neurodevelopmental Disorders/pathology , Phenotype , Exome Sequencing
ABSTRACT
PURPOSE: Next-generation sequencing (NGS) is rapidly replacing Sanger sequencing in genetic diagnostics. Sensitivity and specificity of NGS approaches are not well-defined, but can be estimated from applying NGS and Sanger sequencing in parallel. Utilizing this strategy, we aimed at optimizing exome sequencing (ES)-based diagnostics of a clinically diverse patient population. METHODS: Consecutive DNA samples from unrelated patients with suspected genetic disease were exome-sequenced; comparatively nonstringent criteria were applied in variant calling. One thousand forty-eight variants in genes compatible with the clinical diagnosis were followed up by Sanger sequencing. Based on a set of variant-specific features, predictors for true positives and true negatives were developed. RESULTS: Sanger sequencing confirmed 81.9% of ES-derived variants. Calls from the lower end of stringency accounted for the majority of the false positives, but also contained ~5% of the true positives. A predictor incorporating three variant-specific features classified 91.7% of variants with 100% specificity and 99.75% sensitivity. Confirmation status of the remaining variants (8.3%) was not predictable. CONCLUSIONS: Criteria for variant calling in ES-based diagnostics impact on specificity and sensitivity. Confirmatory sequencing for a proportion of variants, therefore, remains a necessity. Our study exemplifies how these variants can be defined on an empirical basis.
Subjects
Exome Sequencing , Exome/genetics , Genetic Diseases, Inborn/genetics , Genetic Diseases, Inborn/diagnosis , Genetic Diseases, Inborn/pathology , Genetic Variation/genetics , High-Throughput Nucleotide Sequencing , Humans , Sequence Analysis, DNA
ABSTRACT
PURPOSE: Skeletal muscle growth and regeneration rely on muscle stem cells, called satellite cells. Specific transcription factors, particularly PAX7, are key regulators of the function of these cells. Knockout of this factor in mice leads to poor postnatal survival; however, the consequences of a lack of PAX7 in humans have not been established. METHODS: Here, we study five individuals with myopathy of variable severity from four unrelated consanguineous couples. Exome sequencing identified pathogenic variants in the PAX7 gene. Clinical examination, laboratory tests, and muscle biopsies were performed to characterize the disease. RESULTS: The disease was characterized by hypotonia, ptosis, muscular atrophy, scoliosis, and mildly dysmorphic facial features. The disease spectrum ranged from mild to severe and appears to be progressive. Muscle biopsies showed the presence of atrophic fibers and fibroadipose tissue replacement, with the absence of myofiber necrosis. A lack of PAX7 expression was associated with satellite cell pool exhaustion; however, the presence of residual myoblasts together with regenerating myofibers suggests that a population of PAX7-independent myogenic cells partially contributes to muscle regeneration. CONCLUSION: These findings show that biallelic variants in the master transcription factor PAX7 cause a new type of myopathy that specifically affects satellite cell survival.
Subjects
Muscular Diseases/genetics , PAX7 Transcription Factor/genetics , Adolescent , Alleles , Child , Child, Preschool , Female , Humans , Male , Muscle Development , Muscle, Skeletal/metabolism , Muscular Diseases/etiology , Myoblasts , PAX7 Transcription Factor/metabolism , Pedigree , Regeneration , Satellite Cells, Skeletal Muscle/metabolism , Transcription Factors/genetics , Exome Sequencing/methods
ABSTRACT
The extracellular matrix (ECM) is a major component of tissues of multicellular organisms. It consists of secreted macromolecules, mainly polysaccharides and glycoproteins. Malfunctions of ECM proteins lead to severe disorders such as Marfan syndrome, osteogenesis imperfecta, numerous chondrodysplasias, and skin diseases. In this work, we report a random forest approach, EcmPred, for the prediction of ECM proteins from protein sequences. EcmPred was trained on a dataset containing 300 ECM and 300 non-ECM proteins and tested on a dataset containing 145 ECM and 4187 non-ECM proteins. EcmPred achieved 83% accuracy on the training and 77% on the test dataset. EcmPred predicted 15 out of 20 experimentally verified ECM proteins. By scanning the entire human proteome, we predicted novel ECM proteins validated with gene ontology and InterPro. The dataset and standalone version of the EcmPred software are available at http://www.inb.uni-luebeck.de/tools-demos/Extracellular_matrix_proteins/EcmPred.
Subjects
Algorithms , Computational Biology/methods , Extracellular Matrix Proteins/metabolism , Artificial Intelligence , Databases, Protein , Humans , Proteome/metabolism , ROC Curve
ABSTRACT
Niemann-Pick type C1 disease (NPC1 [OMIM 257220]) is a rare and severe autosomal recessive disorder, characterized by a multitude of neurovisceral clinical manifestations and a fatal outcome with no effective treatment to date. Aiming to gain insights into the genetic aspects of the disease, clinical, genetic, and biomarker PPCS data from 602 patients referred from 47 countries and diagnosed with NPC1 in our laboratory were analyzed. Patients' clinical data were dissected using Human Phenotype Ontology (HPO) terms, and genotype-phenotype analysis was performed. The median age at diagnosis was 10.6 years (range 0-64.5 years), with 287 unique pathogenic/likely pathogenic (P/LP) variants identified, expanding NPC1 allelic heterogeneity. Importantly, 73 P/LP variants were previously unpublished. The most frequent variants detected were: c.3019C>G, p.(P1007A), c.3104C>T, p.(A1035V), and c.2861C>T, p.(S954L). Loss of function (LoF) variants were significantly associated with earlier age at diagnosis, highly increased biomarker levels, and a visceral phenotype (abnormal abdomen and liver morphology). On the other hand, the variants p.(P1007A) and p.(S954L) were significantly associated with later age at diagnosis (p < 0.001) and mildly elevated biomarker levels (p ≤ 0.002), consistent with the juvenile/adult form of NPC1. In addition, p.(I1061T), p.(S954L), and p.(A1035V) were associated with abnormality of eye movements (vertical supranuclear gaze palsy, p ≤ 0.05). We describe the largest and most heterogeneous cohort of NPC1 patients published to date. Our results suggest that besides its utility in variant classification, the biomarker PPCS might serve to indicate disease severity/progression. In addition, we establish new genotype-phenotype relationships for "frequent" NPC1 variants.
Subjects
Phenotype , Adult , Humans , Infant, Newborn , Infant , Child, Preschool , Child , Adolescent , Young Adult , Middle Aged
ABSTRACT
BACKGROUND: Bioluminescence is a process in which light is emitted by a living organism. Most creatures that emit light are sea creatures, but some insects, plants, fungi, etc., also emit light. The biotechnological application of bioluminescence has become routine and is considered essential for many medical and general technological advances. Identification of bioluminescent proteins is challenging because of their poor sequence similarity. So far, no specific method has been reported to identify bioluminescent proteins from primary sequence. RESULTS: In this paper, we propose a novel predictive method, BLProt, that uses a Support Vector Machine (SVM) and physicochemical properties to predict bioluminescent proteins. BLProt was trained using a dataset consisting of 300 bioluminescent proteins and 300 non-bioluminescent proteins, and evaluated on an independent set of 141 bioluminescent proteins and 18202 non-bioluminescent proteins. To identify the most prominent features, we carried out feature selection with three different filter approaches: ReliefF, infogain, and mRMR. We selected five different feature subsets by decreasing the number of features, and the performance of each feature subset was evaluated. CONCLUSION: BLProt achieves 80% accuracy from training (5-fold cross-validation) and 80.06% accuracy from testing. The performance of BLProt was compared with BLAST and HMM. High prediction accuracy and successful prediction of hypothetical proteins suggest that BLProt can be a useful approach to identify bioluminescent proteins from sequence information, irrespective of their sequence similarity. The BLProt software is available at http://www.inb.uni-luebeck.de/tools-demos/bioluminescent%20protein/BLProt.
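The "infogain" filter named in the abstract ranks features by how much knowing a feature value reduces uncertainty about the class label. The following is a minimal sketch of that idea for discrete features, not the BLProt implementation; the feature names and toy values are hypothetical and chosen only for illustration.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a sequence of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())


def information_gain(feature_values, labels):
    """Entropy reduction obtained by splitting the labels on a discrete feature."""
    total = len(labels)
    groups = {}
    for value, label in zip(feature_values, labels):
        groups.setdefault(value, []).append(label)
    conditional = sum(len(g) / total * entropy(g) for g in groups.values())
    return entropy(labels) - conditional


# Hypothetical toy data: 1 = bioluminescent, 0 = non-bioluminescent.
labels = [1, 1, 1, 1, 0, 0, 0, 0]
features = {
    "high_aromatic_content": [1, 1, 1, 0, 0, 0, 0, 1],  # partly informative
    "long_sequence": [1, 0, 1, 0, 1, 0, 1, 0],          # uninformative
    "perfect_marker": [1, 1, 1, 1, 0, 0, 0, 0],         # fully informative
}

# Rank features from most to least informative.
ranking = sorted(
    features,
    key=lambda name: information_gain(features[name], labels),
    reverse=True,
)
print(ranking)
```

A filter of this kind scores each feature independently of the classifier, which is why it can be applied before training the SVM; continuous physicochemical properties would first need to be discretized.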
Subjects
Luminescent Proteins/chemistry , Software , Support Vector Machine , Animals , Humans , Markov Chains
ABSTRACT
Some creatures living in extremely low temperatures produce special proteins called "antifreeze proteins" (AFPs), which can prevent the cell and body fluids from freezing. AFPs are present in vertebrates, invertebrates, plants, bacteria, fungi, etc. Although AFPs have a common function, they show a high degree of diversity in sequences and structures. Therefore, search methods based on sequence similarity often fail to predict AFPs from sequence databases. In this work, we report a random forest approach, "AFP-Pred", for the prediction of antifreeze proteins from protein sequence. AFP-Pred was trained on a dataset containing 300 AFPs and 300 non-AFPs and tested on a dataset containing 181 AFPs and 9193 non-AFPs. AFP-Pred achieved 81.33% accuracy from training and 83.38% from testing. The performance of AFP-Pred was compared with BLAST and HMM. High prediction accuracy and successful prediction of hypothetical proteins suggest that AFP-Pred can be a useful approach to identify antifreeze proteins from sequence information, irrespective of their sequence similarity.
Subjects
Algorithms , Amino Acid Sequence/genetics , Antifreeze Proteins/analysis , Computational Biology/methods , Proteins/classification , Amino Acids/chemistry , Antifreeze Proteins/genetics , Artificial Intelligence , Chemical Phenomena , Protein Structure, Secondary/genetics , Protein Structure, Tertiary/genetics , Proteins/genetics , ROC Curve
ABSTRACT
Eukaryotic protein secretion generally occurs via the classical secretory pathway that traverses the ER and Golgi apparatus. Secreted proteins usually contain a signal sequence with all the essential information required to target them for secretion. However, some proteins like fibroblast growth factors (FGF-1, FGF-2), interleukins (IL-1 alpha, IL-1 beta), galectins and thioredoxin are exported by an alternative pathway. This is known as leaderless or non-classical secretion and works without a signal sequence. Most computational methods for the identification of secretory proteins use the signal peptide as an indicator and are therefore not able to identify substrates of non-classical secretion. In this work, we report a random forest method, SPRED, to identify secretory proteins from protein sequences irrespective of N-terminal signal peptides, thus also allowing correct classification of non-classical secretory proteins. Training was performed on a dataset containing 600 extracellular proteins and 600 cytoplasmic and/or nuclear proteins. The algorithm was tested on 180 extracellular proteins and 1380 cytoplasmic and/or nuclear proteins. We obtained 85.92% accuracy from training and 82.18% accuracy from testing. Since SPRED does not use N-terminal signals, it can detect non-classical secreted proteins once the secreted proteins that carry an N-terminal signal have been filtered out with SignalP. SPRED predicted 15 out of 19 experimentally verified non-classical secretory proteins. By scanning the entire human proteome we identified 566 protein sequences potentially undergoing non-classical secretion. The dataset and standalone version of the SPRED software are available at http://www.inb.uni-luebeck.de/tools-demos/spred/spred.
Subjects
Artificial Intelligence , Genome, Human , Proteins/metabolism , Proteome , Sequence Analysis, Protein/methods , Animals , Humans , Proteins/chemistry , Proteins/genetics
ABSTRACT
Lipocalins are functionally diverse proteins that are composed of 120-180 amino acid residues. Members of this family have several important biological functions including ligand transport, cryptic coloration, sensory transduction, endonuclease activity, stress response activity in plants, odorant binding, prostaglandin biosynthesis, cellular homeostasis regulation, immunity, and immunotherapy. Identification of lipocalins from protein sequence is challenging because sequence identity is poor and often falls below the twilight zone. So far, no specific method has been reported to identify lipocalins from primary sequence. In this paper, we report a support vector machine (SVM) approach, LipoPred, to predict lipocalins from protein sequence using sequence-derived properties. LipoPred was trained using a dataset consisting of 325 lipocalin proteins and 325 non-lipocalin proteins, and evaluated on an independent set of 140 lipocalin proteins and 21,447 non-lipocalin proteins. LipoPred achieved 88.61% accuracy with 89.26% sensitivity, 85.27% specificity and a Matthews correlation coefficient (MCC) of 0.74. When applied to the test dataset, LipoPred achieved 84.25% accuracy with 88.57% sensitivity, 84.22% specificity and an MCC of 0.16. LipoPred performed better than the PSI-BLAST, HMM and SVM-Prot methods. Out of 218 lipocalins, LipoPred correctly predicted 194 proteins, including 39 lipocalins that are non-homologous to any protein in the SWISSPROT database. This result shows that LipoPred is potentially useful for predicting lipocalin proteins that have no sequence homologs in the sequence databases. Further, successful prediction of nine hypothetical lipocalin proteins and five new members of the lipocalin family shows that LipoPred can be used to identify and annotate new lipocalin proteins from sequence databases. The LipoPred software and dataset are available at http://www3.ntu.edu.sg/home/EPNSugan/index_files/lipopred.htm.
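The drop in MCC from 0.74 to 0.16 despite similar accuracy follows from the class imbalance of the test set (140 positives versus 21,447 negatives). The sketch below reconstructs an approximate confusion matrix from the reported test-set sensitivity and specificity and recomputes accuracy and MCC. The counts are an assumption obtained by rounding; they are not taken from the paper.

```python
import math


def confusion_from_rates(n_pos, n_neg, sensitivity, specificity):
    """Approximate TP, FN, TN, FP by rounding the reported rates (an assumption)."""
    tp = round(n_pos * sensitivity)
    tn = round(n_neg * specificity)
    return tp, n_pos - tp, tn, n_neg - tn


def accuracy(tp, fn, tn, fp):
    """Fraction of all predictions that are correct."""
    return (tp + tn) / (tp + fn + tn + fp)


def mcc(tp, fn, tn, fp):
    """Matthews correlation coefficient; returns 0.0 when the denominator is zero."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0


# Test-set figures from the abstract: 140 lipocalins, 21,447 non-lipocalins,
# 88.57% sensitivity, 84.22% specificity.
tp, fn, tn, fp = confusion_from_rates(140, 21447, 0.8857, 0.8422)
test_accuracy = accuracy(tp, fn, tn, fp)
test_mcc = mcc(tp, fn, tn, fp)

print(f"TP={tp} FN={fn} TN={tn} FP={fp}")
print(f"accuracy={test_accuracy:.4f} MCC={test_mcc:.2f}")
```

Under this reconstruction, roughly 3,400 false positives accompany 124 true positives, so most positive predictions are wrong even though sensitivity and specificity are both high. MCC reflects this; accuracy does not.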
Subjects
Lipocalins/chemistry , Sequence Alignment/methods , Databases, Protein , Humans , Protein Structure, Tertiary , Sequence Alignment/instrumentation , Sequence Homology, Amino Acid
ABSTRACT
Neurometabolic disorders are often inherited and complex disorders that result from abnormalities of enzymes important for development and function of the nervous system. Recently, biallelic mutations in NAXE (APOA1BP) were found in patients with an infantile, lethal, neurometabolic disease. Here, exome sequencing was performed in two affected sisters and their healthy parents. The best candidate, NAXE, was tested for replication in exome sequencing data from 4351 patients with neurodevelopmental disorders. Quantitative RT-PCR, western blot and form factor analysis were performed to assess NAXE expression and protein levels and to analyze mitochondrial morphology in fibroblasts. Vitamin B3 was administered to one patient. Compound heterozygous missense (c.757G>A: p.Gly253Ser) and splicing (c.665-1G>A) variants in NAXE were identified in both affected sisters. In contrast to the previously reported patients with biallelic NAXE variants, our patients showed a milder phenotype with disease onset in early adulthood with psychosis, cognitive impairment, seizures, cerebellar ataxia and spasticity. The symptoms fluctuated. Additional screening of NAXE identified three novel homozygous missense variants (p.Lys245Gln, p.Asp218Asn, p.Ile214Val) in three patients with overlapping phenotype (fluctuating disease course, respiratory insufficiency, movement disorder). Lastly, fibroblasts from patients with the c.665-1G>A splicing variant showed significantly reduced NAXE expression and undetectable NAXE protein levels compared to control fibroblasts. Based on the metabolic pathway, vitamin B3 and coenzyme Q treatment was introduced in one patient in addition to antiepileptic treatment. This combination, together with avoidance of triggers, was associated with continuous motor and cognitive improvement. The NAXE variants identified in this study suggest a loss-of-function mechanism leading to an insufficient NAD(P)HX repair system. Importantly, symptoms of patients with NAXE variants may improve with vitamin B3/coenzyme Q administration.
Subjects
Brain Diseases, Metabolic, Inborn/genetics , Racemases and Epimerases/genetics , Brain Diseases, Metabolic, Inborn/drug therapy , Female , Humans , Male , Mutation, Missense , Neurodevelopmental Disorders/genetics , Niacinamide/therapeutic use , Pedigree , Ubiquinone/analogs & derivatives , Ubiquinone/therapeutic use , Young Adult
ABSTRACT
BACKGROUND: Rare de novo variants represent a significant cause of neurodevelopmental delay and intellectual disability (ID). METHODS: Exome sequencing was performed on 4351 patients with global developmental delay, seizures, microcephaly, macrocephaly, motor delay, delayed speech and language development, or ID according to Human Phenotype Ontology (HPO) terms. All patients had previously undergone whole exome sequencing as part of diagnostic genetic testing with a focus on variants in genes implicated in neurodevelopmental disorders up to January 2017. This resulted in a genetic diagnosis in 1336 of the patients. In this study, we specifically searched for variants in 14 recently implicated novel neurodevelopmental disorder (NDD) genes. RESULTS: We identified 65 rare, protein-changing variants in 11 of these 14 novel candidate genes. Fourteen variants in CDK13, CHD4, KCNQ3, KMT5B, TCF20, and ZBTB18 were scored pathogenic or likely pathogenic. Of note, two of these patients had a previously identified cause of their disease, and thus, multiple molecular diagnoses were made including pathogenic/likely pathogenic variants in FOXG1 and CDK13 or in TMEM237 and KMT5B. CONCLUSIONS: Looking for pathogenic variants in newly identified NDD genes enabled us to provide a molecular diagnosis to 14 patients and their close relatives and caregivers. This underlines the relevance of re-evaluation of existing exome data on a regular basis to improve the diagnostic yield and serve the needs of our patients.
Subjects
Exome Sequencing , Genetic Testing , Neurodevelopmental Disorders/diagnosis , Neurodevelopmental Disorders/genetics , Adolescent , Biological Ontologies , Child , Child, Preschool , Female , Humans , Male , Phenotype
ABSTRACT
Genetic testing for cystic fibrosis and CFTR-related disorders mostly relies on laborious molecular tools that use Sanger sequencing to scan for mutations in the CFTR gene. We have explored a more efficient genetic screening strategy based on next-generation sequencing (NGS) of the CFTR gene. We validated this approach in a cohort of 177 patients with previously known CFTR mutations and polymorphisms. Genomic DNA was amplified using the Ion AmpliSeq™ CFTR panel. The DNA libraries were pooled, barcoded, and sequenced using an Ion Torrent PGM sequencer. The combination of different robust bioinformatics tools allowed us to detect previously known pathogenic mutations and polymorphisms in the 177 samples, without detecting spurious pathogenic calls. In summary, the assay achieves a sensitivity of 94.45% (95% CI: 92% to 96.9%), with a specificity of detecting nonvariant sites from the CFTR reference sequence of 100% (95% CI: 100% to 100%), a positive predictive value of 100% (95% CI: 100% to 100%), and a negative predictive value of 99.99% (95% CI: 99.99% to 100%). In addition, we describe the observed allelic frequencies of 94 unique definitely and likely pathogenic, uncertain, and neutral CFTR variants, some of them not previously annotated in the public databases. Strikingly, a deletion spanning seven exons as well as several more technically challenging variants such as pathogenic poly-thymidine-guanine and poly-thymidine (poly-TG-T) tracts were also detected. Targeted NGS is ready to replace classical molecular methods for genetic testing of the CFTR gene.
ABSTRACT
Prediction of protein structure from its amino acid sequence is still a challenging problem. A complete physicochemical understanding of protein folding is essential for accurate structure prediction. Knowledge of residue solvent accessibility gives useful insights into protein structure prediction and function prediction. In this work, we propose a random forest method, RSARF, to predict residue accessible surface area from protein sequence information. Training and testing were performed using 120 proteins containing 22006 residues. For each residue, the buried or exposed state was computed using five thresholds (0%, 5%, 10%, 25%, and 50%). The prediction accuracies for the 0%, 5%, 10%, 25%, and 50% thresholds are 72.9%, 78.25%, 78.12%, 77.57% and 72.07%, respectively. Further, comparison of RSARF with other methods using a benchmark dataset containing 20 proteins shows that our approach is useful for prediction of residue solvent accessibility from protein sequence without using structural information. The RSARF program, datasets and supplementary data are available at http://caps.ncbs.res.in/download/pugal/RSARF/.
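The two-state labelling described above can be sketched as follows. This is an illustration only, not the RSARF code: it assumes that a residue counts as "exposed" when its relative accessible surface area is strictly greater than the threshold and "buried" otherwise, which is a common convention but is not stated in the abstract. The residue values are hypothetical.

```python
# Thresholds (percent relative accessibility) listed in the abstract.
THRESHOLDS = (0, 5, 10, 25, 50)


def burial_states(relative_asa_percent, thresholds=THRESHOLDS):
    """Label one residue as 'buried' or 'exposed' at every threshold.

    Assumption: exposed means relative ASA strictly above the threshold.
    """
    return {
        t: "exposed" if relative_asa_percent > t else "buried"
        for t in thresholds
    }


def two_state_accuracy(predicted, observed):
    """Fraction of residues whose predicted state matches the observed state."""
    matches = sum(p == o for p, o in zip(predicted, observed))
    return matches / len(observed)


# Hypothetical relative ASA values (percent) for four residues.
observed_asa = [0.0, 7.5, 30.0, 62.0]
predicted_asa = [2.0, 4.0, 28.0, 70.0]

for threshold in THRESHOLDS:
    observed = [burial_states(v)[threshold] for v in observed_asa]
    predicted = [burial_states(v)[threshold] for v in predicted_asa]
    print(threshold, two_state_accuracy(predicted, observed))
```

Because the class balance shifts with the threshold (almost every residue is "exposed" at 0%, far fewer at 50%), per-threshold accuracies are not directly comparable with one another.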
Subjects
Proteins/chemistry , Sequence Analysis, Protein/methods , Software , Algorithms , Amino Acid Sequence , Computational Biology , Computer Simulation , Crystallography, X-Ray , Databases, Protein , Hydrophobic and Hydrophilic Interactions , Molecular Sequence Data , Predictive Value of Tests , Protein Conformation , Protein Folding , Solvents/chemistry
ABSTRACT
3D domain swapping is a protein structural phenomenon that mediates the formation of higher order oligomers in a variety of proteins with different structural and functional properties. 3D domain swapping is associated with a variety of biological functions ranging from oligomerization to pathological conformational diseases. 3D domain swapping is recognized only after structure determination, when the protein is observed in the swapped conformation in the oligomeric state. This limits the study of this important structural phenomenon on a large scale from the growing sequence data. A new machine learning approach, 3dswap-pred, has been developed for the prediction of 3D domain swapping in protein structures from sequence data alone using the Random Forest approach. 3dswap-pred is implemented using a positive sequence dataset derived from literature-based structural curation of 297 structures. A negative sequence dataset is obtained from 462 SCOP domains using a new sequence data mining approach and a set of 126 sequence-derived features. Statistical validation using an independent dataset of 68 positive sequences and 313 negative sequences revealed that 3dswap-pred achieved an accuracy of 63.8%. A webserver is also implemented using the 3dswap-pred Random Forest model. The server is available from the URL: http://caps.ncbs.res.in/3dswap-pred.
Subjects
Proteins/chemistry , Algorithms , Protein Structure, Secondary , Protein Structure, Tertiary
ABSTRACT
X-ray crystallography is the most widely used method for protein 3-dimensional structure determination. Selecting target proteins that can yield high-quality crystals for X-ray crystallography is a challenging task. Prediction of protein crystallization propensity from sequence information is useful for the selection of target proteins for crystallization. Recently, support vector machines have been widely used to solve various biological problems. In this work, we present SVMCRYS, a method that uses a support vector machine to classify protein sequences as 'amenable to crystallization' or 'resistant to crystallization'. SVMCRYS was trained on a dataset containing 728 sequences that gave diffraction-quality crystals and 728 sequences for which work had been stopped before a crystal was obtained. The performance of the SVMCRYS method was compared with other sequence-based crystallization prediction methods such as SECRET, CRYSTALP, OB-Score, ParCrys and XtalPred using three different datasets. SVMCRYS achieved a better prediction rate with higher sensitivity and specificity. Our analysis suggests that SVMCRYS can be used to predict which proteins are amenable to crystallization and which are difficult to crystallize. The SVMCRYS software, dataset and feature set can be obtained from http://www3.ntu.edu.sg/home/EPNSugan/index_files/svmcrys.htm.
Subjects
Algorithms , Amino Acid Sequence , Artificial Intelligence , Crystallography, X-Ray/methods , Proteins/chemistry , Databases, Protein , Nuclear Magnetic Resonance, Biomolecular , Proteins/metabolism , ROC Curve , Reproducibility of Results , Structure-Activity Relationship
ABSTRACT
Knowledge of three-dimensional structure is essential to understand the function of a protein. Although the overall fold is determined by the whole sequence, a small group of residues, often called structural motifs, plays a crucial role in determining the protein fold and its stability. Identification of such structural motifs requires a sufficient number of sequence and structural homologs to define conservation and evolutionary information. Unfortunately, many structures in the protein structure databases have no homologous structures or sequences. In this work, we report an SVM method, SMpred, to identify structural motifs from a single protein structure without using sequence and structural homologs. SMpred was trained and tested using 132 protein domains containing 581 motifs. SMpred achieved 78.79% accuracy with 79.06% sensitivity and 78.53% specificity. The performance of SMpred was evaluated with MegaMotifBase using 188 proteins containing 1161 motifs. Out of 1161 motifs, SMpred correctly identified 1503 structural motifs reported in MegaMotifBase. Further, we showed that SMpred is a useful approach for length-deviant superfamilies and single-member superfamilies. This result suggests the usefulness of our approach for facilitating the identification of structural motifs in protein structures in the absence of sequence and structural homologs. The dataset and executable for the SMpred algorithm are available at http://www3.ntu.edu.sg/home/EPNSugan/index_files/SMpred.htm.
Subjects
Amino Acid Motifs , Databases, Protein , Evolution, Molecular , Protein Conformation , Proteins/chemistry , Software , Amino Acid Sequence , Models, Molecular , Proteins/classification , Proteins/genetics , Sequence Alignment/methods
ABSTRACT
Apoptosis is an essential process for controlling tissue homeostasis by regulating a physiological balance between cell proliferation and cell death. The subcellular locations of proteins performing the cell death are determined by mostly independent cellular mechanisms. The regular bioinformatics tools for predicting the subcellular locations of such apoptotic proteins often fail. This work proposes a model for the sorting of proteins that are involved in apoptosis, allowing us to predict both their subcellular locations and the molecular properties that contribute to them. We report a novel hybrid Genetic Algorithm (GA)/Support Vector Machine (SVM) approach to predict apoptotic protein sequences using 119 sequence-derived properties such as frequency of amino acid groups, secondary structure, and physicochemical properties. GA is used for selecting a near-optimal subset of informative features that is most relevant for the classification. Jackknife cross-validation is applied to test the predictive capability of the proposed method on 317 apoptosis proteins. Our method achieved 85.80% accuracy using all 119 features and 89.91% accuracy for 25 features selected by GA. Our models were examined on a test dataset of 98 apoptosis proteins and obtained an overall accuracy of 90.34%. The results show that the proposed approach is promising; it is able to select small subsets of features and still improve the classification accuracy. Our model can contribute to the understanding of programmed cell death and drug discovery. The software and dataset are available at http://www.inb.uni-luebeck.de/tools-demos/apoptosis/GASVM.
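Jackknife cross-validation, as used above, holds out each protein in turn, trains on all the others, and counts how often the held-out protein is classified correctly. The sketch below illustrates only that evaluation loop. A simple nearest-centroid classifier stands in for the GA/SVM model, and the two-dimensional feature vectors and location labels are hypothetical.

```python
def centroid(rows):
    """Component-wise mean of a list of equal-length feature vectors."""
    return [sum(column) / len(rows) for column in zip(*rows)]


def squared_distance(a, b):
    """Squared Euclidean distance between two feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest_centroid_predict(train_x, train_y, query):
    """Stand-in classifier (not an SVM): pick the class with the closest centroid."""
    centroids = {}
    for label in set(train_y):
        members = [x for x, y in zip(train_x, train_y) if y == label]
        centroids[label] = centroid(members)
    return min(centroids, key=lambda label: squared_distance(centroids[label], query))


def jackknife_accuracy(features, labels):
    """Leave-one-out accuracy: each sample is predicted by a model trained on the rest."""
    correct = 0
    for i in range(len(features)):
        train_x = features[:i] + features[i + 1:]
        train_y = labels[:i] + labels[i + 1:]
        if nearest_centroid_predict(train_x, train_y, features[i]) == labels[i]:
            correct += 1
    return correct / len(features)


# Hypothetical sequence-derived features for six proteins in two locations.
features = [
    [0.10, 0.20], [0.20, 0.10], [0.15, 0.25],
    [0.90, 0.80], [0.80, 0.90], [0.85, 0.75],
]
labels = ["cytoplasmic"] * 3 + ["nuclear"] * 3

print(jackknife_accuracy(features, labels))
```

With 317 proteins this procedure trains 317 models, one per held-out protein. If feature selection is repeated inside each fold, the accuracy estimate is not biased by the held-out sample; the abstract does not say whether the GA selection was run that way.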
Subjects
Apoptosis Regulatory Proteins/chemistry , Algorithms , Artificial Intelligence , Protein Transport
ABSTRACT
3-dimensional domain swapping is a mechanism where two or more protein molecules form higher order oligomers by exchanging identical or similar subunits. Recently, this phenomenon has received much attention in the context of prions and neurodegenerative diseases, due to its role in functional regulation, formation of higher oligomers, protein misfolding, aggregation, etc. While 3-dimensional domain swapping can be detected from three-dimensional structures, it remains a formidable challenge to derive common sequence or structural patterns from proteins involved in swapping. We have developed an SVM-based classifier to predict domain swapping events using a set of features derived from sequence and structural data. The SVM classifier was trained on features derived from 150 proteins reported to be involved in 3D domain swapping and 150 proteins not known to be involved in swapped conformation or related to proteins involved in the swapping phenomenon. The testing was performed using 63 proteins from the positive dataset and 63 proteins from the negative dataset. We obtained 76.33% accuracy from training and 73.81% accuracy from testing. Given the high diversity in the sequence, structure and functions of proteins involved in domain swapping, the availability of such an algorithm to predict swapping events from sequence- and structure-derived features will be an initial step towards identification of more putative proteins that may be involved in swapping or in deposition diseases. Further, the top features emerging in our feature selection method may be analysed further to understand their roles in the mechanism of domain swapping.