Your browser doesn't support javascript.
loading
Mostrar: 20 | 50 | 100
Resultados 1 - 16 de 16
Filtrar
Mais filtros










Base de dados
Intervalo de ano de publicação
1.
Sci Rep ; 13(1): 8049, 2023 05 17.
Artigo em Inglês | MEDLINE | ID: mdl-37198304

RESUMO

Traditionally, cyber-attack detection relies on reactive, assistive techniques, where pattern-matching algorithms help human experts to scan system logs and network traffic for known virus or malware signatures. Recent research has introduced effective Machine Learning (ML) models for cyber-attack detection, promising to automate the task of detecting, tracking and blocking malware and intruders. Much less effort has been devoted to cyber-attack prediction, especially beyond the short-term time scale of hours and days. Approaches that can forecast attacks likely to happen in the longer term are desirable, as this gives defenders more time to develop and share defensive actions and tools. Today, long-term predictions of attack waves are mostly based on the subjective perceptiveness of experienced human experts, which can be impaired by the scarcity of cyber-security expertise. This paper introduces a novel ML-based approach that leverages unstructured big data and logs to forecast the trend of cyber-attacks at a large scale, years in advance. To this end, we put forward a framework that utilises a monthly dataset of major cyber incidents in 36 countries over the past 11 years, with new features extracted from three major categories of big data sources, namely the scientific research literature, news, blogs, and tweets. Our framework not only identifies future attack trends in an automated fashion, but also generates a threat cycle that drills down into five key phases that constitute the life cycle of all 42 known cyber threats.


Assuntos
Algoritmos , Big Data , Humanos , Blogging , Segurança Computacional , Aprendizado de Máquina
2.
Annu Int Conf IEEE Eng Med Biol Soc ; 2020: 5284-5287, 2020 07.
Artigo em Inglês | MEDLINE | ID: mdl-33019176

RESUMO

Each mixture of deficient molecular families of a specific disease induces the disease at a different time frame in the future. Based on this, we propose a novel methodology for personalizing a person's level of future susceptibility to a specific disease by inferring the mixture of his/her molecular families, whose combined deficiencies is likely to induce the disease. We implemented the methodology in a working system called DRIT, which consists of the following components: logic inferencer, information extractor, risk indicator, and interrelationship between molecular families modeler. The information extractor takes advantage of the exponential increase of biomedical literature to extract the common biomarkers that test positive among most patients with a specific disease. The logic inferencer transforms the hierarchical interrelationships between the molecular families of a disease into rule-based specifications. The interrelationship between molecular families modeler models the hierarchical interrelationships between the molecular families, whose biomarkers were extracted by the information extractor. It employs the specification rules and the inference rules for predicate logic to infer as many as possible probable deficient molecular families for a person based on his/her few molecular families, whose biomarkers tested positive by medical screening. The risk indicator outputs a risk indicator value that reflects a person's level of future susceptibility to the disease. We evaluated DRIT by comparing it experimentally with a comparable method. Results revealed marked improvement.


Assuntos
Lógica , Publicações , Biomarcadores , Feminino , Humanos , Masculino , Fatores de Risco
3.
IEEE Trans Cybern ; 50(2): 525-535, 2020 Feb.
Artigo em Inglês | MEDLINE | ID: mdl-30281507

RESUMO

Over the past decade, keystroke-based pattern recognition techniques, as a forensic tool for behavioral biometrics, have gained increasing attention. Although a number of machine learning-based approaches have been proposed, they are limited in terms of their capability to recognize and profile a set of an individual's characteristics. In addition, up to today, their focus was primarily gender and age, which seem to be more appropriate for commercial applications (such as developing commercial software), leaving out from research other characteristics, such as the educational level. Educational level is an acquired user characteristic, which can improve targeted advertising, as well as provide valuable information in a digital forensic investigation, when it is known. In this context, this paper proposes a novel machine learning model, the randomized radial basis function network, which recognizes and profiles the educational level of an individual who stands behind the keyboard. The performance of the proposed model is evaluated by using the empirical data obtained by recording volunteers' keystrokes during their daily usage of a computer. Its performance is also compared with other well-referenced machine learning models using our keystroke dynamic datasets. Although the proposed model achieves high accuracy in educational level prediction of an unknown user, it suffers from high computational cost. For this reason, we examine ways to reduce the time that is needed to build our model, including the use of a novel data condensation method, and discuss the tradeoff between an accurate and a fast prediction. To the best of our knowledge, this is the first model in the literature that predicts the educational level of an individual based on the keystroke dynamics information only.

5.
BMC Bioinformatics ; 17: 34, 2016 Jan 15.
Artigo em Inglês | MEDLINE | ID: mdl-26767846

RESUMO

BACKGROUND: All proteins associate with other molecules. These associated molecules are highly predictive of the potential functions of proteins. The association of a protein and a molecule can be determined from their co-occurrences in biomedical abstracts. Extensive semantically related co-occurrences of a protein's name and a molecule's name in the sentences of biomedical abstracts can be considered as indicative of the association between the protein and the molecule. Dependency parsers extract textual relations from a text by determining the grammatical relations between words in a sentence. They can be used for determining the textual relations between proteins and molecules. Despite their success, they may extract textual relations with low precision. This is because they do not consider the semantic relationships between terms in a sentence (i.e., they consider only the structural relationships between the terms). Moreover, they may not be well suited for complex sentences and for long-distance textual relations. RESULTS: We introduce an information extraction system called PPFBM that predicts the functions of unannotated proteins from the molecules that associate with these proteins. PPFBM represents each protein by the other molecules that associate with it in the abstracts referenced in the protein's entries in reliable biological databases. It automatically extracts each co-occurrence of a protein-molecule pair that represents semantic relationship between the pair. Towards this, we present novel semantic rules that identify the semantic relationship between each co-occurrence of a protein-molecule pair using the syntactic structures of sentences and linguistics theories. PPFBM determines the functions of an un-annotated protein p as follows. First, it determines the set S r of annotated proteins that is semantically similar to p by matching the molecules representing p and the annotated proteins. Then, it assigns p the functional category FC if the significance of the frequency of occurrences of S r in abstracts associated with proteins annotated with FC is statistically significantly different than the significance of the frequency of occurrences of S r in abstracts associated with proteins annotated with all other functional categories. We evaluated the quality of PPFBM by comparing it experimentally with two other systems. Results showed marked improvement. CONCLUSIONS: The experimental results demonstrated that PPFBM outperforms other systems that predict protein function from the textual information found within biomedical abstracts. This is because these system do not consider the semantic relationships between terms in a sentence (i.e., they consider only the structural relationships between the terms). PPFBM's performance over these system increases steadily as the number of training protein increases. That is, PPFBM's prediction performance becomes more accurate constantly, as the size of training proteins gets larger. This is because every time a new set of test proteins is added to the current set of training proteins. A demo of PPFBM that annotates each input Yeast protein (SGD (Saccharomyces Genome Database). Available at: http://www.yeastgenome.org/download-data/curation) with the functions of Gene Ontology terms is available at: (see Appendix for more details about the demo) http://ecesrvr.kustar.ac.ae:8080/PPFBM/.


Assuntos
Armazenamento e Recuperação da Informação/métodos , Anotação de Sequência Molecular , Complexos Multiproteicos/metabolismo , Software , Bases de Dados Factuais , Genoma Fúngico , Proteínas/metabolismo , Proteínas/fisiologia , PubMed , Saccharomyces/genética , Saccharomyces/metabolismo , Semântica
6.
IEEE Trans Cybern ; 46(8): 1796-806, 2016 08.
Artigo em Inglês | MEDLINE | ID: mdl-26540724

RESUMO

Botnets, which consist of remotely controlled compromised machines called bots, provide a distributed platform for several threats against cyber world entities and enterprises. Intrusion detection system (IDS) provides an efficient countermeasure against botnets. It continually monitors and analyzes network traffic for potential vulnerabilities and possible existence of active attacks. A payload-inspection-based IDS (PI-IDS) identifies active intrusion attempts by inspecting transmission control protocol and user datagram protocol packet's payload and comparing it with previously seen attacks signatures. However, the PI-IDS abilities to detect intrusions might be incapacitated by packet encryption. Traffic-based IDS (T-IDS) alleviates the shortcomings of PI-IDS, as it does not inspect packet payload; however, it analyzes packet header to identify intrusions. As the network's traffic grows rapidly, not only the detection-rate is critical, but also the efficiency and the scalability of IDS become more significant. In this paper, we propose a state-of-the-art T-IDS built on a novel randomized data partitioned learning model (RDPLM), relying on a compact network feature set and feature selection techniques, simplified subspacing and a multiple randomized meta-learning technique. The proposed model has achieved 99.984% accuracy and 21.38 s training time on a well-known benchmark botnet dataset. Experiment results demonstrate that the proposed methodology outperforms other well-known machine-learning models used in the same detection task, namely, sequential minimal optimization, deep neural network, C4.5, reduced error pruning tree, and randomTree.

7.
Artigo em Inglês | MEDLINE | ID: mdl-26357314

RESUMO

Proline residues are common source of kinetic complications during folding. The X-Pro peptide bond is the only peptide bond for which the stability of the cis and trans conformations is comparable. The cis-trans isomerization (CTI) of X-Pro peptide bonds is a widely recognized rate-limiting factor, which can not only induces additional slow phases in protein folding but also modifies the millisecond and sub-millisecond dynamics of the protein. An accurate computational prediction of proline CTI is of great importance for the understanding of protein folding, splicing, cell signaling, and transmembrane active transport in both the human body and animals. In our earlier work, we successfully developed a biophysically motivated proline CTI predictor utilizing a novel tree-based consensus model with a powerful metalearning technique and achieved 86.58 percent Q2 accuracy and 0.74 Mcc, which is a better result than the results (70-73 percent Q2 accuracies) reported in the literature on the well-referenced benchmark dataset. In this paper, we describe experiments with novel randomized subspace learning and bootstrap seeding techniques as an extension to our earlier work, the consensus models as well as entropy-based learning methods, to obtain better accuracy through a precise and robust learning scheme for proline CTI prediction.


Assuntos
Biologia Computacional/métodos , Aprendizado de Máquina , Prolina/química , Algoritmos , Isomerismo , Modelos Moleculares
8.
Artigo em Inglês | MEDLINE | ID: mdl-26357323

RESUMO

We propose a classifier system called iPFPi that predicts the functions of un-annotated proteins. iPFPi assigns an un-annotated protein P the functions of GO annotation terms that are semantically similar to P. An un-annotated protein P and a GO annotation term T are represented by their characteristics. The characteristics of P are GO terms found within the abstracts of biomedical literature associated with P. The characteristics of Tare GO terms found within the abstracts of biomedical literature associated with the proteins annotated with the function of T. Let F and F/ be the important (dominant) sets of characteristic terms representing T and P, respectively. iPFPi would annotate P with the function of T, if F and F/ are semantically similar. We constructed a novel semantic similarity measure that takes into consideration several factors, such as the dominance degree of each characteristic term t in set F based on its score, which is a value that reflects the dominance status of t relative to other characteristic terms, using pairwise beats and looses procedure. Every time a protein P is annotated with the function of T, iPFPi updates and optimizes the current scores of the characteristic terms for T based on the weights of the characteristic terms for P. Set F will be updated accordingly. Thus, the accuracy of predicting the function of T as the function of subsequent proteins improves. This prediction accuracy keeps improving over time iteratively through the cumulative weights of the characteristic terms representing proteins that are successively annotated with the function of T. We evaluated the quality of iPFPi by comparing it experimentally with two recent protein function prediction systems. Results showed marked improvement.


Assuntos
Biologia Computacional/métodos , Bases de Dados de Proteínas , Proteínas/classificação , Proteínas/metabolismo , Anotação de Sequência Molecular , Proteínas/química , Reprodutibilidade dos Testes , Semântica
9.
Artigo em Inglês | MEDLINE | ID: mdl-26736991

RESUMO

We propose a classifier system called PFPBT that predicts the functions of un-annotated proteins. PFPBT assigns an un-annotated protein p the functional category of annotated proteins that are semantically similar to p. Each protein p is represented by a vector of weights. Each weight reflects the significance of a molecule m in the biomedical abstracts associated with p. That is, each weight quantifies the likelihood of the association between m and p. This is because all proteins bind to other molecules, which are highly predictive of the functions of the proteins. Let S be the set of proteins that is semantically similar to an un-annotated protein p. p is annotated with the functional category f, if its occurrence probability in abstracts associated with S whose functional category is f is statistically significantly different than its occurrences in abstracts associated with S that belong to all other functional categories. PFPBT automatically extracts each co-occurrence of a protein-molecule pair that represents semantic relationship between the pair. We present novel semantic rules based on the syntactic structures of sentences for identifying the semantic relationships between each co-occurrence of a protein-molecule pair in a sentence. We evaluated PFPBT by comparing it experimentally with two systems. Results showed improvement.


Assuntos
Algoritmos , Mineração de Dados , Proteínas/metabolismo , Ontologia Genética , Anotação de Sequência Molecular , Proteínas/química , Semântica
10.
BMC Bioinformatics ; 15 Suppl 16: S8, 2014.
Artigo em Inglês | MEDLINE | ID: mdl-25521329

RESUMO

BACKGROUND: Protein chains are generally long and consist of multiple domains. Domains are distinct structural units of a protein that can evolve and function independently. The accurate prediction of protein domain linkers and boundaries is often regarded as the initial step of protein tertiary structure and function predictions. Such information not only enhances protein-targeted drug development but also reduces the experimental cost of protein analysis by allowing researchers to work on a set of smaller and independent units. In this study, we propose a novel and accurate domain-linker prediction approach based on protein primary structure information only. We utilize a nature-inspired machine-learning model called Random Forest along with a novel domain-linker profile that contains physiochemical and domain-linker information of amino acid sequences. RESULTS: The proposed approach was tested on two well-known benchmark protein datasets and achieved 68% sensitivity and 99% precision, which is better than any existing protein domain-linker predictor. Without applying any data balancing technique such as class weighting and data re-sampling, the proposed approach is able to accurately classify inter-domain linkers from highly imbalanced datasets. CONCLUSION: Our experimental results prove that the proposed approach is useful for domain-linker identification in highly imbalanced single- and multi-domain proteins.


Assuntos
Algoritmos , Aminoácidos/química , Modelos Estatísticos , Proteínas/química , Análise de Sequência de Proteína/métodos , Sequência de Aminoácidos , Fenômenos Químicos , Conjuntos de Dados como Assunto , Humanos , Interações Hidrofóbicas e Hidrofílicas , Dados de Sequência Molecular , Homologia de Sequência de Aminoácidos
11.
IEEE Trans Cybern ; 44(3): 445-55, 2014 Mar.
Artigo em Inglês | MEDLINE | ID: mdl-24108722

RESUMO

Data sampling is a widely used technique in a broad range of machine learning problems. Traditional sampling approaches generally rely on random resampling from a given dataset. However, these approaches do not take into consideration additional information, such as sample quality and usefulness. We recently proposed a data sampling technique, called sample subset optimization (SSO). The SSO technique relies on a cross-validation procedure for identifying and selecting the most useful samples as subsets. In this paper, we describe the application of SSO techniques to imbalanced and ensemble learning problems, respectively. For imbalanced learning, the SSO technique is employed as an under-sampling technique for identifying a subset of highly discriminative samples in the majority class. In ensemble learning, the SSO technique is utilized as a generic ensemble technique where multiple optimized subsets of samples from each class are selected for building an ensemble classifier. We demonstrate the utilities and advantages of the proposed techniques on a variety of bioinformatics applications where class imbalance, small sample size, and noisy data are prevalent.


Assuntos
Algoritmos , Biologia Computacional/métodos , Interpretação Estatística de Dados , Aprendizado de Máquina , Modelos Estatísticos , Reconhecimento Automatizado de Padrão/métodos , Tamanho da Amostra , Simulação por Computador
12.
Artigo em Inglês | MEDLINE | ID: mdl-26355504

RESUMO

Proline cis-trans isomerization (CTI) plays a key role in the rate-determining steps of protein folding. Accurate prediction of proline CTI is of great importance for the understanding of protein folding, splicing, cell signaling, and transmembrane active transport in both the human body and animals. Our goal is to develop a state-of-the-art proline CTI predictor based on a biophysically motivated intelligent consensus modeling through the use of sequence information only (i.e., position specific scores generated by PSI-BLAST). The current computational proline CTI predictors reach about 70-73 percent Q2 accuracies and about 0.40 Matthew correlation coefficient (Mcc) through the use of sequence-based evolutionary information as well as predicted protein secondary structure information. However, our approach that utilizes a novel decision tree-based consensus model with a powerful randomized-metal earning technique has achieved 86.58 percent Q2 accuracy and 0.74 Mcc, on the same proline CTI data set, which is a better result than those of any existing computational proline CTI predictors reported in the literature.


Assuntos
Biologia Computacional/métodos , Aprendizado de Máquina , Prolina/química , Análise de Sequência de Proteína/métodos , Árvores de Decisões , Isomerismo , Prolina/metabolismo , Dobramento de Proteína , Reprodutibilidade dos Testes
13.
BMC Genomics ; 11 Suppl 4: S22, 2010 Dec 02.
Artigo em Inglês | MEDLINE | ID: mdl-21143806

RESUMO

Changes to the glycosylation profile on HIV gp120 can influence viral pathogenesis and alter AIDS disease progression. The characterization of glycosylation differences at the sequence level is inadequate as the placement of carbohydrates is structurally complex. However, no structural framework is available to date for the study of HIV disease progression. In this study, we propose a novel machine-learning based framework for the prediction of AIDS disease progression in three stages (RP, SP, and LTNP) using the HIV structural gp120 profile. This new intelligent framework proves to be accurate and provides an important benchmark for predicting AIDS disease progression computationally. The model is trained using a novel HIV gp120 glycosylation structural profile to detect possible stages of AIDS disease progression for the target sequences of HIV+ individuals. The performance of the proposed model was compared to seven existing different machine-learning models on newly proposed gp120-Benchmark_1 dataset in terms of error-rate (MSE), accuracy (CCI), stability (STD), and complexity (TBM). The novel framework showed better predictive performance with 67.82% CCI, 30.21 MSE, 0.8 STD, and 2.62 TBM on the three stages of AIDS disease progression of 50 HIV+ individuals. This framework is an invaluable bioinformatics tool that will be useful to the clinical assessment of viral pathogenesis.


Assuntos
Síndrome da Imunodeficiência Adquirida/diagnóstico , Biologia Computacional , Proteína gp120 do Envelope de HIV , Software , Inteligência Artificial , Benchmarking , Estudos de Coortes , Progressão da Doença , Glicosilação , Proteína gp120 do Envelope de HIV/genética , Proteína gp120 do Envelope de HIV/metabolismo , Soropositividade para HIV , Humanos , Modelos Biológicos , Valor Preditivo dos Testes
14.
BMC Genomics ; 10 Suppl 3: S21, 2009 Dec 03.
Artigo em Inglês | MEDLINE | ID: mdl-19958485

RESUMO

BACKGROUND: In this paper, we introduce a novel inter-range interaction integrated approach for protein domain boundary prediction. It involves (1) the design of modular kernel algorithm, which is able to effectively exploit the information of non-local interactions in amino acids, and (2) the development of a novel profile that can provide suitable information to the algorithm. One of the key features of this profiling technique is the use of multiple structural alignments of remote homologues to create an extended sequence profile and combines the structural information with suitable chemical information that plays an important role in protein stability. This profile can capture the sequence characteristics of an entire structural superfamily and extend a range of profiles generated from sequence similarity alone. RESULTS: Our novel profile that combines homology information with hydrophobicity from SARAH1 scale was successful in providing more structural and chemical information. In addition, the modular approach adopted in our algorithm proved to be effective in capturing information from non-local interactions. Our approach achieved 82.1%, 50.9% and 31.5% accuracies for one-domain, two-domain, and three- and more domain proteins respectively. CONCLUSION: The experimental results in this study are encouraging, however, more work is need to extend it to a broader range of applications. We are currently developing a novel interactive (human in the loop) profiling that can provide information from more distantly related homology. This approach will further enhance the current study.


Assuntos
Biologia Computacional , Estrutura Terciária de Proteína , Proteínas/análise , Algoritmos , Humanos , Interações Hidrofóbicas e Hidrofílicas , Estrutura Secundária de Proteína , Proteínas/química , Homologia Estrutural de Proteína
15.
BMC Bioinformatics ; 9: 272, 2008 Jun 10.
Artigo em Inglês | MEDLINE | ID: mdl-18541042

RESUMO

BACKGROUND: Post-translational modifications have a substantial influence on the structure and functions of protein. Post-translational phosphorylation is one of the most common modification that occur in intracellular proteins. Accurate prediction of protein phosphorylation sites is of great importance for the understanding of diverse cellular signalling processes in both the human body and in animals. In this study, we propose a new machine learning based protein phosphorylation site predictor, SiteSeek. SiteSeek is trained using a novel compact evolutionary and hydrophobicity profile to detect possible protein phosphorylation sites for a target sequence. The newly proposed method proves to be more accurate and exhibits a much stable predictive performance than currently existing phosphorylation site predictors. RESULTS: The performance of the proposed model was compared to nine existing different machine learning models and four widely known phosphorylation site predictors with the newly proposed PS-Benchmark_1 dataset to contrast their accuracy, sensitivity, specificity and correlation coefficient. SiteSeek showed better predictive performance with 86.6% accuracy, 83.8% sensitivity, 92.5% specificity and 0.77 correlation-coefficient on the four main kinase families (CDK, CK2, PKA, and PKC). CONCLUSION: Our newly proposed methods used in SiteSeek were shown to be useful for the identification of protein phosphorylation sites as it performed much better than widely known predictors on the newly built PS-Benchmark_1 dataset.


Assuntos
Algoritmos , Proteínas/química , Proteínas/metabolismo , Processamento Pós-Transcricional do RNA/fisiologia , Análise de Sequência de Proteína/métodos , Sequência de Aminoácidos , Sítios de Ligação , Dados de Sequência Molecular , Fosforilação , Ligação Proteica
16.
BMC Bioinformatics ; 9 Suppl 1: S12, 2008.
Artigo em Inglês | MEDLINE | ID: mdl-18315843

RESUMO

BACKGROUND: Protein domains present some of the most useful information that can be used to understand protein structure and functions. Recent research on protein domain boundary prediction has been mainly based on widely known machine learning techniques, such as Artificial Neural Networks and Support Vector Machines. In this study, we propose a new machine learning model (IGRN) that can achieve accurate and reliable classification, with significantly reduced computations. The IGRN was trained using a PSSM (Position Specific Scoring Matrix), secondary structure, solvent accessibility information and inter-domain linker index to detect possible domain boundaries for a target sequence. RESULTS: The proposed model achieved average prediction accuracy of 67% on the Benchmark_2 dataset for domain boundary identification in multi-domains proteins and showed superior predictive performance and generalisation ability among the most widely used neural network models. With the CASP7 benchmark dataset, it also demonstrated comparable performance to existing domain boundary predictors such as DOMpro, DomPred, DomSSEA, DomCut and DomainDiscovery with 70.10% prediction accuracy. CONCLUSION: The performance of proposed model has been compared favourably to the performance of other existing machine learning based methods as well as widely known domain boundary predictors on two benchmark datasets and excels in the identification of domain boundaries in terms of model bias, generalisation and computational requirements.


Assuntos
Algoritmos , Redes Neurais de Computação , Reconhecimento Automatizado de Padrão/métodos , Proteínas/química , Análise de Sequência de Proteína/métodos , Sequência de Aminoácidos , Dados de Sequência Molecular , Estrutura Terciária de Proteína , Análise de Regressão
SELEÇÃO DE REFERÊNCIAS
DETALHE DA PESQUISA
...