1.
J Theor Biol ; 423: 63-70, 2017 06 21.
Article in English | MEDLINE | ID: mdl-28454901

ABSTRACT

The integrase catalytic domain (ICD) is essential to the retroviral integration reaction, which incorporates newly synthesized viral DNA into the DNA of infected cells. Because the ICD is crucial for retroviral replication and host cells have no equivalent of integrase, the ICD is a promising drug target for therapeutic intervention. However, the ICDs annotated in the UniProtKB database are still too few to give a good understanding of their statistical characteristics. This work therefore puts forward a computational ICD model to annotate these domains in retroviruses. The proposed model discovered 11,660 new putative ICDs after scanning sequences without ICD annotations. To increase confidence in the ICD predictions, the model was tested under different cross-validation methods, compared with other database search tools, and verified on independent datasets. Furthermore, an evolutionary analysis of the annotated ICDs of retroviruses revealed a tight connection between the ICD and retroviral classification. All datasets used in this paper and the software tool for the model are available for free download at https://sourceforge.net/projects/icdtool/files/?source=navbar.


Subjects
Catalytic Domain; Computational Biology; Evolution, Molecular; Integrases/chemistry; Retroviridae/classification; Sequence Analysis, Protein; Computer Simulation; Databases, Protein; Molecular Sequence Annotation; Software
2.
J Theor Biol ; 415: 84-89, 2017 02 21.
Article in English | MEDLINE | ID: mdl-27908705

ABSTRACT

Regulatory single nucleotide polymorphisms (rSNPs), a kind of functional noncoding genetic variant, can affect gene expression in a regulatory way and are thought to be associated with increased susceptibility to complex diseases. Here, a novel computational approach to identifying potential rSNPs is presented. Most other rSNP-finding methods are based on the hypothesis that SNPs causing large allele-specific changes in transcription factor binding affinities are more likely to play regulatory roles. In contrast, we use a set of documented, experimentally verified rSNPs and nonfunctional background SNPs to train classifiers, so that the discriminating features are found. To characterize variants, an extensive range of characteristics, such as sequence context, DNA structure and evolutionary conservation, is analyzed. A support vector machine is adopted to build the classifier model, together with an ensemble method to deal with unbalanced data. The 10-fold cross-validation results show that our method achieves a sensitivity of ~78% and a specificity of ~82%. Furthermore, our method performs better in handling false positives than some other algorithms based on the aforementioned hypothesis. The original data and the MATLAB source code are available at https://sourceforge.net/projects/rsnppredict/.


Subjects
Computer Simulation; Gene Expression Regulation; Genome, Human; Polymorphism, Single Nucleotide/genetics; Algorithms; Computational Biology/methods; Humans; Methods; Sensitivity and Specificity; Supervised Machine Learning
3.
Sensors (Basel) ; 16(5)2016 05 13.
Article in English | MEDLINE | ID: mdl-27187402

ABSTRACT

Beam pumping units are widely used in the oil production industry, but the energy efficiency of this artificial lift machinery is generally low, especially for the low-production well and high-production well in the later stage. There are a number of ways for energy savings in pumping units, with the periodic adjustment of stroke speed and rectification of balance deviation being two important methods. In the paper, an energy saving system for a beam pumping unit (ESS-BPU) based on the Internet of Things (IoT) was proposed. A total of four types of sensors, including load sensor, angle sensor, voltage sensor, and current sensor, were used to detect the operating conditions of the pumping unit. Data from these sensors was fed into a controller installed in an oilfield to adjust the stroke speed automatically and estimate the degree of balance in real-time. Additionally, remote supervision could be fulfilled using a browser on a computer or smartphone. Furthermore, the data from a practical application was recorded and analyzed, and it can be seen that ESS-BPU is helpful in reducing energy loss caused by unnecessarily high stroke speed and a poor degree of balance.

4.
Sensors (Basel) ; 16(9)2016 Aug 25.
Article in English | MEDLINE | ID: mdl-27571078

ABSTRACT

Manual surface defect detection and dimension measurement of automotive bevel gears is costly, inefficient, slow and inaccurate. To solve these problems, a synthetic bevel gear quality inspection system based on multi-camera vision technology is developed. The system can detect surface defects and measure gear dimensions simultaneously. Three efficient algorithms, named Neighborhood Average Difference (NAD), Circle Approximation Method (CAM) and Fast Rotation-Position (FRP), are proposed. The system can detect knock damage, cracks, scratches, dents, gibbosity, repeated cutting of the spline, etc. The smallest detectable defect is 0.4 mm × 0.4 mm and the precision of dimension measurement is about 40-50 µm. One inspection process takes no more than 1.3 s. Both precision and speed meet the requirements of real-time online inspection in bevel gear production.

5.
Sensors (Basel) ; 15(4): 9000-21, 2015 Apr 16.
Article in English | MEDLINE | ID: mdl-25894940

ABSTRACT

With the continuing growth of highway construction and vehicle use all over the world, detecting highway vehicle traffic rule violations (TRVs) has become increasingly important for avoiding traffic accidents and injuries in intelligent transportation systems (ITS) and vehicular ad hoc networks (VANETs). Since very few works have addressed the TRV detection problem using moving-vehicle measurements and surveillance devices, this paper develops a novel parallel ultrasonic sensor system that can identify the TRV behavior of a host vehicle in real time. A two-dimensional state method is then proposed, which uses the spatial state and time-sequential states from the data of two parallel ultrasonic sensors to detect and count highway vehicle violations. Finally, the theoretical TRV identification probability is analyzed, and experiments conducted on different highway segments at various driving speeds indicate that the identification accuracy of the proposed method can reach about 90.97%.

6.
J Theor Biol ; 360: 78-82, 2014 Nov 07.
Article in English | MEDLINE | ID: mdl-25008418

ABSTRACT

The immunosuppressive domain (ISD) is a conserved region of the transmembrane (TM) proteins encoded by the envelope gene (env) of retroviruses. In vitro and in vivo, a synthetic peptide (CKS-17) that shows homology to the ISD inhibits immune function. Evidence has shown that the ISD suppresses lymphocyte proliferation and allows escape from immune effectors of the innate and adaptive arms of the mouse immune system. Previously, we developed a tool, ISDTool 1.0, to identify the ISD of human endogenous retrovirus (HERV). However, several other important retroviruses exist, and no method has so far been devoted to predicting their ISDs. In this paper, a computational model is proposed to identify the ISDs of six typical retroviruses from three species. The model combines the minimum Redundancy Maximum Relevance (mRMR) feature selection criterion with a weighted extreme learning machine (WELM) and achieves high identification accuracies of 98.95%, 96.34% and 96.87% using self-consistency, 5-fold and 10-fold cross-validation, respectively. A software tool named ISDTool 2.0 has been developed to facilitate the application of the model, and a large number of new putative ISDs of the six retroviruses were predicted. In addition, ISD motifs in these retroviruses were analyzed and their evolutionary relationship was discussed. The datasets and software used in the paper are available at http://sourceforge.net/projects/isdtool/files/ISDTool-2.0/.


Subjects
Endogenous Retroviruses/genetics; Endogenous Retroviruses/immunology; Immune Tolerance/immunology; Models, Immunological; Software; Viral Envelope Proteins/genetics; Animals; Artificial Intelligence; Humans; Mice; Protein Structure, Tertiary
7.
Sensors (Basel) ; 14(8): 13794-814, 2014 Jul 30.
Article in English | MEDLINE | ID: mdl-25196106

ABSTRACT

The so-called Internet of Things (IoT) has attracted increasing attention in the field of computer and information science. In this paper, a specific application of IoT, named Safety Management System for Tower Crane Groups (SMS-TC), is proposed for use in the construction industry. The operating status of each tower crane was detected by a set of customized sensors, including horizontal and vertical position sensors for the trolley, angle sensors for the jib and load, and tilt and wind speed sensors for the tower body. The sensor data is collected and processed by the Tower Crane Safety Terminal Equipment (TC-STE) installed in the driver's operating room. Wireless communication between each TC-STE and the Local Monitoring Terminal (LMT) at the ground worksite was implemented through a Zigbee wireless network. The LMT can share the status information of the whole group with each TC-STE, while it also records the real-time data and reports it to the Remote Supervision Platform (RSP) through General Packet Radio Service (GPRS). Based on the global status data of the whole group, an anti-collision algorithm was executed in each TC-STE to ensure the safety of each tower crane during construction. Remote supervision can be carried out using our client software installed on a personal computer (PC) or smartphone. SMS-TC could be considered a promising practical application that combines a Wireless Sensor Network with the Internet of Things.


Subjects
Computer Communication Networks/instrumentation; Internet/instrumentation; Safety Management/methods; Wireless Technology/instrumentation; Algorithms; Equipment Design/instrumentation; Management Information Systems; Microcomputers; Signal Processing, Computer-Assisted/instrumentation; Software; User-Computer Interface
8.
Biochem Biophys Res Commun ; 431(2): 221-4, 2013 Feb 08.
Article in English | MEDLINE | ID: mdl-23313482

ABSTRACT

Alternative splicing (AS) increases protein diversity by generating multiple transcript isoforms from a single gene in higher eukaryotes. Up to 48% of plant genes exhibit alternative splicing, which has been shown to be involved in some important plant functions such as the stress response. A hybrid feature extraction approach that combines the position weight matrix (PWM) with the increment of diversity (ID) was proposed to represent the base conservative level (BCL) near splice sites and the similarity level of two datasets, respectively. Using the extracted features, a support vector machine (SVM) was applied to classify alternative and constitutive splice sites. The proposed algorithm correctly classified 80.8% of donor sites and 85.4% of acceptor sites. The novel computational method is expected to be promising for the identification of AS sites in plants.


Subjects
Alternative Splicing; Computational Biology; Genome, Plant/genetics; Plants/genetics; RNA Splice Sites; RNA, Plant/genetics; Sequence Analysis, DNA/methods; Algorithms; Arabidopsis/genetics
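
Illustrative sketch (not from the paper): the abstract above uses a position weight matrix to capture base conservation near splice sites. A minimal way to build and apply such a matrix is shown below, assuming aligned equal-length windows, a uniform background of 0.25 and a pseudocount of 1. The function names and the toy sequences are hypothetical, and the increment of diversity feature is not covered.

```python
import math
from collections import Counter

BASES = "ACGT"


def build_pwm(aligned_sites, pseudocount=1.0, background=0.25):
    """Return one dict per position mapping base -> log2-odds score."""
    length = len(aligned_sites[0])
    if any(len(site) != length for site in aligned_sites):
        raise ValueError("all sites must have the same length")
    total = len(aligned_sites) + pseudocount * len(BASES)
    pwm = []
    for pos in range(length):
        counts = Counter(site[pos] for site in aligned_sites)
        column = {}
        for base in BASES:
            freq = (counts[base] + pseudocount) / total
            column[base] = math.log2(freq / background)
        pwm.append(column)
    return pwm


def score(pwm, window):
    """Sum the per-position log-odds scores of a candidate window."""
    return sum(column[base] for column, base in zip(pwm, window))


# Toy donor-like windows (hypothetical data, not from the paper).
sites = ["AAGGTAAGT", "CAGGTGAGT", "AAGGTAAGA", "CAGGTAAGT"]
pwm = build_pwm(sites)
consensus_score = score(pwm, "CAGGTAAGT")
unrelated_score = score(pwm, "TTTCCCTTT")
```

A window resembling the training sites scores higher than an unrelated one, and such scores could serve as input features for a downstream classifier.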
9.
Sensors (Basel) ; 13(3): 3142-56, 2013 Mar 06.
Article in English | MEDLINE | ID: mdl-23467056

ABSTRACT

The performance of conventional minutiae-based fingerprint authentication algorithms degrades significantly when dealing with low quality fingerprints with lots of cuts or scratches. A similar degradation of the minutiae-based algorithms is observed when small overlapping areas appear because of the quite narrow width of the sensors. Based on the detection of minutiae, Scale Invariant Feature Transformation (SIFT) descriptors are employed to fulfill verification tasks in the above difficult scenarios. However, the original SIFT algorithm is not suitable for fingerprint because of: (1) the similar patterns of parallel ridges; and (2) high computational resource consumption. To enhance the efficiency and effectiveness of the algorithm for fingerprint verification, we propose a SIFT-based Minutia Descriptor (SMD) to improve the SIFT algorithm through image processing, descriptor extraction and matcher. A two-step fast matcher, named improved All Descriptor-Pair Matching (iADM), is also proposed to implement the 1:N verifications in real-time. Fingerprint Identification using SMD and iADM (FISiA) achieved a significant improvement with respect to accuracy in representative databases compared with the conventional minutiae-based method. The speed of FISiA also can meet real-time requirements.


Subjects
Algorithms; Dermatoglyphics; Pattern Recognition, Automated; Artificial Intelligence; Biometry; Humans; Image Processing, Computer-Assisted; Information Storage and Retrieval
10.
Guang Pu Xue Yu Guang Pu Fen Xi ; 32(11): 2976-80, 2012 Nov.
Article in Chinese | MEDLINE | ID: mdl-23387161

ABSTRACT

In non-dispersive infrared gas analysis, the most difficult challenge is maintaining very low zero and temperature drift over long periods. Electronic and detector response drifts have inevitably required some form of manual zeroing procedure. To address zero and temperature drift, a multi-parameter model was developed that corrects both drifts automatically. The parameters include zero-gas intensity, reference-channel intensity, standard temperature, ambient temperature, temperature drift coefficients, etc. Trial results and in-situ applications showed that the monitoring precision of the instrument was less than 5% F.S. at different temperatures and over long periods. After compensation, the average precision for monitoring carbon monoxide concentration improved from 9.26% to 1.23%, and that for hydrocarbon concentration from 10.61% to 0.70%. The instrument requires essentially no periodic calibration and has a very low maintenance cost.

11.
Commun Biol ; 5(1): 608, 2022 06 20.
Article in English | MEDLINE | ID: mdl-35725901

ABSTRACT

Topologically associating domains (TADs) are fundamental building blocks of the three-dimensional genome and are organized into complex hierarchies. Identifying hierarchical TADs in Hi-C data helps to understand the relationship between genome architecture and gene regulation. Herein we propose TADfit, a multivariate linear regression model for profiling hierarchical chromatin domains. TADfit fits the interaction frequencies in a Hi-C contact matrix, with or without replicates, using all possible hierarchical TADs; the significant ones are determined from the regression coefficients obtained with an online learning solver called Follow-The-Regularized-Leader (FTRL). Unlike existing methods, TADfit can handle multiple contact matrix replicates and find partially overlapping TADs in them, which helps to recover the comprehensive underlying TADs across replicates from different experiments. Comparative results show that TADfit has better accuracy and reproducibility, and that the hierarchical TADs it calls exhibit reasonable biological relevance.


Subjects
Chromatin; Chromosomes; Chromatin/genetics; Genome; Linear Models; Reproducibility of Results
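
Illustrative sketch (not TADfit's code): the abstract names FTRL as the solver behind the regression coefficients. The block below shows the commonly described per-coordinate FTRL-Proximal update applied to a squared-error linear regression, to make clear how an L1 term can drive coefficients to exactly zero. The class name, hyperparameter values and toy samples are assumptions for illustration only.

```python
import math


class FTRLProximal:
    """Per-coordinate FTRL-Proximal for squared-error linear regression."""

    def __init__(self, dim, alpha=0.5, beta=1.0, l1=0.0, l2=0.0):
        self.alpha, self.beta, self.l1, self.l2 = alpha, beta, l1, l2
        self.z = [0.0] * dim  # accumulated adjusted gradients
        self.n = [0.0] * dim  # accumulated squared gradients

    def weights(self):
        w = []
        for z_i, n_i in zip(self.z, self.n):
            if abs(z_i) <= self.l1:
                w.append(0.0)  # L1 threshold keeps this coefficient at zero
            else:
                sign = 1.0 if z_i > 0 else -1.0
                denom = (self.beta + math.sqrt(n_i)) / self.alpha + self.l2
                w.append(-(z_i - sign * self.l1) / denom)
        return w

    def update(self, x, y):
        """Process one sample (x, y) and return the prediction made before updating."""
        w = self.weights()
        pred = sum(w_i * x_i for w_i, x_i in zip(w, x))
        for i, x_i in enumerate(x):
            g = (pred - y) * x_i  # gradient of 0.5 * (pred - y)^2
            sigma = (math.sqrt(self.n[i] + g * g) - math.sqrt(self.n[i])) / self.alpha
            self.z[i] += g - sigma * w[i]
            self.n[i] += g * g
        return pred


# Toy data (hypothetical): y = 2 * x0, second feature is always zero.
samples = [([1.0, 0.0], 2.0), ([2.0, 0.0], 4.0)]
dense = FTRLProximal(dim=2)              # no regularization
sparse = FTRLProximal(dim=2, l1=1000.0)  # very strong L1 penalty
for _ in range(50):
    for x, y in samples:
        dense.update(x, y)
        sparse.update(x, y)
```

Without regularization the first coefficient approaches 2, while a very large L1 penalty keeps every coefficient at exactly zero. In a setting like the one described, nonzero coefficients would mark the candidate domains judged significant.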
12.
Guang Pu Xue Yu Guang Pu Fen Xi ; 31(11): 3050-4, 2011 Nov.
Article in Chinese | MEDLINE | ID: mdl-22242515

ABSTRACT

Miniature mobile field spectrometry is pivotal for qualitative and quantitative in-situ analysis of chemical substances. To address spectrum signals corrupted by complicated noise, the recognition of overlapping and irregular peak shapes, and the need for quick monitoring, an integrated on-line processing method for spectrometric data based on the wavelet transform and Gaussian fitting was developed. Toluene and perfluorotributylamine spectra were processed in this way, and the results show that the integrated method effectively eliminates noise, retains the original features, and corrects overlapping and asymmetrical peaks, which improves the analysis accuracy of the instrument and also achieves data compression. In addition, the method satisfies the requirements of on-site analysis for mobile field spectrometry. For the mass spectra of toluene, at the characteristic peaks of 91 and 92, the SNR increased 1.3 times compared with the moving-average smoothing method, while the error between the original and theoretical peaks decreased 3.6 times. Gaussian fitting described the multipoint mass spectral data with three Gaussian parameters and thereby achieved data compression. For the mass spectrum of perfluorotributylamine, the compression ratio was 197:1.
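
Illustrative sketch (not the authors' fitting procedure): the abstract states that a peak made of many sampled points can be summarized by three Gaussian parameters. For a noise-free peak this is easy to see, because the logarithm of a Gaussian is a parabola, so three equally spaced samples determine amplitude, center and width in closed form. The function names and the toy peak near m/z 91 are assumptions.

```python
import math


def gaussian(x, amplitude, mu, sigma):
    """Evaluate a Gaussian peak at x."""
    return amplitude * math.exp(-((x - mu) ** 2) / (2.0 * sigma ** 2))


def gaussian_from_three_points(x_center, step, y_left, y_center, y_right):
    """Recover (amplitude, mu, sigma) from samples at x_center - step, x_center, x_center + step."""
    a, b, c = math.log(y_left), math.log(y_center), math.log(y_right)
    curvature = a - 2.0 * b + c  # second difference of the log-intensities
    if curvature >= 0:
        raise ValueError("samples do not describe a peak")
    sigma = math.sqrt(-step * step / curvature)
    offset = step * (a - c) / (2.0 * curvature)  # peak position relative to x_center
    mu = x_center + offset
    amplitude = math.exp(b + offset * offset / (2.0 * sigma * sigma))
    return amplitude, mu, sigma


# Toy peak (hypothetical values) sampled at three points.
true_params = (100.0, 91.03, 0.12)
xs = [90.9, 91.0, 91.1]
ys = [gaussian(x, *true_params) for x in xs]
amp, mu, sigma = gaussian_from_three_points(xs[1], 0.1, *ys)
```

Real spectra are noisy, so a practical implementation would fit all points of a peak by least squares after denoising; the closed form only shows why three numbers suffice to store a Gaussian-shaped peak.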

13.
J Bioinform Comput Biol ; 15(3): 1750010, 2017 Jun.
Article in English | MEDLINE | ID: mdl-28403667

ABSTRACT

The transmembrane region (TR) is a conserved region of the transmembrane (TM) subunit of the envelope (env) glycoprotein of retroviruses. Evidence has shown that the TR is responsible for anchoring the env glycoprotein on the lipid bilayer and that substitution of the TR for a covalently linked lipid anchor abrogates fusion. However, universal software cannot achieve sufficient accuracy, because the TM subunit in env also contains several motifs composed largely of hydrophobic residues, such as the signal peptide, fusion peptide and immunosuppressive domain. In this paper, a support vector machine (SVM)-based model is proposed to identify TRs in retroviruses. First, physicochemical and evolutionary information properties were extracted as the original features. Then, feature importance was analyzed with the minimum Redundancy Maximum Relevance (mRMR) feature selection criterion. Our model achieved an Sn of 0.955, Sp of 0.998, ACC of 0.995 and MCC of 0.954 using 10-fold cross-validation on the training dataset. These results suggest that the proposed model can be used to predict TRs in unannotated retroviruses; 11,917, 3,344, 2, 289 and 6 new putative TRs were found in HERV, HIV, HTLV, SIV and MLV, respectively.


Subjects
Algorithms; Gene Products, env/chemistry; Retroviridae/chemistry; Viral Envelope Proteins/chemistry; Cell Membrane/virology; Computer Simulation; Gene Products, env/metabolism; Retroviridae/metabolism; Software; Support Vector Machine; Viral Envelope Proteins/metabolism
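
Illustrative sketch: the Sn, Sp, ACC and MCC values reported above are standard functions of a binary confusion matrix. The helper below computes them from counts of true and false positives and negatives; the function name and the example counts are hypothetical and do not reproduce the paper's data. It assumes both classes are present, so the Sn and Sp denominators are nonzero.

```python
import math


def binary_metrics(tp, fp, tn, fn):
    """Return sensitivity, specificity, accuracy and Matthews correlation coefficient."""
    sn = tp / (tp + fn)                     # sensitivity (recall on positives)
    sp = tn / (tn + fp)                     # specificity (recall on negatives)
    acc = (tp + tn) / (tp + fp + tn + fn)   # overall accuracy
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {"Sn": sn, "Sp": sp, "ACC": acc, "MCC": mcc}


# Hypothetical counts for illustration.
metrics = binary_metrics(tp=90, fp=5, tn=95, fn=10)
```

MCC is useful alongside ACC when classes are unbalanced, since accuracy alone can look high when one class dominates.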
14.
Comput Biol Med ; 89: 264-274, 2017 10 01.
Article in English | MEDLINE | ID: mdl-28850898

ABSTRACT

A filter feature selection technique has been widely used to mine biomedical data. Recently, in the classical filter method minimal-Redundancy-Maximal-Relevance (mRMR), a risk has been revealed that a specific part of the redundancy, called irrelevant redundancy, may be involved in the minimal-redundancy component of this method. Thus, a few attempts to eliminate the irrelevant redundancy by attaching additional procedures to mRMR, such as Kernel Canonical Correlation Analysis based mRMR (KCCAmRMR), have been made. In the present study, a novel filter feature selection method based on the Maximal Information Coefficient (MIC) and Gram-Schmidt Orthogonalization (GSO), named Orthogonal MIC Feature Selection (OMICFS), was proposed to solve this problem. Different from other improved approaches under the max-relevance and min-redundancy criterion, in the proposed method, the MIC is used to quantify the degree of relevance between feature variables and target variable, the GSO is devoted to calculating the orthogonalized variable of a candidate feature with respect to previously selected features, and the max-relevance and min-redundancy can be indirectly optimized by maximizing the MIC relevance between the GSO orthogonalized variable and target. This orthogonalization strategy allows OMICFS to exclude the irrelevant redundancy without any additional procedures. To verify the performance, OMICFS was compared with other filter feature selection methods in terms of both classification accuracy and computational efficiency by conducting classification experiments on two types of biomedical datasets. The results showed that OMICFS outperforms the other methods in most cases. In addition, differences between these methods were analyzed, and the application of OMICFS in the mining of high-dimensional biomedical data was discussed. The Matlab code for the proposed method is available at https://github.com/lhqxinghun/bioinformatics/tree/master/OMICFS/.


Subjects
Data Mining/methods; Electronic Data Processing/methods; Models, Theoretical
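
Illustrative sketch (not the published MATLAB code): the abstract describes orthogonalizing a candidate feature with respect to previously selected features before measuring its relevance. The block below shows only that Gram-Schmidt step in plain Python; the MIC computation is omitted, and the function names and toy vectors are assumptions.

```python
def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def orthogonalize(candidate, selected, eps=1e-12):
    """Return the part of `candidate` orthogonal to every vector in `selected`."""
    # Build an orthogonal basis of the already-selected features.
    basis = []
    for feature in selected:
        q = list(feature)
        for b in basis:
            coeff = dot(q, b) / dot(b, b)
            q = [q_i - coeff * b_i for q_i, b_i in zip(q, b)]
        if dot(q, q) > eps:  # skip features that add no new direction
            basis.append(q)
    # Remove the candidate's projection onto each basis vector.
    residual = list(candidate)
    for b in basis:
        coeff = dot(residual, b) / dot(b, b)
        residual = [r_i - coeff * b_i for r_i, b_i in zip(residual, b)]
    return residual


# Toy feature vectors (hypothetical).
selected = [[1.0, 0.0, 0.0], [1.0, 1.0, 0.0]]
novel = orthogonalize([1.0, 1.0, 1.0], selected)      # keeps the new direction
redundant = orthogonalize([2.0, 2.0, 0.0], selected)  # fully explained already
```

A candidate that is a combination of selected features leaves a zero residual, so any relevance measure applied to the residual would give it no credit for redundant information.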
15.
PLoS One ; 12(5): e0176909, 2017.
Article in English | MEDLINE | ID: mdl-28472185

ABSTRACT

Human endogenous retroviruses (HERVs) encode active retroviral proteins, which may be involved in the progression of cancer and other diseases. The matrix protein (MA), encoded in the group-specific antigen gene (gag) of retroviruses, is associated with the virus envelope glycoproteins in most mammalian retroviruses and may be involved in virus particle assembly, transport and budding. However, the number of annotated MAs in ERVs is still low, and no computational method has yet been proposed to predict the exact start and end coordinates of MAs in gag. In this paper, a computational method to identify MAs in ERVs is proposed. A divide-and-conquer technique was designed and applied to the conventional prediction model to obtain better results on gene sequences of various lengths. Initiation sites and termination sites were predicted separately and then combined according to their intervals. Three algorithms were applied and compared: weighted support vector machine (WSVM), weighted extreme learning machine (WELM) and random forest (RF). The G-mean (geometric mean of sensitivity and specificity) values for initiation sites and termination sites under 5-fold cross-validation with random forest models are 0.9869 and 0.9755, respectively, the highest among the algorithms applied. Our prediction models combine the RF and WSVM algorithms to achieve the best prediction results. The models predicted 98.4% of all collected ERV sequences with complete MAs (125 in total) exactly. A total of 94,671 HERV sequences from 118 families were scanned by the model, and 104 new putative MAs were predicted in human chromosomes. The distributions of the putative MAs and the optimization of model parameters were also analyzed. The prediction method was also extended to other retroviruses, with satisfactory results.


Subjects
Computational Biology; Endogenous Retroviruses/metabolism; Viral Matrix Proteins/metabolism; Animals; Humans
16.
PLoS One ; 11(10): e0165216, 2016.
Article in English | MEDLINE | ID: mdl-27755604

ABSTRACT

[This corrects the article DOI: 10.1371/journal.pone.0161913.].

17.
PLoS One ; 11(8): e0161913, 2016.
Article in English | MEDLINE | ID: mdl-27574780

ABSTRACT

RNase H (RNH) is a pivotal retroviral domain that cleaves the DNA-RNA hybrid so that retroviral replication can continue. This crucial role makes RNH a promising drug target for therapeutic intervention. However, the RNHs annotated in the UniProtKB database are still too few to give a good understanding of their statistical characteristics. In this work, a computational RNH model was proposed to annotate new putative RNHs (np-RNHs) in retroviruses. It predicts RNH domains by recognizing their start and end sites separately with an SVM. The classification accuracies are 100%, 99.01% and 97.52% for the jack-knife, 10-fold cross-validation and 5-fold cross-validation tests, respectively. The model then discovered 14,033 np-RNHs after scanning sequences without RNH annotations. All predicted np-RNHs and annotated RNHs were used to analyze the length, hydrophobicity and evolutionary relationships of RNH domains. These properties are all related to retroviral genera, which validates the classification of retroviruses to a certain degree. Finally, a software tool was designed for applying the prediction model. The software and the datasets used in this paper are available for free download at https://sourceforge.net/projects/rhtool/files/?source=navbar.


Subjects
Retroviridae/enzymology; Ribonuclease H/genetics; Sequence Analysis, Protein/methods; Computer Simulation; Databases, Protein; Software; Viral Proteins/genetics
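
Illustrative sketch: the jack-knife, 10-fold and 5-fold tests mentioned above differ only in how the samples are partitioned into held-out folds. The generator below produces train and test index lists for k folds; setting k to the number of samples gives the jack-knife (leave-one-out) test. The function name, seed and sample count are assumptions.

```python
import random


def k_fold_indices(n_samples, k, seed=0):
    """Yield (train_indices, test_indices) pairs for k-fold cross-validation."""
    if not 2 <= k <= n_samples:
        raise ValueError("k must be between 2 and n_samples")
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]  # near-equal fold sizes
    for i, test in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test


five_fold = list(k_fold_indices(20, 5))
jack_knife = list(k_fold_indices(20, 20))  # leave-one-out
```

Each sample is held out exactly once, so the accuracy reported for a test is the fraction of held-out samples classified correctly over all folds.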
18.
Comput Biol Chem ; 61: 245-50, 2016 Apr.
Article in English | MEDLINE | ID: mdl-26963379

ABSTRACT

As a pivotal domain within the envelope protein, the fusion peptide (FP) plays a crucial role in pathogenicity and therapeutic intervention. Given the limited FP annotations in the NCBI database and the absence of FP prediction software, a bioinformatics tool to predict new putative FPs (np-FPs) in retroviruses is urgently needed. In this work, a sequence-based FP model was proposed by combining a Hidden Markov Method with similarity comparison. The classification accuracies are 91.97% and 92.31% for 10-fold and leave-one-out cross-validation, respectively. After scanning sequences without FP annotations, the model discovered 53,946 np-FPs. Statistics on the FPs and np-FPs reveal that the FP is a conserved, hydrophobic domain. The FP software, written for the Windows environment, is available at https://sourceforge.net/projects/fptool/files/?source=navbar.


Subjects
Models, Biological; Peptides/metabolism; Recombinant Fusion Proteins/metabolism; Retroviridae/metabolism
19.
Comput Biol Chem ; 61: 221-5, 2016 Apr.
Article in English | MEDLINE | ID: mdl-26917277

ABSTRACT

Circular RNAs (circRNAs) were discovered more than 30 years ago but were long treated as molecular flukes. By combining deep sequencing with bioinformatics techniques, thousands of endogenous circRNAs have been found in mammalian cells, and several circRNAs have been shown to act as competing endogenous RNAs (ceRNAs) that regulate gene expression. However, the mechanism that determines whether a precursor mRNA is processed into a circular RNA or a linear mRNA is largely unknown. In this paper, we attempted to identify, bioinformatically, shared genomic features that might further elucidate the mechanism of formation, and we proposed an SVM-based model to distinguish circRNAs from non-circularized, expressed exons. First, conformational and thermodynamic dinucleotide properties of the flanking introns were extracted as potential features. Second, two feature selection methods were applied to obtain the optimal feature subset. Our 10-fold cross-validation results showed that the model can distinguish circRNAs from non-circularized, expressed exons with an Sn of 0.884, Sp of 0.900, ACC of 0.892 and MCC of 0.784. The identification results suggest that conformational and thermodynamic properties of the flanking introns are closely related to the formation of circRNAs. The datasets and the tool used in this paper are available at https://sourceforge.net/projects/predicircrnatool/files/.


Subjects
Introns; RNA/chemistry; RNA, Circular; Thermodynamics
20.
Comput Biol Chem ; 62: 96-103, 2016 06.
Article in English | MEDLINE | ID: mdl-27107687

ABSTRACT

Regulatory single nucleotide polymorphisms (rSNPs) in human genomes are thought to be responsible for phenotypic differences, including susceptibility to diseases and treatment outcomes, even though they do not change any gene product. However, a genome-wide search for rSNPs has not been properly addressed so far. In this work, a computational method for rSNP identification is proposed. As background SNPs far outnumber rSNPs, an ensemble method is applied to handle the imbalanced data: it first converts the unbalanced dataset into several balanced ones and then builds a model for each balanced dataset. Two major types of features are extracted: sequence-based features and allele-specific features. A random forest is then applied to build the recognition model for each balanced dataset. Finally, ensemble strategies are adopted to combine the results of the models. We tested our method on a set of experimentally verified rSNPs, and leave-one-out cross-validation showed that it achieves a sensitivity of 73.8%, a specificity of 71.8% and an area under the ROC curve (AUC) of 0.756. In addition, our method is threshold-free and does not rely on data on regulatory elements, so it should adapt better to different data scenarios. The original data and the MATLAB source code are available at https://sourceforge.net/projects/rsnpdect/.


Subjects
Algorithms; Computational Biology/methods; Genome, Human/genetics; Polymorphism, Single Nucleotide/genetics; Databases as Topic; Humans; ROC Curve
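
Illustrative sketch (not the published MATLAB code): the abstract describes converting one unbalanced dataset into several balanced ones, training a model on each, and combining the results. One simple way to do this, assumed here, is to split the majority class into chunks the size of the minority class and to combine per-model votes by simple majority. The paper does not state its exact partitioning or voting rule, and all names and toy data below are hypothetical.

```python
import random


def balanced_subsets(minority, majority, seed=0):
    """Pair the full minority class with equal-sized, disjoint chunks of the majority class."""
    shuffled = list(majority)
    random.Random(seed).shuffle(shuffled)
    size = len(minority)
    n_subsets = max(1, len(shuffled) // size)
    subsets = []
    for i in range(n_subsets):
        chunk = shuffled[i * size:(i + 1) * size]  # leftover samples are dropped
        subsets.append((list(minority), chunk))
    return subsets


def majority_vote(predictions):
    """Combine 0/1 predictions from the per-subset models; ties count as 0."""
    return 1 if sum(predictions) * 2 > len(predictions) else 0


# Toy sample identifiers (hypothetical): 10 positives, 45 background samples.
positives = list(range(10))
background = list(range(100, 145))
subsets = balanced_subsets(positives, background)
```

Each subset would then be used to train one classifier, for example a random forest, and `majority_vote` would merge their predictions for a new sample.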