ABSTRACT
Pseudouridine is an RNA modification that is widely distributed in both prokaryotes and eukaryotes and plays a critical role in numerous biological activities. Despite its importance, the precise identification of pseudouridine sites through experimental approaches poses significant challenges, requiring substantial time and resources. Therefore, there is a growing need for computational techniques that can reliably and quickly identify pseudouridine sites from vast amounts of RNA sequencing data. In this study, we propose a fuzzy kernel evidence Random Forest (FKeERF) to identify pseudouridine sites. The resulting method, PseU-FKeERF, demonstrates high accuracy in identifying pseudouridine sites from RNA sequencing data. The PseU-FKeERF model selects four RNA feature coding schemes with relatively good performance, combines them, and feeds the combined features into the newly proposed FKeERF method for category prediction. FKeERF not only uses fuzzy logic to expand the original feature space, but also incorporates kernel methods, which are generally easy to interpret, for category prediction. Both cross-validation tests and independent tests on benchmark datasets show that PseU-FKeERF has better predictive performance than several state-of-the-art methods. This new method not only improves the accuracy of pseudouridine site identification, but also provides a reference for future disease control and related drug development.
Subjects
Pseudouridine , Random Forest Algorithm , Pseudouridine/genetics , RNA/genetics , Base Sequence
ABSTRACT
BACKGROUND: Studying the composition rules and evolution mechanisms of genome sequences is a core issue in the post-genomic era, and k-mer spectrum analysis of genome sequences is an effective means of addressing it. RESULT: We divided all 8-mers of genome sequences into 16 XY-types according to the number of XY dinucleotides in each 8-mer. Previous work showed that independent unimodal distributions are observed only in three CG-type 8-mer spectra, while non-CG type 8-mer spectra do not show this universal phenomenon from prokaryotes to eukaryotes. On this basis, we analyzed how the distributions of non-CG type 8-mer spectra vary across 889 animal genome sequences. Following the evolutionary order of animals from primitive to more complex, we found that the spectrum distributions gradually transition from unimodal to tri-modal. The relative distance from the average frequency of each non-CG type of 8-mers to the center frequency differs both within a species and among species. For the 8-mers that contain CG dinucleotides, we further divided them into 16 subsets, in which each 8-mer contains both CG and XY dinucleotides, called XY1_CG1 subsets. We found that the separability values of XY1_CG1 spectra are closely related to the evolution and specificity of animals. Considering the constraint of Chargaff's second parity rule, we finally obtained 10 separability values as the feature set to characterize the evolution state of genome sequences. To verify the rationality of the feature set, we used 14 common classification algorithms to perform binary classification tests. The results showed that the accuracy (Acc) ranged from 83.88% to 98.70% among birds, other vertebrates and mammals. CONCLUSION: We proposed a credible feature set to characterize the evolution state of genomes and obtained satisfactory results with this feature set in large-scale classification of animals.
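The 8-mer bookkeeping described in this abstract can be sketched in a few lines. This is only an illustrative sketch, not the authors' pipeline: the function names (`kmer_counts`, `dinucleotide_occurrences`, `split_by_dinucleotide`) are hypothetical, and it assumes that an 8-mer's XY-type is determined by how many times the dinucleotide XY occurs inside it.

```python
from collections import Counter


def kmer_counts(seq, k=8):
    """Count every k-mer in seq, skipping windows with non-ACGT symbols."""
    seq = seq.upper()
    counts = Counter()
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        if set(kmer) <= set("ACGT"):
            counts[kmer] += 1
    return counts


def dinucleotide_occurrences(kmer, xy="CG"):
    """Number of positions in kmer where the dinucleotide xy starts."""
    return sum(1 for i in range(len(kmer) - 1) if kmer[i:i + 2] == xy)


def split_by_dinucleotide(counts, xy="CG"):
    """Group k-mer counts by how many xy dinucleotides each k-mer contains.

    Returns a dict such as {0: Counter(...), 1: Counter(...), 2: Counter(...)}
    (assumed grouping; the abstract does not spell out the exact rule).
    """
    groups = {}
    for kmer, n in counts.items():
        key = dinucleotide_occurrences(kmer, xy)
        groups.setdefault(key, Counter())[kmer] = n
    return groups


if __name__ == "__main__":
    counts = kmer_counts("ACGTACGTAC")
    for n_cg, group in sorted(split_by_dinucleotide(counts, "CG").items()):
        print(n_cg, dict(group))
```

The frequency distribution of each group (the "spectrum") would then be a histogram over the counts in that group; how the separability values are derived from those spectra is not specified in the abstract and is not attempted here.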
Subjects
Molecular Evolution , Genome , Animals , Genomics/methods , Algorithms , DNA Sequence Analysis/methods
ABSTRACT
BACKGROUND: Exploring the evolutionary regularities of genome sequences and constructing more objective species evolution relationships at the genomic level are high-profile topics. Based on the evolution mechanism of genome sequences proposed in our previous research, we found that only the 8-mers containing CG or TA dinucleotides correlate directly with the evolution of genome sequences, and that the relative frequency of these 8-mers, rather than their actual frequency, is more suitable for characterizing the evolution of genome sequences. RESULT: We therefore obtained two types of feature sets: the relative frequency sets of CG1 + CG2 8-mers and of TA1 + TA2 8-mers. The evolution relationships of mammals and reptiles were constructed from the relative frequency set of CG1 + CG2 8-mers, and two types of evolution relationships of insects were constructed from the relative frequency sets of CG1 + CG2 8-mers and TA1 + TA2 8-mers, respectively. Through comparison and analysis, we found that these evolution relationships are consistent with known conclusions. According to the evolution mechanism, we consider that the evolution relationship constructed from CG1 + CG2 8-mers reflects the current evolution state of genome sequences, whereas the one constructed from TA1 + TA2 8-mers reflects the evolution state at an early stage. CONCLUSION: Our study provides objective feature sets for constructing evolution relationships at the genomic level.
Subjects
Genome , Genomics , Animals , Mammals/genetics
ABSTRACT
Vocal emotion recognition (VER) in natural speech, often referred to as speech emotion recognition (SER), remains challenging for both humans and computers. Applied fields including clinical diagnosis and intervention, social interaction research, and Human Computer Interaction (HCI) increasingly benefit from efficient VER algorithms. Several feature sets have been used with machine-learning (ML) algorithms for discrete emotion classification. However, there is no consensus on which low-level descriptors and classifiers are optimal. Therefore, we aimed to compare the performance of machine-learning algorithms with several different feature sets. Concretely, seven ML algorithms were compared on the Berlin Database of Emotional Speech: Multilayer Perceptron Neural Network (MLP), J48 Decision Tree (DT), Support Vector Machine with Sequential Minimal Optimization (SMO), Random Forest (RF), k-Nearest Neighbor (KNN), Simple Logistic Regression (LOG) and Multinomial Logistic Regression (MLR), with 10-fold cross validation using four openSMILE feature sets (i.e., IS-09, emobase, GeMAPS and eGeMAPS). Results indicated that SMO, MLP and LOG performed better (reaching accuracies of 87.85%, 84.00% and 83.74%, respectively) than RF, DT, MLR and KNN (with minimum accuracies of 73.46%, 53.08%, 70.65% and 58.69%, respectively). Overall, the emobase feature set performed best. We discuss the implications of these findings for applications in diagnosis, intervention or HCI.
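The evaluation protocol named here, 10-fold cross validation, can be illustrated with a small standard-library sketch. It is not the study's setup: the helper names are hypothetical, a plain k-nearest-neighbour vote stands in for the seven classifiers, and the toy feature vectors stand in for openSMILE features. Accuracy is pooled over all held-out samples rather than averaged per fold.

```python
import random
from collections import Counter


def k_fold_indices(n, k=10, seed=0):
    """Shuffle the indices 0..n-1 and deal them into k disjoint folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]


def knn_predict(train_x, train_y, x, k=3):
    """Majority vote among the k training points closest to x (squared Euclidean)."""
    order = sorted(
        range(len(train_x)),
        key=lambda i: sum((a - b) ** 2 for a, b in zip(train_x[i], x)),
    )
    return Counter(train_y[i] for i in order[:k]).most_common(1)[0][0]


def cross_val_accuracy(xs, ys, predict, k=10, seed=0):
    """Pooled accuracy: each sample is predicted once, by a model that never saw it."""
    correct = 0
    for fold in k_fold_indices(len(xs), k, seed):
        held_out = set(fold)
        train_x = [x for i, x in enumerate(xs) if i not in held_out]
        train_y = [y for i, y in enumerate(ys) if i not in held_out]
        correct += sum(predict(train_x, train_y, xs[i]) == ys[i] for i in fold)
    return correct / len(xs)


if __name__ == "__main__":
    # Two well-separated toy clusters standing in for two emotion classes.
    xs = [(i * 0.01, 0.0) for i in range(20)] + [(10 + i * 0.01, 0.0) for i in range(20)]
    ys = ["neutral"] * 20 + ["anger"] * 20
    print(cross_val_accuracy(xs, ys, knn_predict))
```

Comparing classifiers then amounts to calling `cross_val_accuracy` with the same folds (same `seed`) and a different `predict` function for each algorithm and feature set.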
Subjects
Machine Learning , Speech , Algorithms , Emotions , Humans , Computer Neural Networks , Support Vector Machine
ABSTRACT
In the case of bladder cancer, carcinoma in situ (CIS) is known to have poor diagnosis. However, there are not enough studies that examine the biomarkers relevant to CIS development. Omics experiments generate data with tens of thousands of descriptive variables, e.g., gene expression levels. Often, many of these descriptive variables are identified as somehow relevant, resulting in hundreds or thousands of relevant variables for building models or for further data analysis. We analyze one such dataset describing patients with bladder cancer, mostly non-muscle-invasive (NMIBC), and propose a novel approach to feature selection. This approach returns high-quality features for prediction and yet allows interpretability as well as a certain level of insight into the analyzed data. As a result, we obtain a small set of seven of the most-useful biomarkers for diagnostics. They can also be used to build tests that avoid the costly and time-consuming existing methods. We summarize the current biological knowledge of the chosen biomarkers and contrast it with our findings.
Subjects
Carcinoma in Situ , Urinary Bladder Neoplasms , Biomarkers , Tumor Biomarkers/genetics , Disease Progression , Humans , Neoplasm Invasiveness , Urinary Bladder/pathology , Urinary Bladder Neoplasms/diagnosis , Urinary Bladder Neoplasms/genetics , Urinary Bladder Neoplasms/pathology
ABSTRACT
Automated segmentation of brain tumors is difficult because of the variability and blurred boundaries of the lesions. In this study, we propose an automated model based on the Bendlet transform and an improved Chan-Vese (CV) model for brain tumor segmentation. Since the Bendlet system is based on the principle of sparse approximation, the Bendlet transform is first applied to describe the images and map them to the feature space, thereby obtaining the feature set. This helps in effectively exploring the mapping relationship between brain lesions and normal tissues, and in achieving multi-scale and multi-directional registration. Secondly, an SSIM region detection method is proposed to preliminarily locate the tumor region from three aspects: brightness, structure, and contrast. Finally, the CV model is solved by the Hermite-Shannon-Cosine wavelet homotopy method, and the boundary of the tumor region is more accurately delineated by the wavelet transform coefficients. We randomly selected some cross-sectional images to verify the effectiveness of the proposed algorithm and compared it with the CV, Otsu, K-FCM, and region growing segmentation methods. The experimental results showed that the proposed algorithm had higher segmentation accuracy and better stability.
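The three aspects named for region detection (brightness, contrast, structure) are the three comparisons that the standard structural similarity (SSIM) index folds into one score. The sketch below computes the usual single-window SSIM between two equally sized patches given as flat lists of pixel values. It is a generic illustration under stated assumptions (population statistics over the whole patch, the conventional constants `k1=0.01` and `k2=0.03`), not the region detection method proposed in the paper, whose details the abstract does not give.

```python
def ssim(x, y, dynamic_range=255.0, k1=0.01, k2=0.03):
    """Single-window SSIM between two patches given as flat pixel lists.

    Combines luminance (means), contrast (variances) and structure
    (covariance) into one score in [-1, 1]; identical patches give 1.
    """
    if len(x) != len(y) or not x:
        raise ValueError("patches must be non-empty and of equal size")
    n = len(x)
    c1 = (k1 * dynamic_range) ** 2  # stabilises the luminance term
    c2 = (k2 * dynamic_range) ** 2  # stabilises the contrast/structure term
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    var_x = sum((a - mean_x) ** 2 for a in x) / n
    var_y = sum((b - mean_y) ** 2 for b in y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / n
    numerator = (2 * mean_x * mean_y + c1) * (2 * cov + c2)
    denominator = (mean_x ** 2 + mean_y ** 2 + c1) * (var_x + var_y + c2)
    return numerator / denominator


if __name__ == "__main__":
    patch = [10, 50, 90, 130]
    print(ssim(patch, patch))        # identical patches
    print(ssim(patch, patch[::-1]))  # same values, reversed structure
```

A low SSIM between a patch and a reference of normal tissue could be used to flag candidate regions; that use is an assumption made for illustration only.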
ABSTRACT
When the use of optical images is not practical due to cloud cover, Synthetic Aperture Radar (SAR) imagery is a preferred alternative for monitoring coastal wetlands because it is unaffected by weather conditions. Polarimetric SAR (PolSAR) enables the detection of different backscattering mechanisms and thus has potential applications in land cover classification. Gaofen-3 (GF-3) is the first Chinese civilian satellite with multi-polarized C-band SAR imaging capability. Coastal wetland classification with GF-3 polarimetric SAR imagery has attracted increased attention in recent years, but it remains challenging. The aim of this study was to classify land cover in coastal wetlands using an object-oriented random forest algorithm on the basis of GF-3 polarimetric SAR imagery. First, a set of 16 commonly used SAR features was extracted. Second, the importance of each SAR feature was calculated, and the optimal polarimetric features were selected for wetland classification by combining random forest (RF) with sequential backward selection (SBS). Finally, the proposed algorithm was utilized to classify different land cover types in the Yancheng Coastal Wetlands. The results show that the most important parameters for wetland classification in this study were Shannon entropy, Span and orientation randomness, combined with features derived from Yamaguchi decomposition, namely, volume scattering, double scattering, surface scattering and helix scattering. When the object-oriented RF classification approach was used with the optimal feature combination, different land cover types in the study area were classified, with an overall accuracy of up to 92%.
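The feature selection step, sequential backward selection (SBS), can be sketched generically. In this sketch the `score` callback is a hypothetical stand-in for whatever quality measure is used (for example the accuracy of a random forest trained on the candidate subset); the toy scoring function and feature names in the usage example are invented for illustration and are not results from the study.

```python
def sequential_backward_selection(features, score, min_features=1):
    """Greedy SBS: repeatedly drop the feature whose removal scores best.

    features: list of feature names.
    score:    callable taking a list of feature names, returning a number
              (higher is better); hypothetical stand-in for classifier accuracy.
    Returns the best-scoring subset seen and its score.
    """
    current = list(features)
    best_subset, best_score = list(current), score(current)
    while len(current) > min_features:
        # Evaluate every subset obtained by removing exactly one feature.
        candidates = [[f for f in current if f != drop] for drop in current]
        scores = [score(c) for c in candidates]
        i = max(range(len(candidates)), key=scores.__getitem__)
        current = candidates[i]
        if scores[i] >= best_score:
            best_subset, best_score = list(current), scores[i]
    return best_subset, best_score


if __name__ == "__main__":
    useful = {"shannon_entropy", "span"}

    def toy_score(subset):
        # Reward useful features, penalise subset size (illustrative only).
        return len(useful & set(subset)) - 0.1 * len(subset)

    names = ["shannon_entropy", "span", "noise_1", "noise_2"]
    print(sequential_backward_selection(names, toy_score))
```

Because each round re-scores every remaining feature, the cost grows quadratically with the number of features, which stays manageable for a set of 16 SAR features.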
Subjects
Radar , Wetlands , Algorithms , Environmental Monitoring
ABSTRACT
This paper presents a high-precision, low-computational-complexity premature ventricular contraction (PVC) assessment method for ECG human-machine interface devices. The original signals are preprocessed by integrated filters. Then, R points and surrounding feature points are determined by corresponding detection algorithms. On this basis, a complex feature set and feature matrices are obtained according to the positions of the feature points. Finally, an exponential Minkowski distance method is proposed for PVC recognition. Both a public dataset and clinical experiments were used to verify the effectiveness and superiority of the proposed method. The results show that our R peak detection algorithm can substantially reduce the error rate and achieved 98.97% accuracy for QRS complexes. Meanwhile, the accuracy of PVC recognition was 98.69% for the MIT-BIH database and 98.49% for clinical tests. Moreover, because the model is lightweight, it can be easily applied to portable healthcare devices for human-computer interaction.
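The abstract does not define its exponential variant, so the sketch below shows only the ordinary Minkowski distance and a nearest-template rule built on it, as background for the kind of distance-based recognition described. The function names, the two-element feature vectors, and the template values are all hypothetical.

```python
def minkowski_distance(a, b, p=2.0):
    """Ordinary Minkowski distance of order p between two feature vectors.

    p=1 gives the Manhattan distance, p=2 the Euclidean distance.
    """
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    if p < 1:
        raise ValueError("p must be at least 1")
    return sum(abs(x - y) ** p for x, y in zip(a, b)) ** (1.0 / p)


def nearest_template(features, templates, p=2.0):
    """Return the label whose template vector is closest to features."""
    return min(templates, key=lambda label: minkowski_distance(features, templates[label], p))


if __name__ == "__main__":
    # Hypothetical beat templates: (normalised RR interval, normalised QRS width).
    templates = {"normal": [1.0, 1.0], "pvc": [0.6, 1.8]}
    print(nearest_template([0.95, 1.05], templates))
    print(nearest_template([0.65, 1.70], templates))
```

A distance rule like this needs no training beyond storing templates, which fits the low-complexity goal stated in the abstract.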
Subjects
Computer-Assisted Diagnosis/methods , Automated Pattern Recognition/methods , Ventricular Premature Complexes/diagnosis , Algorithms , Factual Databases , Electrocardiography/methods , Humans
ABSTRACT
Point cloud classification is an essential requirement for effectively utilizing point cloud data acquired by Terrestrial laser scanning (TLS). Neighborhood selection, feature selection and extraction, and classification of points based on the respective features constitute the commonly used workflow of point cloud classification. Feature selection and extraction has been the focus of many studies, and the choice of different features has had a great impact on classification results. In previous studies, geometric features were widely used for TLS point cloud classification, and only a few studies investigated the potential of both intensity and color on classification using TLS point cloud. In this paper, the geometric features, color features, and intensity features were extracted based on a supervoxel neighborhood. In addition, the original intensity was also corrected for range effect, which is why the corrected intensity features were also extracted. The different combinations of these features were tested on four real-world data sets. Experimental results demonstrate that both color and intensity features can complement the geometric features to help improve the classification results. Furthermore, the combination of geometric features, color features, and corrected intensity features together achieves the highest accuracy in our test.
ABSTRACT
Characterization and identification of recombination hotspots provide important insights into the mechanism of recombination and genome evolution. In contrast with existing sequence-based models for predicting recombination hotspots, which were defined in an ORF-based manner, here we first defined recombination hot/cold spots based on public high-resolution Spo11-oligo-seq data, then characterized them in terms of DNA sequence and epigenetic marks, and finally presented classifiers to identify hotspots. We found that, in addition to some previously discovered DNA-based features like GC-skew, recombination hotspots in yeast can also be characterized by some remarkable features associated with DNA physical properties and shape. More importantly, by using DNA-based features and several epigenetic marks, we built several classifiers to discriminate hotspots from coldspots, and found that the SVM classifier performs best, with an accuracy of ~92%, which is also the highest among the models compared. Feature importance analysis combined with prediction results shows that epigenetic marks and the variation of sequence-based features along the hotspots contribute dominantly to hotspot identification. By using an incremental feature selection method, an optimal feature subset consisting of far fewer features was obtained without sacrificing prediction accuracy.
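GC-skew, one of the DNA-based features mentioned, is conventionally defined as (G - C) / (G + C). The sketch below computes it for a whole sequence and in sliding windows, which gives a simple profile of how the feature varies along a region. The function names, window size, and step are illustrative assumptions, not parameters from the study.

```python
def gc_skew(seq):
    """GC-skew (G - C) / (G + C); returns 0.0 when the sequence has no G or C."""
    seq = seq.upper()
    g, c = seq.count("G"), seq.count("C")
    return (g - c) / (g + c) if g + c else 0.0


def windowed_gc_skew(seq, window, step):
    """GC-skew of each full-length window, moving along seq by step bases."""
    return [
        gc_skew(seq[i:i + window])
        for i in range(0, len(seq) - window + 1, step)
    ]


if __name__ == "__main__":
    print(gc_skew("GGGC"))
    print(windowed_gc_skew("GGCCGG", window=2, step=2))
```

The windowed values can be used directly as a feature vector, one entry per window, for a classifier such as an SVM.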
ABSTRACT
The 2016 World Health Organization brain tumor classification is based on genomic and molecular profile of tumor tissue. These characteristics have improved understanding of the brain tumor and played an important role in treatment planning and prognostication. There is an ongoing effort to develop noninvasive imaging techniques that provide insight into tissue characteristics at the cellular and molecular levels. This article focuses on the molecular characteristics of gliomas, transcriptomic subtypes, and radiogenomic studies using semantic and radiomic features. The limitations and future directions of radiogenomics as a standalone diagnostic tool also are discussed.
Subjects
Brain Neoplasms/diagnostic imaging , Brain Neoplasms/genetics , Diagnostic Imaging/methods , Glioma/diagnostic imaging , Glioma/genetics , Imaging Genomics/methods , Brain/diagnostic imaging , Humans
ABSTRACT
The goal of this study was to examine the optimal strategies for the recognition of gait phase based on surface electromyogram (sEMG) of leg muscles while children with cerebral palsy (CP) walked on a treadmill. Ten children with CP were recruited to participate in this study. sEMG from eight leg muscles and leg position signals were recorded while subjects walked on a treadmill. The position signals of left and right legs were used to develop a five gait sub-phases classifier, i.e., mid stance, terminal stance, pre-swing, mid swing, and terminal swing. Seven feature sets of sEMG signals were tested in recognizing the five gait sub-phases of children with CP. Results from this study indicated that the recognition performance of mean absolute value and zero crossing was better than that with other feature sets when using support vector machine (average classification accuracy was 89.40%). Further, we found that the performance of gait phase recognition is relatively better in pre-swing than other sub-phases, and the performance of gait phase recognition is relatively poorer in mid-swing than other sub-phases. Results from this study may be used to develop an intention-driven robotic gait training system/paradigm for assisting walking in children with CP through robotic training.
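The two best-performing sEMG features named here, mean absolute value and zero crossing, are simple time-domain statistics computed per analysis window. The sketch below uses their common textbook definitions; the function names and the optional amplitude threshold on zero crossings are assumptions, since the abstract does not give the exact formulas or window length used.

```python
def mean_absolute_value(window):
    """Mean absolute value (MAV) of one window of sEMG samples."""
    if not window:
        raise ValueError("window must not be empty")
    return sum(abs(v) for v in window) / len(window)


def zero_crossings(window, threshold=0.0):
    """Count sign changes between consecutive samples.

    A crossing is counted only if the two samples differ by at least
    threshold, which suppresses crossings caused by low-level noise.
    """
    count = 0
    for prev, cur in zip(window, window[1:]):
        if prev * cur < 0 and abs(prev - cur) >= threshold:
            count += 1
    return count


def feature_vector(channels, threshold=0.0):
    """Concatenate MAV and zero-crossing counts over all muscle channels."""
    features = []
    for window in channels:
        features.append(mean_absolute_value(window))
        features.append(zero_crossings(window, threshold))
    return features


if __name__ == "__main__":
    window = [1.0, -1.0, 2.0, -2.0, 0.5]
    print(mean_absolute_value(window), zero_crossings(window))
```

With eight muscle channels, `feature_vector` would yield sixteen values per window, which could then be passed to a classifier such as a support vector machine.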
Subjects
Cerebral Palsy/physiopathology , Gait/physiology , Adolescent , Child , Electromyography , Humans , Leg , Skeletal Muscle/physiology
ABSTRACT
We present a skin lesion diagnosis system that segments the lesion and classifies it as melanoma or nonmelanoma. The proposed system can handle skin lesion images acquired by standard consumer-grade cameras and dermoscopes. To suppress image artifacts and enhance the lesion area, we propose an illumination correction strategy that consists of filtering in the frequency and spatial domains. We introduce a hybrid model for lesion segmentation, which forms texture segments of the illumination-corrected image using a factorization technique. Then, based on the texture distinctiveness of the corrected and the texture-segmented images, saliency maps are computed and combined to decide the lesion texture segments. To classify the segmented lesion, we propose a multimodal feature set composed of texture-, shape-, and color-based features. Classification performance of the multimodal features is evaluated using support vector machine, decision tree, and Mahalanobis distance classifiers. We evaluate the performance of the proposed system qualitatively and quantitatively. For the consumer-grade camera skin image dataset and the ISIC 2017 dermoscopic image dataset, the average segmentation accuracies are 98.4% and 95.4%, respectively; the classification accuracies are 98.06% and 93.95%, respectively.
ABSTRACT
Complicated stages of diabetes are the major cause of Diabetic Retinopathy (DR), and no symptoms appear at the initial stage of DR. Early-stage diagnosis, screening, and treatment of DR may reduce vision loss. In this work, an automated technique is applied for the detection and classification of DR. A local contrast enhancement method is used on grayscale images to enhance the region of interest. An adaptive threshold method with mathematical morphology is used for accurate segmentation of the lesion region. After that, the geometrical and statistical features are fused for better classification. The proposed method is validated on DIARETDB1, E-ophtha, Messidor, and local data sets with different metrics such as area under the curve (AUC) and accuracy (ACC).
Subjects
Laboratory Automation/methods , Diabetic Retinopathy/diagnosis , Diabetic Retinopathy/pathology , Optical Imaging/methods , Severity of Illness Index , Biometry/methods , Humans
ABSTRACT
Chronic kidney disease (CKD) is a life-threatening disease. Early detection and proper management are needed to improve survival. The UCI data set provides 24 attributes for predicting CKD or non-CKD. At least 16 of these attributes require pathological investigations, which involve more resources, money, time, and uncertainty. The objective of this work is to explore whether CKD or non-CKD can be predicted with reasonable accuracy using fewer features. An intelligent system development approach was used in this study. We applied a feature selection technique to discover a reduced set of features that explains the data set much better. Two intelligent binary classification techniques were adopted to validate the reduced feature set. Performance was evaluated in terms of four classification evaluation parameters. Our results suggest that concentrating on the reduced features when identifying CKD reduces uncertainty, saves time, and reduces costs.
ABSTRACT
BACKGROUND: Molecular descriptors have been widely used to predict biological activities and physicochemical properties or to analyze chemical libraries on the basis of similarity. Although fingerprints and properties are generally used as descriptors, neither is perfect for these purposes. A fingerprint can distinguish between molecules in cases where a property cannot, and vice versa. When the training set is especially small, the construction of good predictive models is difficult. Herein, a novel descriptor integrating mutually compensating fingerprint and property characteristics is described. The format of this descriptor is not conventional: it has two dimensions, with variable length in one dimension, to represent one molecule. This format cannot be used directly by machine learning methods. Therefore, the distance between molecules has been newly defined for application to machine learning techniques. The evaluation of this descriptor, as applied to classification tasks, was performed using a support vector machine after the features of the descriptor had been optimized by a genetic algorithm. RESULTS: Because feature optimization is time-intensive owing to the complicated calculation of distances between molecules, the optimization had to be stopped before it was completed. As a result, no remarkable improvement was observed in the classification results for the new descriptor compared with those for other descriptors in any evaluation set used in this work. However, extremely low accuracies were also not found for any set. CONCLUSIONS: The novel descriptor proposed in this work can potentially be used to make highly accurate predictive models. This new concept in descriptors is expected to be useful for developing novel predictive methods with quick training and high accuracy.
ABSTRACT
Automatic and accurate identification of rolling bearing fault categories, especially for the fault severities and compound faults, is a challenge in rotating machinery fault diagnosis. For this purpose, a novel method called adaptive deep belief network (DBN) with dual-tree complex wavelet packet (DTCWPT) is developed in this paper. DTCWPT is used to preprocess the vibration signals to refine the fault characteristics information, and an original feature set is designed from each frequency-band signal of DTCWPT. An adaptive DBN is constructed to improve the convergence rate and identification accuracy with multiple stacked adaptive restricted Boltzmann machines (RBMs). The proposed method is applied to the fault diagnosis of rolling bearings. The results confirm that the proposed method is more effective than the existing methods.
ABSTRACT
This chapter introduces a new method for knowledge extraction from databases for the purpose of finding a discriminative set of features that is also a robust set for within-class classification. Our method is generic and we introduce it here in the field of breast cancer diagnosis from digital mammography data. The mathematical formalism is based on a generalization of the k-Feature Set problem called the (α, β)-k-Feature Set problem, introduced by Cotta and Moscato (J Comput Syst Sci 67(4):686-690, 2003). This method proceeds in two steps: first, an optimal (α, β)-k-feature set of minimum cardinality is identified and then, a set of classification rules using these features is obtained. We obtain the (α, β)-k-feature set in two phases: first, a series of extremely powerful reduction techniques, which do not lose the optimal solution, are employed; and second, a metaheuristic search is used to identify the remaining features to be considered or disregarded. Two algorithms were tested with a public domain digital mammography dataset composed of 71 malignant and 75 benign cases. Based on the results provided by the algorithms, we obtain classification rules that employ only a subset of these features.
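To make the feasibility condition concrete, the sketch below checks whether a given feature subset satisfies the two requirements as the problem is commonly stated, which is an assumption here because the abstract does not restate the definition: every pair of samples from different classes must differ on at least α of the chosen features, and every pair from the same class must agree on at least β of them. The function name and the toy data are hypothetical, and the sketch only verifies a candidate; it does not perform the reduction or metaheuristic search.

```python
from itertools import combinations


def is_alpha_beta_k_feature_set(samples, labels, subset, alpha, beta):
    """Check the (alpha, beta) conditions for a chosen set of feature indices.

    samples: list of equal-length tuples of discrete feature values.
    labels:  class label of each sample.
    subset:  indices of the k selected features.
    """
    for i, j in combinations(range(len(samples)), 2):
        differing = sum(samples[i][f] != samples[j][f] for f in subset)
        agreeing = len(subset) - differing
        if labels[i] != labels[j]:
            if differing < alpha:  # pair from different classes not separated enough
                return False
        elif agreeing < beta:      # pair from the same class not similar enough
            return False
    return True


if __name__ == "__main__":
    samples = [(0, 0, 1), (0, 1, 1), (1, 0, 0), (1, 1, 0)]
    labels = ["benign", "benign", "malignant", "malignant"]
    print(is_alpha_beta_k_feature_set(samples, labels, [0, 2], alpha=2, beta=2))
    print(is_alpha_beta_k_feature_set(samples, labels, [1], alpha=1, beta=0))
```

Setting β to 0 and α to 1 reduces the check to the plain k-Feature Set condition, where the chosen features only need to tell apart samples of different classes.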