Results 1 - 15 of 15
1.
PLoS One ; 18(4): e0284560, 2023.
Article in English | MEDLINE | ID: mdl-37079543

ABSTRACT

In this paper, we create EMIR, the first-ever Music Information Retrieval dataset for Ethiopian music. EMIR is freely available for research purposes and contains 600 sample recordings of Orthodox Tewahedo chants, traditional Azmari songs and contemporary Ethiopian secular music. Each sample is classified by five expert judges into one of four well-known Ethiopian Kiñits: Tizita, Bati, Ambassel and Anchihoye. Each Kiñit uses its own pentatonic scale and also has its own stylistic characteristics. Thus, Kiñit classification needs to combine scale identification with genre recognition. After describing the dataset, we present the Ethio Kiñits Model (EKM), based on VGG, for classifying the EMIR clips. In Experiment 1, we investigated whether Filterbank, Mel-spectrogram, Chroma, or Mel-frequency Cepstral coefficient (MFCC) features work best for Kiñit classification using EKM. MFCC was found to be superior and was therefore adopted for Experiment 2, where the performance of EKM models using MFCC was compared across three different audio sample lengths. A length of 3 s gave the best results. In Experiment 3, EKM and four existing models were compared on the EMIR dataset: AlexNet, ResNet50, VGG16 and LSTM. EKM was found to have the best accuracy (95.00%) as well as the fastest training time. However, the performance of VGG16 (93.00%) was found not to be significantly worse (P < 0.01). We hope this work will encourage others to explore Ethiopian music and to experiment with other models for Kiñit classification.


Subject(s)
Music , Singing , Humans , Benchmarking/classification , Ethiopia , Datasets as Topic/classification
2.
J Acad Nutr Diet ; 121(12): 2549-2559.e1, 2021 12.
Article in English | MEDLINE | ID: mdl-33903081

ABSTRACT

Using real-world data from the Academy of Nutrition and Dietetics Health Informatics Infrastructure, we use state-of-the-art clustering techniques to identify 2 phenotypes characterizing the episodes of nutrition care observed in the National Quality Improvement (NQI) registry data set. The 2 phenotypes identified from recorded Nutrition Care Process data in the NQI exhibit a strong correspondence with the clinical expertise of registered dietitian nutritionists. For one of these phenotypes, it was possible to implement state-of-the-art classification techniques to predict the nutrition problem-resolution status of an episode of care. Prediction results show that the assessment of nutrition history, number of recorded visits in the episode, and use of nutrition counseling interventions were significantly and positively correlated with problem resolution. Meanwhile, evaluations of nutrition history that were not within the desired ranges were significantly and negatively correlated with problem resolution. Finally, we assess the usefulness of the current NQI data set and data model for supporting the application of contemporary machine learning methods to the data set. We also suggest ways of enhancing the NQI since registered dietitian nutritionists are encouraged to continue to contribute patient cases in this and other registry nutrition studies.


Subject(s)
Datasets as Topic/classification , Dietetics/statistics & numerical data , Episode of Care , Machine Learning , Quality Improvement , Academies and Institutes , Humans , Medical Informatics
3.
J Med Syst ; 45(4): 45, 2021 Feb 23.
Article in English | MEDLINE | ID: mdl-33624190

ABSTRACT

We present a protocol for integrating two types of biological data - clinical and molecular - for more effective classification of patients with cancer. The proposed approach is a hybrid between early and late data integration strategies. In this hybrid protocol, the set of informative clinical features is extended by the classification results based on molecular data sets. The results are then treated as new synthetic variables. The hybrid protocol was applied to METABRIC breast cancer samples and TCGA urothelial bladder carcinoma samples. Various data types were used for clinical endpoint prediction: clinical data, gene expression, somatic copy number aberrations, RNA-Seq, methylation, and reverse phase protein array. The performance of the hybrid data integration was evaluated with a repeated cross-validation procedure and compared with other methods of data integration: early integration and late integration via super learning. The hybrid method gave similar results to those obtained by the best of the tested variants of super learning. Moreover, the hybrid method allowed for further sensitivity analysis and recursive feature elimination, which led to compact predictive models for cancer clinical endpoints. For breast cancer, the final model consists of eight clinical variables and two synthetic features obtained from molecular data. For urothelial bladder carcinoma, only two clinical features and one synthetic variable were necessary to build the best predictive model. We have shown that the inclusion of the synthetic variables based on the RNA expression levels and copy number alterations can lead to improved quality of prognostic tests. Thus, it should be considered for inclusion in wider medical practice.


Subject(s)
Algorithms , Data Management/methods , Datasets as Topic/classification , Databases, Chemical
6.
PLoS One ; 13(12): e0208433, 2018.
Article in English | MEDLINE | ID: mdl-30543662

ABSTRACT

Ordinal categorical responses are frequently collected in survey studies, human medicine, and animal and plant improvement programs, to name a few. Errors in this type of data are neither rare nor easy to detect. These errors tend to bias inference and reduce statistical power and, ultimately, the efficiency of the decision-making process. In contrast to the binary situation, where misclassification occurs between two response classes, noise in ordinal categorical data is more complex due to the increased number of categories and the diversity and asymmetry of errors. Although several approaches have been presented for dealing with misclassification in binary data, only a limited number of practical methods have been proposed to analyze noisy categorical responses. A latent variable model implemented within a Bayesian framework was proposed to analyze ordinal categorical data subject to misclassification using simulated and real datasets. The simulated scenario consisted of a discrete response with three categories and a symmetric error rate of 5% between any two classes. The real data consisted of calving ease records of beef cows. Using real and simulated data, ignoring misclassification resulted in substantial bias in the estimation of genetic parameters and a reduction in the accuracy of predicted breeding values. Using our proposed approach, a significant reduction in bias and an increase in accuracy ranging from 11% to 17% were observed. Furthermore, most of the misclassified observations (in the simulated data) were identified with a substantially higher probability. Similar results were observed for a scenario with asymmetric misclassification. While the extension to traits with more categories between adjacent classes is straightforward, it could be computationally costly. For traits with high heritability, the performance of the methodology would be expected to improve.
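
The simulated noise scenario in this abstract can be sketched roughly as follows. This is only an illustration under one reading of "a symmetric error rate of 5% between any two classes", namely that each true label is replaced by each of the other categories with probability 0.05 and otherwise kept. The function name and parameters are hypothetical, and the sketch covers only the noise injection, not the Bayesian latent variable model.

```python
import random


def add_symmetric_noise(labels, categories, rate, rng):
    """Return a noisy copy of `labels` (hypothetical helper).

    Assumption: each true label flips to each *other* category with
    probability `rate`, and stays unchanged with the remaining probability.
    """
    n_others = len(categories) - 1
    if rate <= 0 or rate * n_others > 1:
        raise ValueError("rate must be > 0 and rate * (k - 1) must be <= 1")
    noisy = []
    for y in labels:
        others = [c for c in categories if c != y]
        # Split [0, 1) into one slice of width `rate` per wrong category;
        # anything beyond those slices keeps the true label.
        k = int(rng.random() / rate)
        noisy.append(others[k] if k < n_others else y)
    return noisy


rng = random.Random(0)
true_labels = [1, 2, 3] * 5000
noisy_labels = add_symmetric_noise(true_labels, [1, 2, 3], 0.05, rng)
changed = sum(t != n for t, n in zip(true_labels, noisy_labels)) / len(true_labels)
print(f"fraction of labels changed: {changed:.3f}")
```

With three categories and this reading of the error rate, about 10% of the labels are expected to change.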


Subject(s)
Breeding/statistics & numerical data , Cattle , Models, Statistical , Animals , Bayes Theorem , Bias , Body Weight/physiology , Breeding/methods , Cattle/classification , Cattle/genetics , Datasets as Topic/classification , Datasets as Topic/statistics & numerical data , Female , Genetic Association Studies/statistics & numerical data , Genetic Association Studies/veterinary , Markov Chains , Meat/statistics & numerical data , Parturition/physiology , Phenotype , Physical Fitness , Pregnancy , Quantitative Trait, Heritable
7.
Comput Biol Chem ; 67: 92-101, 2017 Apr.
Article in English | MEDLINE | ID: mdl-28064045

ABSTRACT

Dimension reduction is a crucial technique in machine learning and data mining, and is widely used in medicine, bioinformatics and genetics. In this paper, we propose a two-stage local dimension reduction approach for classification on microarray data. In the first stage, a new L1-regularized feature selection method is defined to remove irrelevant and redundant features and to select the important features (biomarkers). In the next stage, PLS-based feature extraction is implemented on the selected features to extract synthesis features that best reflect discriminating characteristics for classification. The suitability of the proposal is demonstrated in an empirical study done with ten widely used microarray datasets, and the results show its effectiveness and competitiveness compared with four state-of-the-art methods. The experimental results on the St Jude dataset show that our method can be effectively applied to microarray data analysis for subtype prediction and the discovery of gene coexpression.


Subject(s)
Data Mining/methods , Datasets as Topic/classification , Microarray Analysis , Data Mining/statistics & numerical data , Linear Models
8.
Neural Netw ; 86: 69-79, 2017 Feb.
Article in English | MEDLINE | ID: mdl-27890606

ABSTRACT

In this paper, we extend our previous work on the Enhanced Fuzzy Min-Max (EFMM) neural network by introducing a new hyperbox selection rule and a pruning strategy to reduce network complexity and improve classification performance. Specifically, a new k-nearest hyperbox expansion rule (for selection of a new winning hyperbox) is first introduced to reduce the network complexity by avoiding the creation of too many small hyperboxes within the vicinity of the winning hyperbox. A pruning strategy is then deployed to further reduce the network complexity in the presence of noisy data. The effectiveness of the proposed network is evaluated using a number of benchmark data sets. The results compare favorably with those from other related models. The findings indicate that the newly introduced hyperbox winner selection rule, coupled with the pruning strategy, is useful for undertaking pattern classification problems.


Subject(s)
Datasets as Topic/classification , Fuzzy Logic , Neural Networks, Computer
9.
Neural Netw ; 81: 59-71, 2016 Sep.
Article in English | MEDLINE | ID: mdl-27351107

ABSTRACT

This paper studies the learning and generalization performance of pseudo-inverse linear discriminants (PILDs) based on the processing minimum sum-of-squared error (MS(2)E) and the targeting overall classification accuracy (OCA) criterion functions. There is little practical significance in proving the equivalence between a PILD with the desired outputs in reverse proportion to the number of class samples and an FLD with the totally projected mean thresholds. When the desired outputs of each class are assigned a fixed value, a PILD is partly equal to an FLD. With the customary desired outputs {1, -1}, a practicable threshold is acquired, which is related only to sample sizes. If the desired outputs of each sample are changeable, a PILD has nothing in common with an FLD. The optimal threshold may thus be singled out from multiple empirical ones related to sizes and distributed regions. Based on the processing MS(2)E criteria and the actual algebraic distances, an iterative learning strategy for PILDs is proposed, whose main advantages are a limited number of epochs, no learning rate, and no risk of divergence. Extensive experimental results on benchmark datasets have verified that the iterative PILDs with optimal thresholds have good learning and generalization performance, and even reach the top OCAs for some datasets among the existing classifiers.


Subject(s)
Datasets as Topic/classification , Linear Models , Algorithms , Databases, Factual/classification , Humans
10.
Article in English | MEDLINE | ID: mdl-27074759

ABSTRACT

Decision trees are renowned in the computational chemistry and machine learning communities for their interpretability. Their capacity and usage are somewhat limited by the fact that they normally work on categorical data. Improvements to known decision tree algorithms are usually carried out by increasing and tweaking parameters, as well as the post-processing of the class assignment. In this work we attempted to tackle both these issues. Firstly, conditional mutual information was used as the criterion for selecting the attribute on which to split instances. The algorithm performance was compared with the results of C4.5 (WEKA's J48) using default parameters and no restrictions. Two datasets were used for this purpose, DrugBank compounds for HRH1 binding prediction and Traditional Chinese Medicine formulation predicted bioactivities for therapeutic class annotation. Secondly, an automated binning method for continuous data was evaluated, namely Scott's normal reference rule, in order to allow any decision tree to easily handle continuous data. This was applied to all approved drugs in DrugBank for predicting the RDKit SLogP property, using the remaining RDKit physicochemical attributes as input.
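
As an illustration of the automated binning step mentioned in this abstract, Scott's normal reference rule sets the bin width to h = (24·sqrt(pi))^(1/3) · s · n^(-1/3), roughly 3.49 · s · n^(-1/3), where s is the sample standard deviation and n the number of observations. The following is a minimal sketch using only the Python standard library. The function names and sample values are hypothetical, and it is not the authors' implementation.

```python
import math
import statistics


def scott_bin_width(values):
    """Bin width from Scott's normal reference rule."""
    n = len(values)
    if n < 2:
        raise ValueError("need at least two observations")
    s = statistics.stdev(values)  # sample standard deviation
    return (24 * math.sqrt(math.pi)) ** (1 / 3) * s * n ** (-1 / 3)


def bin_values(values):
    """Map each continuous value to an integer bin index (hypothetical helper)."""
    h = scott_bin_width(values)
    if h == 0:
        return [0] * len(values)  # all values identical: a single bin
    lowest = min(values)
    return [int((v - lowest) // h) for v in values]


data = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0]
width = scott_bin_width(data)
bins = bin_values(data)
print(round(width, 3), bins)
```

The resulting bin indices can then be treated as categorical values by a decision tree that does not handle continuous attributes directly.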


Subject(s)
Algorithms , Decision Trees , Datasets as Topic/classification , Datasets as Topic/standards , Medicine, Chinese Traditional/methods , Pharmaceutical Preparations/classification
12.
Stud Health Technol Inform ; 216: 559-63, 2015.
Article in English | MEDLINE | ID: mdl-26262113

ABSTRACT

Clinical data warehouses often contain analogous data from disparate sources, resulting in heterogeneous formats and semantics. We have developed an approach that attempts to represent such phenotypic data in its most atomic form to facilitate aggregation. We illustrate this approach with human blood antigen typing (ABO-Rh) data drawn from the National Institutes of Health's Biomedical Translational Research Information System (BTRIS). In applying the method to actual patient data, we discovered a 2% incidence of changed blood types. We believe our approach can be applied to any institution's data to obtain comparable patient phenotypes. The actual discrepant blood type data will form the basis for a future study of the reasons for blood typing variation.


Subject(s)
ABO Blood-Group System , Blood Grouping and Crossmatching/statistics & numerical data , Blood Grouping and Crossmatching/standards , Datasets as Topic/statistics & numerical data , Electronic Health Records/statistics & numerical data , Health Information Systems/statistics & numerical data , Data Mining/methods , Datasets as Topic/classification , Datasets as Topic/standards , Electronic Health Records/classification , Electronic Health Records/standards , Health Information Systems/classification , Health Information Systems/standards , Humans , Medical Record Linkage/methods , Medical Record Linkage/standards , National Institutes of Health (U.S.) , Natural Language Processing , Phenotype , Reference Values , United States , Vocabulary, Controlled
13.
Article in German | MEDLINE | ID: mdl-26077872

ABSTRACT

A variety of rich terminology systems, such as thesauri, classifications, nomenclatures and ontologies support information and knowledge processing in health care and biomedical research. Nevertheless, human language, manifested as individually written texts, persists as the primary carrier of information, in the description of disease courses or treatment episodes in electronic medical records, and in the description of biomedical research in scientific publications. In the context of the discussion about big data in biomedicine, we hypothesize that the abstraction of the individuality of natural language utterances into structured and semantically normalized information facilitates the use of statistical data analytics to distil new knowledge out of textual data from biomedical research and clinical routine. Computerized human language technologies are constantly evolving and are increasingly ready to annotate narratives with codes from biomedical terminology. However, this depends heavily on linguistic and terminological resources. The creation and maintenance of such resources is labor-intensive. Nevertheless, it is sensible to assume that big data methods can be used to support this process. Examples include the learning of hierarchical relationships, the grouping of synonymous terms into concepts and the disambiguation of homonyms. Although clear evidence is still lacking, the combination of natural language technologies, semantic resources, and big data analytics is promising.


Subject(s)
Biological Ontologies/organization & administration , Datasets as Topic/classification , Datasets as Topic/statistics & numerical data , Natural Language Processing , Terminology as Topic , Vocabulary, Controlled , Data Accuracy , Germany , Information Storage and Retrieval/standards , Medical Record Linkage/standards
15.
ScientificWorldJournal ; 2014: 314728, 2014.
Article in English | MEDLINE | ID: mdl-24707201

ABSTRACT

Bankruptcy prediction is a vast area of finance and accounting whose importance lies in its relevance for creditors and investors in evaluating the likelihood of a company going bankrupt. As companies become more complex, they develop sophisticated schemes to hide their real situation. In turn, estimating the credit risks associated with counterparts or predicting bankruptcy becomes harder. Evolutionary algorithms have been shown to be an excellent tool for dealing with complex problems in finance and economics where a large number of irrelevant features are involved. This paper provides a methodology for feature selection in the classification of bankruptcy data sets using an evolutionary multiobjective approach that simultaneously minimises the number of features and maximises the classifier quality measure (e.g., accuracy). The proposed methodology makes use of self-adaptation by applying the feature selection algorithm while simultaneously optimising the parameters of the classifier used. The methodology was applied to four different sets of data. The obtained results showed the utility of using the self-adaptation of the classifier.
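
The two objectives named in this abstract (fewer features, higher accuracy) can be illustrated with a plain Pareto-dominance check. This is a minimal sketch, not the evolutionary algorithm of the paper. The function names are hypothetical and the candidate values are invented for illustration only.

```python
def dominates(a, b):
    """True if candidate `a` Pareto-dominates `b`.

    Each candidate is a (n_features, accuracy) pair; fewer features and
    higher accuracy are both preferred.
    """
    no_worse = a[0] <= b[0] and a[1] >= b[1]
    strictly_better = a[0] < b[0] or a[1] > b[1]
    return no_worse and strictly_better


def pareto_front(candidates):
    """Keep only the candidates that no other candidate dominates."""
    return [c for c in candidates
            if not any(dominates(other, c) for other in candidates)]


# Invented (n_features, accuracy) pairs, for illustration only.
candidates = [(10, 0.90), (5, 0.88), (5, 0.85), (12, 0.90), (3, 0.80)]
front = pareto_front(candidates)
print(front)
```

A multiobjective search returns such a front of trade-offs rather than a single best feature subset, leaving the final choice between compactness and accuracy to the analyst.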


Subject(s)
Algorithms , Bankruptcy/classification , Datasets as Topic/classification , Pattern Recognition, Automated/methods , Bankruptcy/trends , Datasets as Topic/trends , Forecasting , Humans , Pattern Recognition, Automated/trends