Search | BVS España Search Portal

iSentenizer-µ: multilingual sentence boundary detection model.

Wong, Derek F; Chao, Lidia S; Zeng, Xiaodong.

ScientificWorldJournal ; 2014: 196574, 2014.

Article in English | MEDLINE | ID: mdl-24883358

ABSTRACT

Sentence boundary detection (SBD) systems are normally quite sensitive to the genre of the data on which they are trained. Genre here often refers to shifts in text topic and to new language domains. Although new detection models can be retrained for different languages or new text genres, the previous model has to be thrown away and the creation process restarted from scratch. In this paper, we present a multilingual sentence boundary detection system (iSentenizer-µ) for Danish, German, English, Spanish, Dutch, French, Italian, Portuguese, Greek, Finnish, and Swedish. The proposed system is able to detect sentence boundaries in a mixture of different text genres and languages with high accuracy. We employ the i+Learning algorithm, an incremental tree learning architecture, to construct the system. Under the incremental learning framework, iSentenizer-µ adapts to text of different topics and Roman-alphabet languages by merging new data into the existing model, learning the new knowledge incrementally by revision instead of retraining. The system has been extensively evaluated on different languages and text genres and compared against two state-of-the-art SBD systems, Punkt and MaxEnt. The experimental results show that the proposed system outperforms the other systems on all datasets.

Subject(s)

Natural Language Processing, Decision Trees, Language, Linguistics, Theoretical Models, Translating

Constructing better classifier ensemble based on weighted accuracy and diversity measure.

Zeng, Xiaodong; Wong, Derek F; Chao, Lidia S.

ScientificWorldJournal ; 2014: 961747, 2014.

Article in English | MEDLINE | ID: mdl-24672402

ABSTRACT

A weighted accuracy and diversity (WAD) method is presented: a novel measure for evaluating the quality of a classifier ensemble, which assists in the ensemble selection task. The proposed measure is motivated by a commonly accepted hypothesis, namely that in a robust classifier ensemble each member should be not only accurate but also different from every other member. In fact, accuracy and diversity constrain each other: an ensemble with high accuracy may have low diversity, and an overly diverse ensemble may negatively affect accuracy. This study proposes a method to find the balance between accuracy and diversity that enhances the predictive ability of an ensemble on unknown data. The quality of an ensemble is assessed by computing the harmonic mean of accuracy and diversity as the final score, with two weight parameters used to balance them. The measure is compared with two representative measures, Kappa-Error and GenDiv, and with two threshold measures that consider only accuracy or only diversity, using two heuristic search algorithms (a genetic algorithm and a forward hill-climbing algorithm) in ensemble selection tasks performed on 15 UCI benchmark datasets. The empirical results demonstrate that the WAD measure is superior to the others in most cases.
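The abstract describes the final score as a harmonic mean of accuracy and diversity balanced by two weights. A minimal sketch of one common form of weighted harmonic mean follows; the function name, the parameter names, and the exact weighting form are assumptions for illustration, since the abstract does not give the paper's formula.

```python
def weighted_harmonic_mean(accuracy, diversity, w_acc=1.0, w_div=1.0):
    """Weighted harmonic mean of two scores in (0, 1].

    Hypothetical helper: the weighting form is an assumption,
    not the formula from the paper.
    """
    # A harmonic mean is undefined for zero or negative scores;
    # this sketch treats that case as the worst possible score.
    if accuracy <= 0 or diversity <= 0:
        return 0.0
    return (w_acc + w_div) / (w_acc / accuracy + w_div / diversity)


# An ensemble that is accurate but not diverse scores lower than a balanced one.
balanced = weighted_harmonic_mean(0.8, 0.8)
lopsided = weighted_harmonic_mean(0.9, 0.3)
print(balanced > lopsided)
```

The harmonic mean is dominated by the smaller of the two values, so an ensemble cannot compensate for very low diversity with high accuracy alone, which matches the balance the abstract describes.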

Subject(s)

Theoretical Models, Algorithms, Humans

Unsupervised chunking based on graph propagation from bilingual corpus.

Zhu, Ling; Wong, Derek F; Chao, Lidia S.

ScientificWorldJournal ; 2014: 401943, 2014.

Article in English | MEDLINE | ID: mdl-24772017

ABSTRACT

This paper presents a novel approach to building an unsupervised shallow parsing model trained on the unannotated Chinese text of a parallel Chinese-English corpus. In this approach, no information from the Chinese side is applied. The use of graph-based label propagation for bilingual knowledge transfer, together with the use of the projected labels as features in the unsupervised model, contributes to better performance. Experimental comparisons with state-of-the-art algorithms show that the proposed approach achieves markedly higher accuracy in terms of F-score.

Subject(s)

Artificial Intelligence, Language, Theoretical Models, Algorithms

A relationship: word alignment, phrase table, and translation quality.

Tian, Liang; Wong, Derek F; Chao, Lidia S; Oliveira, Francisco.

ScientificWorldJournal ; 2014: 438106, 2014.

Article in English | MEDLINE | ID: mdl-24883402

ABSTRACT

In recent years, researchers have conducted several studies to evaluate machine translation quality based on the relationship between word alignments and the phrase table. However, existing methods usually employ ad hoc heuristics without theoretical support. So far, no work has provided a formula to describe the relationship among word alignments, the phrase table, and machine translation performance. In this paper, on the one hand, we focus on formulating such a relationship for estimating the number of extracted phrase pairs given one or more word alignment points. On the other hand, a corpus-motivated pruning technique is proposed to prune the large default phrase table. Experiments show that the deduced formula is feasible: it can be used to predict the size of the phrase table, and it can also serve as a valuable reference for investigating the relationship between translation performance and phrase tables based on different word alignment links. The corpus-motivated pruning results show that nearly 98% of phrases can be removed without any significant loss in translation quality.

Subject(s)

Natural Language Processing, Translating, Language, Linguistics/methods

A systematic comparison of data selection criteria for SMT domain adaptation.

Wang, Longyue; Wong, Derek F; Chao, Lidia S; Lu, Yi; Xing, Junwen.

ScientificWorldJournal ; 2014: 745485, 2014.

Article in English | MEDLINE | ID: mdl-24683356

ABSTRACT

Data selection has shown significant improvements in the effective use of training data by extracting sentences from large general-domain corpora to adapt statistical machine translation (SMT) systems to in-domain data. This paper performs an in-depth analysis of three different sentence selection techniques. The first is cosine tf-idf, which comes from the field of information retrieval (IR). The second is a perplexity-based approach, which comes from the field of language modeling. These two data selection techniques have already been applied to SMT in the literature. Edit distance, however, is proposed for this task for the first time in this paper. After investigating the individual models, a combination of all three techniques is proposed at both the corpus level and the model level. Comparative experiments are conducted on a Hong Kong law Chinese-English corpus, and the results indicate the following: (i) the degree of constraint in similarity measurement is not monotonically related to domain-specific translation quality; (ii) the individual selection models fail to perform effectively and robustly; but (iii) bilingual resources and combination methods help to balance out-of-vocabulary (OOV) words and irrelevant data; (iv) finally, our method achieves the goal of consistently boosting overall translation performance, which can ensure optimal quality in a real-life SMT system.
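Of the three selection criteria named above, edit distance is the simplest to illustrate. The sketch below ranks general-domain sentences by their best word-level Levenshtein similarity to any in-domain sentence. It is an assumption-laden illustration, not the paper's procedure: the function names, the whitespace tokenization, and the length normalization are all hypothetical choices.

```python
def edit_distance(a, b):
    """Levenshtein distance between two sequences (strings or token lists)."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            cost = 0 if x == y else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def similarity(a, b):
    """Length-normalized similarity in [0, 1]; the normalization is an assumption."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - edit_distance(a, b) / longest


def select_sentences(general, in_domain, top_k):
    """Keep the top_k general-domain sentences closest to any in-domain sentence.

    Assumes in_domain is non-empty.
    """
    scored = []
    for sentence in general:
        tokens = sentence.split()
        best = max(similarity(tokens, ref.split()) for ref in in_domain)
        scored.append((best, sentence))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [sentence for _, sentence in scored[:top_k]]


general = ["the court shall decide", "i like pizza"]
in_domain = ["the court shall rule"]
print(select_sentences(general, in_domain, top_k=1))
```

Comparing every general-domain sentence with every in-domain sentence is quadratic in corpus size, so a real system would need pruning or indexing; the sketch only shows the scoring idea.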

Subject(s)

Artificial Intelligence, Theoretical Models

Chinese unknown word recognition for PCFG-LA parsing.

Huang, Qiuping; He, Liangye; Wong, Derek F; Chao, Lidia S.

ScientificWorldJournal ; 2014: 959328, 2014.

Article in English | MEDLINE | ID: mdl-24895681

ABSTRACT

This paper investigates the recognition of unknown words in Chinese parsing. Two methods are proposed to handle this problem. The first is a modification of a character-based model: we model the emission probability of an unknown word using the first and last characters of the word. It aims to reduce the POS tag ambiguity of unknown words and thereby improve parsing performance. The second is a novel method that uses graph-based semisupervised learning (SSL) to improve the syntactic parsing of unknown words. Its goal is to discover additional lexical knowledge from a large amount of unlabeled data to help syntactic parsing. The method propagates lexical emission probabilities to unknown words by building similarity graphs over the words of the labeled and unlabeled data. The derived distributions are incorporated into the parsing process. The proposed methods are effective in handling unknown words and improving parsing. Empirical results on the Penn Chinese Treebank and the TCT Treebank demonstrate their effectiveness.

Subject(s)

Vocabulary, Asian People, Humans, Language, Recognition (Psychology)

Unsupervised quality estimation model for English to German translation and its application in extensive supervised evaluation.

Han, Aaron L-F; Wong, Derek F; Chao, Lidia S; He, Liangye; Lu, Yi.

ScientificWorldJournal ; 2014: 760301, 2014.

Article in English | MEDLINE | ID: mdl-24892086

ABSTRACT

With the rapid development of machine translation (MT), MT evaluation has become very important for telling us, in a timely manner, whether an MT system is making progress. Conventional MT evaluation methods tend to calculate the similarity between hypothesis translations produced by automatic translation systems and reference translations produced by professional translators. Existing evaluation metrics have several weaknesses. First, the factors they are designed around are not comprehensive, which results in a language-bias problem: they perform well on certain language pairs but poorly on others. Second, they tend to use either no linguistic features or too many: using none draws considerable criticism from linguists, and using too many weakens the repeatability of the model. Third, the reference translations they rely on are very expensive and sometimes unavailable in practice. In this paper, the authors propose an unsupervised MT evaluation metric that uses a universal part-of-speech tagset and does not rely on reference translations. The authors also explore the performance of the designed metric on traditional supervised evaluation tasks. Both the supervised and unsupervised experiments show that the designed methods yield higher correlation scores with human judgments.

Subject(s)

Theoretical Models, Translating, England, Germany

A supportive attribute-assisted discretization model for medical classification.

Wong, Derek F; Chao, Lidia S; Zeng, Xiao Dong.

Biomed Mater Eng ; 24(1): 289-95, 2014.

Article in English | MEDLINE | ID: mdl-24211909

ABSTRACT

Discretization of a continuous-valued symptom (attribute) in a medical data set is a crucial preprocessing step for the medical classification task. This paper proposes a supportive attribute-assisted discretization (SAAD) model for medical diagnostic problems. The intent of this approach is to discover the best supportive symptom, the one that correlates most closely with the continuous-valued symptom being discretized, and to conduct the discretization using the significant supportive information that this symptom provides. We hypothesize that a good discretization scheme should rely heavily on the interaction between a continuous-valued attribute and both its supportive attribute and the class attribute. SAAD can treat each continuous-valued symptom differently and intelligently, which allows it to minimize information loss and data uncertainty. Hence, SAAD results in higher classification accuracy. Empirical experiments using ten real-life datasets from the UCI repository were conducted to compare the classification accuracy achieved by several well-known classifiers with SAAD and with other state-of-the-art discretization approaches. The experimental results demonstrate the effectiveness and usefulness of the proposed approach in enhancing diagnostic accuracy.

Subject(s)

Computational Biology/methods, Disease/classification, Software, Algorithms, Bayes Theorem, Data Mining, Factual Databases, Computer-Assisted Diagnosis, Humans, Theoretical Models, Reproducibility of Results

Time series for blind biosignal classification model.

Wong, Derek F; Chao, Lidia S; Zeng, Xiaodong; Vai, Mang-I; Lam, Heng-Leong.

Comput Biol Med ; 54: 32-6, 2014 Nov.

Article in English | MEDLINE | ID: mdl-25199847

ABSTRACT

Biosignals such as electrocardiograms (ECG), electroencephalograms (EEG), and electromyograms (EMG) are important noninvasive measurements useful for making diagnostic decisions. Recently, considerable research has been conducted to automate signal classification for assisting in disease diagnosis. However, the biosignal type (ECG, EEG, EMG, or other) needs to be known prior to the classification process. If the given biosignal is of an unknown type, none of the existing methodologies can be used. In this paper, a blind biosignal classification model (B²SC Model) is proposed to identify the source biosignal type automatically and thus ultimately benefit the diagnostic decision. The approach employs time series algorithms to construct the model. It uses a dynamic time warping (DTW) algorithm with clustering to discover the similarity between two biosignals, and consequently classifies disease without prior knowledge of the source signal type. The empirical experiments presented in this paper demonstrate the effectiveness of the method as well as the scalability of the approach.
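Dynamic time warping, the similarity measure named in this abstract, can be sketched with the textbook dynamic-programming recurrence. The example below is an illustration under stated assumptions, not the B²SC model: it uses absolute difference as the local cost, and it swaps in a simple nearest-neighbour lookup in place of the clustering step the abstract mentions. The function names and toy signals are hypothetical.

```python
import math


def dtw_distance(s, t):
    """Classic DTW cost between two numeric sequences (absolute-difference cost)."""
    n, m = len(s), len(t)
    # cost[i][j] = minimal cumulative cost of aligning s[:i] with t[:j]
    cost = [[math.inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(s[i - 1] - t[j - 1])
            cost[i][j] = d + min(cost[i - 1][j],      # s advances alone
                                 cost[i][j - 1],      # t advances alone
                                 cost[i - 1][j - 1])  # both advance
    return cost[n][m]


def nearest_type(signal, labelled):
    """Return the label of the reference signal with the smallest DTW cost.

    Nearest-neighbour stand-in for the clustering step (an assumption).
    """
    return min(labelled, key=lambda item: dtw_distance(signal, item[1]))[0]


# Toy reference signals; real biosignals would be much longer.
references = [("ECG", [0, 1, 0, 1]), ("EMG", [5, 5, 5, 5])]
print(nearest_type([0, 0, 1, 0, 1], references))
```

Because DTW lets one sequence stretch against the other, the query above aligns with the "ECG" reference despite its extra leading sample, which is why the measure suits signals of differing length or rate.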

Subject(s)

Algorithms, Artificial Intelligence, Computer-Assisted Diagnosis/methods, Electrodiagnosis/methods, Statistical Models, Automated Pattern Recognition/methods, Computer-Assisted Signal Processing, Animals, Computer Simulation, Statistical Data Interpretation, Humans
