1.
Entropy (Basel); 22(1), 2019 Dec 30.
Article in English | MEDLINE | ID: mdl-33285823

ABSTRACT

We propose a quantitative, text-based approach for measuring the morphological complexity of a language. Several corpus-based methods have focused on measuring the different word forms that a language can produce. We take into account not only the productivity of morphological processes but also their predictability. We use a language model that predicts the probability of sub-word sequences within a word; we calculate the entropy rate of this model and use it as a measure of the predictability of the internal structure of words. Our results show that it is important to integrate these two dimensions when measuring morphological complexity, since a language can be complex under one measure but simpler under another. We calculated the complexity measures on two different parallel corpora for a typologically diverse set of languages. Our approach is corpus-based and does not require linguistically annotated data.
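A minimal sketch of the entropy-rate idea, assuming a character-level bigram model as a stand-in for the sub-word language model (the paper's actual model, segmentation, and corpora are not specified here; the function name and example word lists are illustrative only):

```python
import math
from collections import Counter

def entropy_rate(words, boundary="#"):
    """Entropy rate (bits per character) of a character bigram model trained on
    word-internal sequences -- a toy stand-in for a sub-word language model."""
    bigrams, contexts = Counter(), Counter()
    for w in words:
        chars = [boundary] + list(w) + [boundary]
        for a, b in zip(chars, chars[1:]):
            bigrams[(a, b)] += 1
            contexts[a] += 1
    # Average negative log-probability per transition = entropy rate estimate.
    total, log_sum = 0, 0.0
    for (a, b), n in bigrams.items():
        p = n / contexts[a]          # conditional probability P(b | a)
        log_sum += n * math.log2(p)
        total += n
    return -log_sum / total

# More predictable internal word structure -> lower entropy rate.
print(entropy_rate(["cantaba", "cantabas", "cantabamos", "cantaban"]))
print(entropy_rate(["run", "jump", "fly", "sit"]))
```

Under this reading, high word-form productivity can coexist with low entropy rate if the morphological processes are regular, which is why the two dimensions need to be measured separately.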

2.
Linguist Vanguard; 9(Suppl 1): 9-25, 2023 May.
Article in English | MEDLINE | ID: mdl-37275745

ABSTRACT

In linguistics, there is little consensus on how to define, measure, and compare complexity across languages. We propose to take the diversity of viewpoints as a given and to capture the complexity of a language by a vector of measurements rather than a single value. We then assess the statistical support for two controversial hypotheses: the trade-off hypothesis and the equi-complexity hypothesis. We furnish meta-analyses of 28 complexity metrics applied to texts written in a total of 80 typologically diverse languages. The trade-off hypothesis is partially supported, in the sense that around one third of the significant correlations between measures are negative. The equi-complexity hypothesis, on the other hand, is largely confirmed. While we find evidence for complexity differences in the domains of morphology and syntax, the overall complexity vectors of languages turn out to be virtually indistinguishable.
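A toy illustration of the vector-of-measurements idea, with invented metric names and values (not the paper's 28 metrics or 80 languages): each language is a row, each metric a column, and a trade-off shows up as a significant negative correlation between two metric columns.

```python
import numpy as np
from scipy.stats import spearmanr

# Rows: languages; columns: complexity metrics (hypothetical values).
metrics = ["type_token_ratio", "word_entropy", "mean_word_len", "mean_sent_len"]
X = np.array([
    [0.61, 9.8, 7.2, 14.1],
    [0.43, 8.9, 5.1, 18.7],
    [0.55, 9.4, 6.3, 16.0],
    [0.38, 8.5, 4.8, 21.3],
])

# Pairwise correlations between metrics across languages; negative significant
# correlations are the signature of a complexity trade-off.
negative, significant = 0, 0
for i in range(len(metrics)):
    for j in range(i + 1, len(metrics)):
        rho, p = spearmanr(X[:, i], X[:, j])
        if p < 0.05:
            significant += 1
            negative += rho < 0
        print(f"{metrics[i]} vs {metrics[j]}: rho={rho:.2f}, p={p:.2f}")
print(f"negative share of significant correlations: {negative}/{max(significant, 1)}")
```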

3.
Front Artif Intell; 5: 995667, 2022.
Article in English | MEDLINE | ID: mdl-36530357

ABSTRACT

Little attention has been paid to the development of human language technology for truly low-resource languages, i.e., languages with limited amounts of digitally available text data, such as Indigenous languages. However, it has been shown that pretrained multilingual models are able to perform crosslingual transfer in a zero-shot setting, even for low-resource languages that are unseen during pretraining. Yet, prior work evaluating performance on unseen languages has largely been limited to shallow token-level tasks. It remains unclear whether zero-shot learning of deeper semantic tasks is possible for unseen languages. To explore this question, we present AmericasNLI, a natural language inference dataset covering 10 Indigenous languages of the Americas. We conduct experiments with pretrained models, exploring zero-shot learning in combination with model adaptation. Furthermore, as AmericasNLI is a multiway parallel dataset, we use it to benchmark the performance of different machine translation models for those languages. Finally, using a standard transformer model, we explore translation-based approaches for natural language inference. We find that the zero-shot performance of pretrained models without adaptation is poor for all languages in AmericasNLI, but model adaptation via continued pretraining results in improvements. All machine translation models are rather weak, but, surprisingly, translation-based approaches to natural language inference outperform all other models on that task.
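A hedged sketch of the zero-shot NLI setup described above, using the Hugging Face transformers library; the checkpoint name and the example sentence pair are assumptions for illustration, not the models or data evaluated in the paper:

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# An XNLI-finetuned multilingual encoder (assumed checkpoint, not the paper's).
MODEL = "joeddav/xlm-roberta-large-xnli"
tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForSequenceClassification.from_pretrained(MODEL)

def nli(premise: str, hypothesis: str) -> dict:
    """Classify a premise-hypothesis pair as entailment / neutral / contradiction."""
    inputs = tokenizer(premise, hypothesis, return_tensors="pt", truncation=True)
    with torch.no_grad():
        logits = model(**inputs).logits
    probs = torch.softmax(logits, dim=-1).squeeze().tolist()
    labels = [model.config.id2label[i] for i in range(len(probs))]
    return dict(zip(labels, probs))

# In the zero-shot setting, the premise and hypothesis would come from one of
# the unseen Indigenous languages rather than these English placeholders.
print(nli("A dog is running in the park.", "An animal is outdoors."))
```

In the translation-based variant, the premise and hypothesis would first be machine-translated into a high-resource language before being passed to the classifier.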
