Journal: Natural Language Engineering
Languages: Afrikaans, Ancient Greek, Arabic, Armenian, Basque, Bulgarian, Buryat, Catalan, Chinese, Croatian, Czech, Danish, Dutch, English, Estonian, Finnish, French, Galician, German, Gothic, Greek, Hebrew, Hindi, Hungarian, Indonesian, Irish, Italian, Japanese, Kazakh, Korean, Kurmanji, Latin, Latvian, North Sami, Norwegian, Old Church Slavonic, Persian, Polish, Portuguese, Romanian, Russian, Serbian, Slovak, Slovenian, Spanish, Swedish, Turkish, Ukrainian, Upper Sorbian, Urdu, Uyghur, Vietnamese
Programming languages: Python
Format: CoNLL-U (https://universaldependencies.org/format.html)
Project website: https://turkunlp.org/Turku-neural-parser-pipeline/
A lemmatization method based on a sequence-to-sequence neural network architecture and a morphosyntactic context representation. This context-sensitive lemmatizer generates the lemma one character at a time from the characters of the surface form and the morphosyntactic features of the word, which are obtained from a morphological tagger. It outperforms all of the latest baseline systems (2020). Compared with the best overall baseline, it achieves better results on 62 of 76 treebanks and reduces errors by 19% relative on average. The lemmatizer, together with all trained models, is available as part of the Turku-neural-parser-pipeline under the Apache 2.0 license.
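The input representation described above can be illustrated with a minimal sketch. It is not the pipeline's own code: the function names and the exact tag-token encoding (for example `UPOS=NOUN`) are assumptions made for illustration. It shows how a single CoNLL-U token line could be turned into a source sequence of surface-form characters followed by morphosyntactic tags, which a sequence-to-sequence model would then map to the lemma characters.

```python
# Illustrative sketch only: the function names and the tag-token encoding are
# assumptions, not the actual Turku-neural-parser-pipeline implementation.

def parse_conllu_token(line):
    """Extract FORM, UPOS and FEATS from one tab-separated CoNLL-U token line."""
    cols = line.rstrip("\n").split("\t")
    # CoNLL-U columns: ID, FORM, LEMMA, UPOS, XPOS, FEATS, HEAD, DEPREL, DEPS, MISC
    return cols[1], cols[3], cols[5]


def build_source_sequence(form, upos, feats):
    """Build the model input: one token per character, then one token per tag."""
    tokens = list(form)              # surface form, character by character
    tokens.append(f"UPOS={upos}")    # part-of-speech tag as a single token
    if feats != "_":                 # "_" marks an empty FEATS column in CoNLL-U
        tokens.extend(feats.split("|"))
    return tokens


def join_lemma(target_tokens):
    """Join the characters generated by the decoder into the lemma string."""
    return "".join(target_tokens)


# Finnish "koiralla" (adessive singular of "koira", 'dog')
line = "1\tkoiralla\tkoira\tNOUN\t_\tCase=Ade|Number=Sing\t0\troot\t_\t_"
source = build_source_sequence(*parse_conllu_token(line))
print(" ".join(source))
```

For this token, the expected target sequence would be the characters `k o i r a`. Because the tags are part of the input, the same surface form can be mapped to different lemmas depending on its morphological analysis, which is what makes the approach context-sensitive.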