Year: 2020
Journal: Natural Language Engineering
Languages: Afrikaans, Ancient Greek, Arabic, Armenian, Basque, Bulgarian, Buryat, Catalan, Chinese, Croatian, Czech, Danish, Dutch, English, Estonian, Finnish, French, Galician, German, Gothic, Greek, Hebrew, Hindi, Hungarian, Indonesian, Irish, Italian, Japanese, Kazakh, Korean, Kurmanji, Latin, Latvian, North Sami, Norwegian, Old Church Slavonic, Persian, Polish, Portuguese, Romanian, Russian, Serbian, Slovak, Slovenian, Spanish, Swedish, Turkish, Ukrainian, Upper Sorbian, Urdu, Uyghur, Vietnamese
Programming languages: Python
Input data: Plain Text
Output data: CoNLL-U Format (https://universaldependencies.org/format.html)
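
As a rough illustration of the output format, the sketch below reads the FORM and LEMMA columns from CoNLL-U text. It relies only on the published CoNLL-U layout (ten tab-separated columns, `#` comment lines, blank lines between sentences). The function name `read_lemmas` and the sample sentence are hypothetical and are not part of the tool.

```python
def read_lemmas(conllu_text):
    """Return a list of sentences, each a list of (form, lemma) pairs."""
    sentences, current = [], []
    for line in conllu_text.splitlines():
        if not line.strip():
            # A blank line ends the current sentence.
            if current:
                sentences.append(current)
                current = []
            continue
        if line.startswith("#"):
            # Sentence-level comment (e.g. "# text = ..."), not a token.
            continue
        cols = line.split("\t")
        if len(cols) != 10:
            raise ValueError(f"expected 10 columns, got {len(cols)}: {line!r}")
        token_id = cols[0]
        # Skip multiword-token ranges (e.g. "1-2") and empty nodes (e.g. "1.1").
        if "-" in token_id or "." in token_id:
            continue
        current.append((cols[1], cols[2]))  # FORM, LEMMA
    if current:
        sentences.append(current)
    return sentences


# Hypothetical two-token sentence in CoNLL-U format.
sample = (
    "# text = Dogs barked\n"
    "1\tDogs\tdog\tNOUN\t_\tNumber=Plur\t2\tnsubj\t_\t_\n"
    "2\tbarked\tbark\tVERB\t_\tTense=Past\t0\troot\t_\t_\n"
    "\n"
)

if __name__ == "__main__":
    for sentence in read_lemmas(sample):
        print(sentence)
```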

Lemmatization method based on a sequence-to-sequence neural network architecture and a morphosyntactic context representation. This context-sensitive lemmatizer generates the lemma one character at a time from the characters of the surface form and the morphosyntactic features of the word, which are obtained from a morphological tagger. It outperforms all of the recent baseline systems it was compared against (as of 2020). Compared to the best overall baseline, it is more accurate on 62 out of 76 treebanks and reduces errors by 19% relative on average. The lemmatizer, together with all trained models, is available as part of the Turku-neural-parsing-pipeline under the Apache 2.0 license.
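
To make the input representation more concrete, here is a minimal sketch of how a word form and its tagger output could be turned into a single source sequence for a sequence-to-sequence model. The token layout (characters first, then `UPOS=`/`XPOS=` tags and individual features) and the function name `build_input_sequence` are assumptions made for illustration; this entry does not specify the exact encoding used by the tool.

```python
def build_input_sequence(form, upos, xpos="_", feats="_"):
    """Build a source sequence: surface characters followed by morphosyntactic tags.

    Assumed layout (illustrative only): each character of the word form is one
    token, followed by one token per tag. "_" marks an unspecified value, as in
    CoNLL-U.
    """
    chars = list(form)
    tags = [f"UPOS={upos}"]
    if xpos and xpos != "_":
        tags.append(f"XPOS={xpos}")
    if feats and feats != "_":
        # CoNLL-U separates individual features with "|".
        tags.extend(feats.split("|"))
    return chars + tags


if __name__ == "__main__":
    source = build_input_sequence("barked", "VERB", "_", "Tense=Past|VerbForm=Fin")
    print(" ".join(source))
```

Under this assumed encoding, the decoder would then be trained to emit the lemma as a character sequence (here `b a r k`), so the same surface form can map to different lemmas when its tags differ.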
