Orð og tunga - 01.06.2007, Page 115
Sigrún Helgadóttir: Mörkun íslensks texta
mark, mörkun, markari
part-of-speech tag, tagging, tagger
This paper reports results on the automatic tagging of Icelandic text, using a corpus
that was prepared for the making of the Icelandic Frequency Dictionary. The corpus
contains 590,297 running words with 59,358 word forms, including punctuation. Each
running word has been assigned a morphosyntactic tag, and the tagset contains
639 tags, including punctuation tags. Five different data-driven taggers, fnTBL,
TnT, MXPOST, µ-TBL and MBT, were trained on the corpus using ten-fold cross-validation.
The TnT tagger obtained the best result, 90.36% tagging accuracy. The
TnT and fnTBL systems allow the use of a backup lexicon. When using such a lexicon
TnT reached 91.54% tagging accuracy and fnTBL 90.06%. Methods for combining the
results of the taggers were also tested. A voting method in which each tagger's vote is
weighted by its overall precision gave the best result of the voting methods tested, 91.54% accuracy. By
utilizing the ability of the MXPOST tagger to distinguish between noun cases, rules
were composed to increase tagging accuracy to 91.81%. By using a special strategy for
simplifying tags, the TnT tagger gave 91.83% tagging accuracy. Finally, the different
strategies for improving tagging accuracy were applied in a certain order. The best
result, 93.65% accuracy, was obtained by tagging with fnTBL and TnT using a backup lexicon,
simplifying the resulting tags, voting between the simplified tags and applying
rules based on MXPOST. Compared with the result obtained with TnT alone,
the number of errors is reduced by 34%. By using a lexicon derived from the
Morphological Description of Modern Icelandic as a backup lexicon, the accuracy can be
further increased. Finally, an experiment was carried out on tagging texts that are not part
of the corpus of the Icelandic Frequency Dictionary.
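
The precision-weighted voting and the 34% error reduction mentioned above can be illustrated with a minimal sketch. It is not the author's implementation. The function names, the tag strings and the MXPOST weight are assumptions made for illustration only. The TnT and fnTBL weights are the backup-lexicon accuracies quoted in the abstract.

```python
from collections import defaultdict


def weighted_vote(proposals, weights):
    """Pick one tag for a token from several taggers' proposals.

    proposals: dict mapping tagger name -> proposed tag for the token.
    weights:   dict mapping tagger name -> overall accuracy of that tagger.
    Each tagger adds its overall accuracy to the score of the tag it proposes.
    On an exact tie, the tag that was proposed first wins (an arbitrary choice).
    """
    scores = defaultdict(float)
    for tagger, tag in proposals.items():
        scores[tag] += weights[tagger]
    return max(scores, key=lambda tag: scores[tag])


def relative_error_reduction(baseline_accuracy, improved_accuracy):
    """Fraction of baseline errors removed; accuracies are percentages."""
    baseline_errors = 100.0 - baseline_accuracy
    improved_errors = 100.0 - improved_accuracy
    return (baseline_errors - improved_errors) / baseline_errors


# TnT and fnTBL weights come from the abstract (with backup lexicon).
# The MXPOST weight is a hypothetical placeholder, not a reported figure.
weights = {"TnT": 91.54, "fnTBL": 90.06, "MXPOST": 89.0}

# Illustrative tag strings for one token; two taggers outvote the third.
proposals = {"TnT": "nkeo", "fnTBL": "nken", "MXPOST": "nken"}
chosen = weighted_vote(proposals, weights)

# Accuracies from the abstract: TnT alone (90.36%) versus best pipeline (93.65%).
reduction = relative_error_reduction(90.36, 93.65)
print(chosen, round(reduction * 100))
```

With the accuracies from the abstract, the error rate falls from 9.64% to 6.35%, which is the roughly 34% relative reduction reported.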
Sigrún Helgadóttir
Stofnun Árna Magnússonar í íslenskum fræðum
Neshaga 16
IS-107 Reykjavík