Orð og tunga - 01.06.2015, Síða 143
Jón Friðrik og Kristín: Kvistur: Vélræn stofnhlutagreining
Kristín Bjarnadóttir. 2005. Afleiðsla og samsetning í generatífri málfræði og greining
á íslenskum gögnum. Reykjavík: Orðabók Háskólans.
Kristín Bjarnadóttir. 2012. The Database of Modern Icelandic Inflection.
Proceedings of the workshop Language Technology for Normalization of
Less-Resourced Languages, SaLTMiL 8 - AfLaT, LREC 2012, bls. 13-18. Istanbúl.
Magnús Snædal. 1992. Hve langt má orðið vera? Íslenskt mál 14:173-207.
Mannanafnanefnd. 2005. Úrskurður nr. 59/2005.
http://www.urskurdir.is/DomsOgKirkjumala/Mannanafnanefnd/2005/06.
Mörkuð íslensk málheild (MÍM). Ritstj. Sigrún Helgadóttir. mim.arnastofnun.is.
Ritmálssafn Orðabókar Háskólans.
http://www.arnastofnun.is/page/gagnasofn_ritmalssafn.
Schiller, A. 2005. German Compound Analysis with wfsc. Proceedings of the
Fifth International Workshop of Finite State Methods in Natural Language
Processing (FSMNLP), bls. 239-246. Helsinki.
Sigrún Helgadóttir, Ásta Svavarsdóttir, Eiríkur Rögnvaldsson, Kristín
Bjarnadóttir og Hrafn Loftsson. 2012. The Tagged Icelandic Corpus
(MÍM). Proceedings of the SaLTMiL-AfLaT Workshop on "Language technology
for normalisation of less-resourced languages", 8th International Conference
on Language Resources and Evaluation (LREC 2012). Istanbúl.
timarit.is
Lykilorð
máltækni, orðskipting, stofnhlutagreining, samsett orð
Keywords
language technology, decompounding, constituent structure, Icelandic compounds
Abstract
Compounding is extremely productive in Icelandic and multi-word compounds are
common. The likelihood of finding previously unseen compounds in texts is thus
very high, which makes out-of-vocabulary words a problem in the use of NLP tools.
Kvistur, the decompounder described in this paper, splits Icelandic compounds and
shows their binary constituent structure. The probability of a constituent in an
unknown (or unanalysed) compound forming a combined constituent with either of its
neighbours is estimated, with the use of data on the constituent structure of over 240
thousand compounds from the Database of Modern Icelandic Inflection (Kristín
Bjarnadóttir 2012), and word frequencies from Íslenskur orðasjóður, a corpus of approx. 550
million words. Thus, the structure of an unknown compound is derived by
comparison with compounds with partially the same constituents and similar structure
in the training data. The granularity of the split returned by the decompounder is