Gripla - 20.12.2012, Page 353
Randall, Beth. 2005. CorpusSearch 2 Users Guide. Philadelphia: University of
Pennsylvania. <http://corpussearch.sourceforge.net/CS-manual/Contents.html>
Santorini, Beatrice. 2010. Annotation manual for the Penn historical corpora and the
PCEEC. Philadelphia: University of Pennsylvania.
<http://www.ling.upenn.edu/hist-corpora/annotation/index.html>
Sapp, Christopher D. 2011. "A Relative Pronoun in Old Norse?" Paper presented at
DiGS 13, University of Pennsylvania, Philadelphia, 5 June.
Wallenberg, Joel, Anton Karl Ingason, Einar Freyr Sigurðsson and Eiríkur
Rögnvaldsson. 2011. Icelandic Parsed Historical Corpus (IcePaHC). Version
0.9. <http://www.linguist.is/icelandic_treebank>
SUMMARY
The Icelandic Parsed Historical Corpus
Keywords: treebank, parsing, historical corpus, diachronic syntax, language
technology.
The article describes the background to and construction of the Icelandic Parsed
Historical Corpus (IcePaHC), a million-word parsed historical corpus of Icelandic that
has just been completed (Wallenberg et al. 2011). The corpus contains fragments
of 60 texts ranging from the late twelfth century to the present day and serves a
dual purpose: it is both a cornerstone of Icelandic language technology and
an invaluable tool for research on Icelandic diachronic syntax.
The corpus is unusual in many ways. First, it was designed to serve as a tool
for both language technology and syntactic research, and was developed by scholars
with research experience in both diachronic syntax and computational linguistics.
Secondly, the corpus spans almost ten centuries: the oldest texts were written in
the final decades of the twelfth century and the youngest are from the first decade
of the present century. Thirdly, the corpus contains over one million words and
is thus among the largest parsed corpora that have been published for any
language. Fourthly, access to the corpus is completely open and free, requiring
no registration or paperwork, and the same is true of all the software used
in its construction and of other software developed within the project.
In the present paper, we follow the Introduction by describing the background
to the treebank, whose origins lie in three different projects. Several aspects of the
material in the treebank are then discussed – the selection of texts, their quality,
and their conversion to modern Icelandic spelling. We then explain our decision
to build a Penn-style treebank and offer an overview of the annotation process.
Following a case study which shows how the treebank can be used to investigate
SÖGULEGI ÍSLENSKI TRJÁBANKINN