Orð og tunga - 26.04.2018, page 52
Kendra Willson: Splitting the atom 41
2 The Written Language Archive and uses of
traditionally sampled lexical archives
This paper is based primarily on data from the Written Language
Archive (Ritmálssafn Orðabókar Háskólans, ROH) of the Árni Magnús-
son Institute for Icelandic Studies. ROH contains a total of two mil-
lion examples representing some 600,000 headwords, taken from a
broad range of Icelandic texts in different genres dating from 1540
onward (starting from Oddur Gottskálksson’s translation of the New
Testament, used as a cut-off point for the distinction between Old
and Modern Icelandic). The examples are classified by period (from
the mid-sixteenth to the late twentieth century). A large part of the
archive is available online; the rest is on paper slips in Reykjavík
(some examples are found only in the digital archive). The archive
thus provides a broad cross-section of word usage in texts of different
types. Most examples appear in the database with an immediate
context (a sentence or two) but also a reference to the original source,
which makes it possible to recover a broader context. Because the
archive provides a manageable number of examples, each attestation
can be examined individually. It can thus be used for pilot studies
that then proceed either in a quantitative direction (working with
larger corpora) or in a qualitative direction (exploring in greater
depth the use of words in specific texts). Although the data are
limited, they can be used to explore patterns of coinage and semantic
change.
ROH is a lexicographic database based on traditional sampling,
a hand-collection method widely used in the days before electronic
corpora and processing. Traditional lexicographic sampling involves
hand-skimming sources and noting examples which strike the lexi-
cographer’s eye; they may be perceived as representative, or as dis-
tinctive or novel. Card archives such as this one form the basis for
many lexicographic projects of the twentieth century. The selection
mode favors novel uses and nonce formations. It does not reflect
statistical frequency in usage but can give a sense of the lexical potential
of words and morphemes.
Čermák (2003:18–19) discusses ways in which traditional lexicographic
sampling differs from modern corpus linguistics. He criticizes
hand-sampling as unreliable, as humans may not be reliable judges of
“typical” or “specific” uses of words, although research suggests that
speakers are sensitive to frequency of words and collocations (e.g.