Gripla - 2020, Side 31
a different result. We begin in the first test with MFW set to 100. Second,
we may wish to only consider words which appear in a certain number
of documents and remove the other words from the list of frequencies.
This prevents words which are unique to one or more documents from
contributing to the result. This is known as the “culling” parameter. With
culling set to 100%, a word must be present in every single document to
be included. This would allow us to remove the influence of anomalous
words which appear in one redaction but not the other, focusing instead
on more general patterns. But it has the possible downside of eliminating
words which may be truly characteristic of a redaction. We begin with
culling set to 100%. After applying these two parameters, the resulting list
of frequencies is normalized as z-scores, and the distances between
the documents are computed from these z-scores using the cosine distance
metric.63 This results in a number between 0 and 2, with 0 indicating that
two documents have identical profiles, 1 indicating that their profiles are
unrelated, and 2 indicating that their profiles are exactly opposed.
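
The procedure described above (most-frequent-word selection, culling, z-score normalization, cosine distance) can be illustrated with a minimal Python sketch. This is not the R implementation used in the study: the function names (`select_features`, `zscore_profiles`, `cosine_distance`) and the toy documents are hypothetical, and several details are assumptions made for illustration, namely that culling is applied before the most frequent words are selected, that relative frequencies are computed over each document's token count, and that z-scores use the sample standard deviation.

```python
from collections import Counter
from math import sqrt
from statistics import mean, stdev


def select_features(docs, mfw=100, culling=100):
    """Return up to `mfw` of the most frequent words that occur in at least
    `culling` percent of the documents (assumed order: cull, then rank)."""
    doc_vocab = [set(tokens) for tokens in docs.values()]
    corpus_counts = Counter()
    for tokens in docs.values():
        corpus_counts.update(tokens)
    kept = []
    for word, _ in corpus_counts.most_common():
        share = 100 * sum(word in vocab for vocab in doc_vocab) / len(docs)
        if share >= culling:
            kept.append(word)
    return kept[:mfw]


def zscore_profiles(docs, features):
    """Map each document to a vector of z-scored relative frequencies."""
    rel = {}
    for name, tokens in docs.items():
        counts = Counter(tokens)
        rel[name] = [counts[w] / len(tokens) for w in features]
    profiles = {name: [] for name in docs}
    for i in range(len(features)):
        column = [rel[name][i] for name in docs]
        mu, sigma = mean(column), stdev(column)
        for name in docs:
            # A word with identical frequency everywhere carries no signal.
            z = (rel[name][i] - mu) / sigma if sigma > 0 else 0.0
            profiles[name].append(z)
    return profiles


def cosine_distance(u, v):
    """Return 1 minus the cosine similarity, a value between 0 and 2."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = sqrt(sum(a * a for a in u))
    norm_v = sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        raise ValueError("cosine distance is undefined for a zero vector")
    return 1 - dot / (norm_u * norm_v)


# Toy documents (hypothetical), tokenized by whitespace.
docs = {
    "A": "the king rode and the king spoke".split(),
    "B": "the queen rode and the queen sang".split(),
    "C": "and the king sang and the queen spoke".split(),
}
features = select_features(docs, mfw=100, culling=100)
profiles = zscore_profiles(docs, features)
for a, b in [("A", "B"), ("A", "C"), ("B", "C")]:
    print(a, b, round(cosine_distance(profiles[a], profiles[b]), 3))
```

With culling at 100%, only the words present in all three toy documents ("the" and "and") survive. Documents A and B use them in identical proportions and so lie at a distance of approximately 0, while C uses them in the opposite proportion relative to the corpus mean and lies at approximately 2 from both.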
In Figure 2, we observe the distances from A-divergent (in dark
gray) and C-divergent (in light gray) to A-parallel (on the left-hand side)
and C-parallel (on the right-hand side). As a reminder, the smaller the
number, the more related the documents are stylometrically. Thus, the
two closest documents are C-parallel and C-divergent, which have a cosine
distance of 1.383, while the least similar documents are A-parallel and
A-divergent, with a cosine distance of 1.534. In this experiment, it turns out
that C-divergent is closer to A-parallel than A-divergent is, with a cosine
distance of 1.497. Meanwhile, A-divergent is slightly closer to C-parallel
than it is to A-parallel, with a cosine distance of 1.519. The results of this
initial investigation thus do not support our hypothesis.
Instead of texts of the same version being more similar to one another, we
observe that C-divergent is more similar to both parallel texts than A-divergent
is. Taken at face value, this means that the C-redaction would be the most
63 See Jannidis et al., “Improving Burrows’ Delta – An empirical evaluation of text distance
measures,” Book of Abstracts of the Digital Humanities Conference 2015, ADHO, UWS (2015)
for a full description. In this work, the authors demonstrate that this metric outperforms
other nearest-neighbor methods, making it a good fit for our present study. The stylometry
is implemented in R using the stylo package; see M. Eder, J. Rybicki, and M. Kestemont,
“Stylometry with R: a package for computational text analysis,” The R Journal 8.1 (2016):
107–21. https://journal.r-project.org/archive/2016/RJ-2016-007/index.html.