The Roget22 project is my second-oldest project, yet it hasn’t made it
out into the world. Its goal is to produce a beautiful, maintained,
electronic thesaurus which is as much a pleasure to peruse as the
It got stuck several years ago when I painted myself into a corner
with my chosen data representation, and then got frustrated with the
amount of work I had created for myself as a result.
But I have picked it back up, after an actual, bolt-from-the-blue
style fit of inspiration. I’m now taking a radically different – and
simpler – approach. This is going extremely well, and in a small
number of hours across the past two weeks, my new codebase has caught
up with the old one (in terms of functionality), and today began to
Now that the technical side of the project is up and running again, I
want to talk about the editorial issues that I’m facing, which I think
are far more interesting.
The corpus is a mess. My starting point was Project Gutenberg’s text
copy of the 1913 edition of Roget’s Thesaurus. And I am very grateful
to have that as a beginning. However…
Project Gutenberg got the text file by OCRing a physical book
in 1984. If you had asked me if OCR was even possible in 1984, my
first response would have been “of course not,” but it’s right there
in the acknowledgements. There are a lot of OCR errors, the most
insidious of which have to do with punctuation.
This is problematic because Roget used a scheme of standardized
punctuation to impose and communicate structure within the corpus. So
every punctuation error automatically becomes an error in the
structure of the document. Errors of this class are frustrating
because of their number, but are easy to resolve.
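To make the point concrete, here is a minimal sketch in Python of how a punctuation-driven parser might read one paragraph of an entry. The convention it assumes (semicolons between groups of related terms, commas between terms within a group, a period at the end) is a simplification made for illustration. The function name `parse_sense` and the sample lines are hypothetical, not taken from the project's codebase or from the corpus.

```python
def parse_sense(text: str) -> list[list[str]]:
    """Split one paragraph of an entry into groups of terms.

    Assumed convention (illustrative only): semicolons separate groups,
    commas separate terms within a group, a trailing period ends the
    paragraph.
    """
    body = text.strip().rstrip(".")
    groups = []
    for chunk in body.split(";"):
        # Drop empty fragments left behind by doubled or trailing punctuation.
        terms = [t.strip() for t in chunk.split(",") if t.strip()]
        if terms:
            groups.append(terms)
    return groups


# Hypothetical sample lines: the second has one comma misread as a semicolon.
clean = "telephone, phone; wireless, radio."
damaged = "telephone; phone; wireless, radio."

print(parse_sense(clean))    # [['telephone', 'phone'], ['wireless', 'radio']]
print(parse_sense(damaged))  # [['telephone'], ['phone'], ['wireless', 'radio']]
```

A single misread character turns two groups into three, which is why every punctuation error is also a structural error.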
Lack of care
Much more difficult to resolve are the issues created by the
additions made to the corpus by random, unattributed PG volunteers
over the years. I’m not talking about correcting OCR errors; I’m
talking about the many, many terms which have been added to the text.
The problem with these additions is manifold:
- They are not flagged in any way, so they cannot be mechanically
  removed.
- They are not evenly distributed. Some headwords have been left
  entirely alone; some have had one or two terms added; others have
  had several large lists of words tacked on.
- They are, largely, not thoughtful. Roget was beyond thoughtful and
  careful in his classification and arrangement of words. The work of
  the editors of modern paper editions of the thesaurus shows the same
  care. The additions made by PG volunteers do not. In many cases,
  terms are added in places where they nearly mean the thing being
  described, which might be the worst sin one could commit against
  Roget. Other cases are utterly thoughtless, exemplified by a giant
  list of slang terms for “penis”.
- Sometimes they are simply wrong. I have found many adverbial
  clauses added as adjectives, as an example. If you care about
  language at all, it is not difficult to discriminate between an
  adjective and an adverb.
The “list senses” are particularly bad about lack-of-care
issues. Here’s one as an example.
[devices for talking beyond hearing distance: list] telephone,
phone, telephone booth, intercom, house phone, radiotelephone,
radiophone, wireless, wireless telephone, mobile telephone, car
radio, police radio, two-way radio, walkie-talkie [Mil.],
handie-talkie, citizen’s band, CB, amateur radio, ham radio,
short-wave radio, police band, ship-to-shore radio, airplane radio,
control tower communication; (communication) 525, 527, 529, 531,
532; electronic devices
Some of these seem perfectly fine. Some (“house phone”? “car radio”
separate from “CB”? also, that should be “CB radio”) need a bit of
investigation or tweaking. But then… yes, “walkie-talkie” was
originally military jargon, but that hasn’t been true for over 70
years now. And what even is this:
(communication) 525, 527, 529, 531, 532;
I know that I’m likely the only one here who is steeped in the
formatting of this work, but please trust me when I say that this
looks nothing like anything that occurs anywhere else in the
corpus. My best guess at an interpretation is that the five listed
headwords have something to do with communication, so why not just
throw them all in as references for “devices for talking beyond
While we’re at it, there’s a word for “talking beyond hearing
distance”: telecommunications. You’re going to add six different kinds
of radios (including “two-way” and “ship-to-shore”, which are more
classes than devices – ugh! it never ends!) but you’re not going
to change that incredibly stilted bundle of words to
One quirk that is the fault of Roget himself is that many headwords
include a list of what he termed “phrases”.
To begin with, most of these phrases would more accurately be
described as “quotes”, but that’s a nit-pick rather than an actual
The real problems start with the fact that the overwhelming majority of
these quotes/aphorisms/etc. are not in English. The full title of the
work includes “of the English language”, so why are there hundreds of
quotes in Latin (so much Virgil!), German, French, and Spanish? Well,
because Roget was an educated Englishman working in the mid-to-late
nineteenth century. But historical context, while perhaps
illuminating, doesn’t make all these quotes useful to a modern,
There is substantial crossover with the previous issue here, as
well. The phrase section of many headwords has clearly been used as a
dumping ground by volunteer editors. In addition to all the possible
problems you might be imagining, I have also found a large number of
single words dumped into this section – single words, by
definition, not being phrases.
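The single-word case, at least, is easy to detect mechanically. A minimal sketch in Python, assuming the phrase section of a headword has already been split into a list of strings; `single_word_entries` and the sample list are hypothetical, and the sample is not an actual entry from the corpus:

```python
def single_word_entries(phrases: list[str]) -> list[str]:
    """Return the entries of a phrase list that consist of a single word.

    A hyphenated compound such as "walkie-talkie" contains no whitespace,
    so it counts as one word here.
    """
    return [p for p in phrases if len(p.split()) == 1]


# Hypothetical phrase section for illustration.
sample = ["If it ain't broke, don't fix it", "telephone", "walkie-talkie"]

print(single_word_entries(sample))  # ['telephone', 'walkie-talkie']
```

This only finds candidates. Deciding where each stray word actually belongs is still editorial work.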
Even among the phrases/quotes which are actually from Roget, and are
actually in English, there are problems. Most of these are from
English literature, and at least a few of them are problematic to
modern sensibilities. Also, in a decent number of cases, they are
attributed not to their authors, but to the works from which they
come. The issues described in this paragraph are relatively easy to
fix, but I almost find it fascinating how much trouble this single
section can be.
Finally, continuing on the theme of being offensive, a nonzero number
of quotes were added by some PG volunteer with a fondness for
politicians from the southern US. One (“If it ain’t broke, don’t fix
it”) isn’t problematic and the person it’s attributed to (though he
popularized it rather than coining it) also didn’t do anything
objectionable according to his wiki entry. But I found another by
noted racist George Wallace. I don’t yet know what else is lurking
Yes, there’s more. As one might expect from a work compiled by an
English gentleman of the nineteenth century, there’s a decent amount
of general “oh god don’t say that” scattered about. And then there are
the giant lists of words for “poop” and “boners” that PG volunteers
added (which in my mind are problematic because they commit the sin of
being pointless and low-effort, but I already talked about that).
I had actually been feeling torn on this topic for a while: is the
right thing to keep these usages, but flag them as offensive and
caution against use? Or should they simply be thrown out, historicity
be damned? At what point does one cross the line into Bowdlerization?
Then, in the course of fixing a completely unrelated issue, I stumbled
on the term “worth a Jew’s eye”, which manages to be fantastically
evocative in a particularly horrible and racist way.
And suddenly I had clarity: generalized offensiveness, like “dirty
words”, will stay and be flagged in order to provide guidance to
readers. But this will still be handled carefully, because slang
dictionaries already exist, and this work doesn’t need to become that
thing. This is to be a thoughtfully curated work that embraces the
breadth and depth of language, not Roget’s Urbandictionary.
Meanwhile, anything that involves racism, sexism, or any other sort
of harm against people will be removed.
Bringing it up to date
This is the biggest problem, but it’s also the easiest to state and
needs the least explanation: there’s a century of words missing.
So that’s the broad strokes of the challenges ahead. No, I don’t plan
to do this alone. Yes, I do have a plan to try to make it easier. But
we can talk about that later.