Reboot
Roget22
Published: 2021-02-22

The Roget22 project is my second-oldest project, yet it hasn’t made it out into the world. Its goal is to produce a beautiful, maintained, electronic thesaurus which is as much a pleasure to peruse as the original.

It got stuck several years ago after I painted myself into a corner with my chosen data representation, and then got frustrated with the amount of work I had created for myself.

But I have picked it back up, after an actual, bolt-from-the-blue style fit of inspiration. I’m now taking a radically different – and simpler – approach. This is going extremely well, and in a small number of hours across the past two weeks, my new codebase has caught up with the old one (in terms of functionality), and today began to surpass it.

Now that the technical side of the project is up and running again, I want to talk about the editorial issues that I’m facing, which I think are far more interesting.

Validity

The corpus is a mess. My starting point was Project Gutenberg’s text copy of the 1913 edition of Roget’s Thesaurus. And I am very grateful to have that as a beginning. However…

Project Gutenberg got the text file by OCRing a physical book in 1984. If you had asked me if OCR was even possible in 1984, my first response would have been “of course not,” but it’s right there in the acknowledgements. There are a lot of OCR errors, the most insidious of which have to do with punctuation.

This is problematic because Roget used a scheme of standardized punctuation to impose and communicate structure within the corpus. So every punctuation error automatically becomes an error in the structure of the document. Errors of this class are frustrating because of their number, but are easy to resolve.
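To make that concrete, here is a minimal sketch, not the project's actual parser. It assumes that semicolons separate groups of related terms and that commas separate terms within a group. That convention is a guess based on the sample entry quoted further down, and `parse_entry` is a hypothetical helper name.

```python
def parse_entry(text: str) -> list[list[str]]:
    """Split an entry into groups (on ';') of terms (on ',').

    Assumed convention, for illustration only: ';' ends a group
    of related terms and ',' separates the terms inside a group.
    """
    groups = []
    for chunk in text.split(";"):
        terms = [t.strip() for t in chunk.split(",") if t.strip()]
        if terms:
            groups.append(terms)
    return groups


clean = "telephone, phone, intercom; wireless, two-way radio"
# Same text, but one ',' has been misread as ';' by the OCR.
damaged = "telephone, phone; intercom; wireless, two-way radio"

print(parse_entry(clean))    # two groups
print(parse_entry(damaged))  # three groups: "intercom" is now on its own
```

A single misread character changes the number of groups, even though every word survived intact.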

Lack of care

Much more difficult to resolve are the issues created by the additions made to the corpus by random, unattributed PG volunteers over the years. I’m not talking about correcting OCR errors; I’m talking about the many, many terms which have been added to the text.

The problem with these additions is manifold:

  • They are not flagged in any way, so they cannot be mechanically removed

  • They are not evenly distributed. Some headwords have been left entirely alone; some have had one or two terms added; others have had several large lists of words tacked on

  • They are, largely, not thoughtful. Roget was beyond thoughtful and careful in his classification and arrangement of words. The work of the editors of modern paper editions of the thesaurus shows the same care. The additions made by PG volunteers do not. In many cases, terms are added in places where they nearly mean the thing being described, which might be the worst sin one could commit against Roget. Other cases are utterly thoughtless, exemplified by a giant list of slang terms for “penis”.

  • Sometimes they are simply wrong. I have found many adverbial clauses added as adjectives, as an example. If you care about language at all, it is not difficult to discriminate between an adjective and an adverb.

The “list senses” are particularly bad about lack-of-care issues. Here’s one as an example.

[devices for talking beyond hearing distance: list] telephone, phone, telephone booth, intercom, house phone, radiotelephone, radiophone, wireless, wireless telephone, mobile telephone, car radio, police radio, two-way radio, walkie-talkie [Mil.], handie-talkie, citizen’s band, CB, amateur radio, ham radio, short-wave radio, police band, ship-to-shore radio, airplane radio, control tower communication; (communication) 525, 527, 529, 531, 532; electronic devices

Some of these seem perfectly fine. Some (“house phone”? “car radio” separate from “CB”? also, that should be “CB radio”) need a bit of investigation or tweaking. But then… yes, “walkie-talkie” was originally military jargon, but that hasn’t been true for over 70 years now. And what even is this:

(communication) 525, 527, 529, 531, 532;

I know that I’m likely the only one here who is steeped in the formatting of this work, but please trust me when I say that this looks nothing like anything that occurs anywhere else in the corpus. My best guess at an interpretation is that the five listed headwords have something to do with communication, so why not just throw them all in as references for “devices for talking beyond hearing distance”?

While we’re at it, there’s a word for “talking beyond hearing distance”: telecommunications. You’re going to add six different kinds of radios (including “two way” and “ship to shore”, which are more classes than devices – ugh! it never ends!) but you’re not going to change that incredibly stilted bundle of words to “telecommunication devices”?

“Phrases”

One quirk that is the fault of Roget himself is that many headwords include a list of what he termed “phrases”.

To begin with, most of these phrases would more accurately be described as “quotes”, but that’s a nit-pick rather than an actual problem.

The real problems start with the fact that the overwhelming majority of these quotes/aphorisms/etc are not in English. The full title of the work includes “of the English language”, so why are there hundreds of quotes in Latin (so much Virgil!), German, French, and Spanish? Well, because Roget was an educated Englishman working in the mid-to-late nineteenth century. But historical context, while perhaps illuminating, doesn’t make all these quotes useful to a modern, English-speaking audience.

There is substantial crossover with the previous issue here, as well. The phrase section of many headwords has clearly been used as a dumping ground by volunteer editors. In addition to all the possible problems you might be imagining, I have also found a large number of single words dumped into this section – single words, by definition, not being phrases.

Even among the phrases/quotes which are actually from Roget, and are actually in English, there are problems. Most of these are from English literature, and at least a few of them are problematic to modern sensibilities. Also, in a decent number of cases, they are attributed not to their authors, but to the works from which they come. The issues described in this paragraph are relatively easy to fix, but I almost find it fascinating how much trouble this single section can be.

Finally, continuing on the theme of being offensive, a nonzero number of quotes was added by some PG volunteer with a fondness for politicians from the southern US. One (“If it ain’t broke, don’t fix it”) isn’t problematic and the person it’s attributed to (though he popularized it rather than coining it) also didn’t do anything objectionable according to his wiki entry. But I found another by noted racist George Wallace. I don’t yet know what else is lurking in there.

General offensiveness

Yes, there’s more. As one might expect from a work compiled by an English gentleman of the nineteenth century, there’s a decent amount of general “oh god don’t say that” scattered about. And then there are the giant lists of words for “poop” and “boners” that PG volunteers added (which in my mind are problematic because they commit the sin of being pointless and low-effort, but I already talked about that).

I had actually been feeling torn on this topic for a while: is the right thing to keep these usages, but flag them as offensive and caution against use? Or should they simply be thrown out, historicity be damned? At what point does one cross the line into Bowdlerization?

Then, in the course of fixing a completely unrelated issue, I stumbled on the term “worth a Jew’s eye”, which manages to be fantastically evocative in a particularly horrible and racist way.

And suddenly I had clarity: generalized offensiveness, like “dirty words”, will stay, and be flagged in order to provide guidance to readers. But this will still be handled carefully, because slang dictionaries already exist, and this work doesn’t need to become that thing. This is to be a thoughtfully curated work that embraces the breadth and depth of language, not Roget’s Urbandictionary.

Meanwhile, anything that involves racism, sexism, or any other sort of harm against people will be removed.

Bringing it up to date

This is the biggest problem, but it’s also the easiest to state and needs the least explanation: there’s a century of words missing.


So that’s the broad strokes of the challenges ahead. No, I don’t plan to do this alone. Yes, I do have a plan to try to make it easier. But we can talk about that later.