One of my longest-running projects is about to come out of dormancy, again. The thing that would eventually get the name “Roget22” started in, I believe, 2004, when I stumbled on the OCR’d text of the 1921 edition of Roget’s Thesaurus on the Project Gutenberg site.

For the unfamiliar, I point you toward the Wikipedia entry for Project Gutenberg (hereafter “PG”).

According to the credits in the file, the original OCR job had been done in 1984 at MICRA Corp. However, it turns out that MICRA is still around, and their own website contradicts this on a few grounds:

  • They state that while they did contribute a version of Roget’s to PG, it was in 1988 rather than 1984 (it’s possible I misremembered the date in the attribution, but I don’t think I did, because I remember thinking that having a scanner and OCR software in 1984 would have been an incredible novelty, and fantastically expensive)
  • They state that the edition they donated was a 1911, rather than a 1921
  • Finally, they state that they did not OCR it, but rather typed it all by hand, and that the file they contributed was superseded some years later by one sourced from OCR. That later file would have been the one I was using

Maybe you’re surprised that an electronic version of one of the foundational reference works of the English language – housed in what aspired to be the grandest repository of freely-available books on the internet – would contain so many errors, just in the front matter! But having worked with it for years, I am not surprised in the least.

Early days

My first thought upon discovering the existence of this file was to see if anyone had put it on the web as a searchable resource. The answer was yes, but poorly.

I have always personally loved words and language, and on the professional side I love the design of information. Roget’s thesaurus is a wonderful joining of those worlds. This isn’t the place for (yet another) diatribe on the superiority of Roget’s original design over the now-more-common dictionary-style format. But if you are a lover of words, it is possible to get lost in its pages, reading “up” or “down” from the term you originally looked up, soaking in the nuances of meaning.

And what I found online, circa 2002, were chop jobs that simply broke the text file into rough chunks and presented whichever chunks matched a simple text search. But

  1. …that’s not how the index of a proper Roget’s presents things, and
  2. …it completely loses the joyous browsability of paper editions.
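
For illustration, here’s a minimal sketch (in Python, with made-up splitting logic and a hypothetical filename – I don’t know exactly how those sites were implemented) of the kind of chunk-and-grep approach I’m describing:

    import re

    # Split the raw Gutenberg text into rough "entries" on blank lines --
    # a crude approximation of how those early sites seemed to chunk it.
    def load_chunks(path: str) -> list[str]:
        with open(path, encoding="utf-8") as f:
            text = f.read()
        return [c.strip() for c in re.split(r"\n\s*\n", text) if c.strip()]

    # A plain substring match: no index, no sense of Roget's category
    # hierarchy, no cross-references -- just "does the word appear here?"
    def naive_search(chunks: list[str], term: str) -> list[str]:
        return [c for c in chunks if term.lower() in c.lower()]

    if __name__ == "__main__":
        chunks = load_chunks("roget.txt")  # hypothetical filename
        for hit in naive_search(chunks, "improvement"):
            print(hit)
            print("---")

Every result is just a raw slab of text; nothing about it reflects the structure Roget actually built.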

So I set to work, trying to create something that would give an experience closer to what I loved about the original.

Problems

It’s very, very fair to ask what nigh-insurmountable technical issues have prevented me from producing a solution to that problem over the intervening twenty-odd years. And the answer is that there aren’t any.

The problems lie entirely within the source file from PG.

I will spare you the voluminous details. The core issue is that the file I got from PG was not the file that was uploaded in the late 80s by MICRA, nor was it the file uploaded in the early 90s by whoever did the OCR job.

I would have assumed that PG would present texts as-is, unchanging and inviolate, excepting perhaps some initial copyediting phase before acceptance. But this has clearly not been the case with the Roget’s text. The file itself admits to this in some ways, noting that a contributor created an index at some point, making things more searchable – and that is certainly not a problematic addition.

But what I discovered over the years was a large series of apparently uncontrolled and uncoordinated edits to the original text.

Sometimes it would be a single word added in, standing out because of Roget’s extreme care and attention to shades of meaning and sense. Sometimes it would be a list of terms, dozens long, slugged in because of someone’s good intentions to “update” the text with things that didn’t exist in 1921 – like computing terms. The problem is that the thesaurus is not, either in its early editions or in more modern ones, a jargon list or technical dictionary. And there were dozens more cases, of every possible sort.

The more I tried to clean things up – and the more historical editions of Roget’s that I accumulated to act as guides for what was original/correct – the more issues I found.

I would become demoralized by this, and then after a while I would become ashamed of having gone for so long with no work on the project. Then I would do some more work, targeting a specific class of problems or trying a new mitigation strategy or toolchain. But inevitably, I would get back to the beginning of the cycle, where I would find yet another set of issues that needed cleaning before I could get around to the whole point of the exercise: making the thesaurus more available to more people.

Eventually, I realized that despite my goal and all my work, what I had actually produced was a text which ever more closely approached the original corpus, while being embodied in a representation that drifted further and further from human readability. And okay, that last part was neat in a technically abstract way, but…

A new strategy

Somewhat fancifully, acquiring a book scanner and digitizing my collection has been on my lifetime TODO list since the late 1990s. I don’t own many (or possibly any?) monetarily valuable books, but I have a number that aren’t super easy to find copies of.

A few years ago I discovered that a new class of book scanners had become available: basically modern, high-pixel-count digital cameras working in conjunction with software to de-curve and de-skew page images. These cost a lot less than the previous low end of the market, ~$800 vs ~$5000. However, at the time I learned this, I was living in a tiny one-bedroom apartment and there was a global pandemic happening. Also, about 90% of my books were living in a storage unit rather than in the apartment with me. Not the best time to start the book-scanning project.

Last year, during my last fit of trying to make something usable out of my previous Roget’s work, I had a realization: it would be less effort, at that point, to simply get a book scanner and produce my own clean source text. I had a couple days of internal sunk-cost-related struggles after this realization, but before too long I decided that it actually was a reasonable course of action. And I could then start digitizing my cookbooks.

I didn’t immediately act on this idea, however. As always, there were things going on in life besides my personal projects, and I still wasn’t in a place where it felt like it made sense to drop $800 on a book scanner.

Fast-forwarding to yesterday, I thought I’d check to see if the scanner maker had any new models or other updates. No new models, but the current ones now had a (slightly) cheaper list price, with an additional coupon available on Amazon.

So I ordered one.

Roget Restarts

I do not plan to throw away the technical infrastructure and tooling that I created over the years. My processing toolchain has gone through many iterations, each an improvement on the last. And the most recent changes were some of the most impactful ones since the early days of this effort, because they were less focused on the source text and instead addressed the Editing Problem – an issue I had long been kicking down the road.

(An aside: briefly, the Editing Problem is my name for the fact that the most efficient data representation of the thesaurus for viewing/consumption is close to the worst representation for editing and maintaining the corpus. But that’s another blog post.)
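
To make that slightly more concrete, here’s an entirely made-up illustration (not my actual formats) of the two poles, using category #658, Improvement, from the classic numbering:

    # Hypothetical consumption format: denormalized, with cross-references
    # resolved ahead of time. Great for serving lookups; miserable to
    # hand-edit, since one correction may need repeating in many places.
    entry_view = {
        "category": 658,
        "heading": "Improvement",
        "pos": "N",
        "words": ["improvement", "amelioration", "betterment"],
        "see_also": [{"category": 659, "heading": "Deterioration"}],
    }

    # Hypothetical editing format: flat and line-oriented, close to the
    # printed page, so a human can read, diff, and correct it easily.
    entry_edit = """\
    #658. Improvement.
      N. improvement, amelioration, betterment.
      (See also #659. Deterioration.)
    """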

I might not immediately jump back into the full text processing chain, but when the scanner arrives I am definitely going to go ahead and get the scanning workflow down. I expect I’ll have to fiddle with the software a bit, because pretty much all (paper) versions of the thesaurus are in a multi-column format, with older ones being less consistent than newer ones.

Still, the 1921 edition is a slender volume compared to modern editions, so the time needed once I have a basic working setup should be minor compared to the amount of time that has gone into this so far. I’m sure there’ll then be at least some small amount of image cleanup and/or tweaking. And then the OCR pass will happen. And then I’ll be back to machine-assisted transforms of the text.
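
I haven’t picked an OCR engine yet, but as a sketch of what that pass might look like – assuming Tesseract via the pytesseract wrapper, and a hypothetical layout of one cleaned-up image per page – something like:

    from pathlib import Path

    import pytesseract  # assumes the Tesseract binary is installed locally
    from PIL import Image

    # Hypothetical layout: one cleaned page image per file.
    pages = sorted(Path("scans/cleaned").glob("page_*.png"))

    with open("roget_1921_raw.txt", "w", encoding="utf-8") as out:
        for page in pages:
            # --psm 1: automatic page segmentation with orientation and
            # script detection, which copes reasonably well with the
            # multi-column layout the thesaurus uses.
            text = pytesseract.image_to_string(
                Image.open(page), lang="eng", config="--psm 1"
            )
            out.write(text + "\n\f\n")  # form feed as a page separator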

And then there’s just the work of writing backend and client software! And that merely involves learning a completely new language for the front-end app! It’s so insane that this is where I have ended up… but I still want to make this happen, just as badly as I ever have, and I genuinely love this stuff. At least now there’s only one remaining problem: me :)

Hardware chat!

Completely unrelated to everything above, I have some new project hardware news. Since last summer I have been paring back my in-house compute resources – another thing that deserves its own blog post. But this week I added a couple of things back.

Around the beginning of summer I tried an experiment of switching from a full-fat laptop to an iPad as my primary personal computing device. This did not work out as I had wanted, due to some limitations of iOS, and I went back to using a MacBook Air. But there were things I really liked about having a tablet, especially for information consumption.

So when the 7th gen iPad Minis dropped this week I picked one up. It’s now my reading-stuff-around-the-house device and our planning-out-trips-on-mapping-apps device. And, when on camping trips, my typing-out-thoughts-into-the-notes-app device, because I found a super-cute, tiny, cheap Bluetooth keyboard folio case for it.

And yesterday the new machine I ordered to take over from my old MBA as the server for my personal websites (like this one) arrived. It’s a tiny Minisforum box with a Ryzen 7000 series APU. More than enough horsepower to handle things for the next several years. I do plan to use it for other tasks down the road, but right now its job is web serving.

Last bit of compute news for now: on my TODO list for this weekend is donating more of the hardware from my years-long stint of volunteer grid computing. That’s not its own blog post, because it would be included in the one about my scaling back on hardware.

That’s it for now!