Roget22 Dev Diary 0003

Sometimes when you work on something long enough you get tunnel vision. This is a pretty well-known phenomenon, and I had a particularly intense episode of it around the Roget22 project – but I now see a clearer, easier path to making things happen.

Yes, compared to what I last wrote.

I don’t think the why of it is particularly difficult to describe or understand. Many years of dealing with bottomlessly problematic source material led to the following assumptions:

The entire corpus needed to be mostly cleaned up before further processing could occur
Problems which were determined to be systemic but regular enough could be punted on and handled by a toolchain, after all required manual cleanups were done
Therefore the toolchain was perpetually partially-done
…and there would, for the foreseeable future, be more cleanups of both types discovered and catalogued

These all fed into each other, very strongly reinforcing the view that the biggest problem, and also the root problem, was discovering the “bottom” of problems with the source text – which of course never happened to my satisfaction. And maybe that last part is a failing of my own, but every time I thought about just trying to get something out the door, I would circle back around to how deeply suboptimal it would be compared to what I could ship if I just put in more up-front work.

So sure, deciding to scrap all the work on the PG text and just scan and OCR myself solved the root problem. What clicked yesterday was that it also solved nearly every other problem.

The list of needed manual fixes? Gone. The lists of problems to be handled via multi-stage automated processing before the final production toolchain could be written? Gone. The fact that the toolchain could not be finalized until the input corpus was finalized? No longer true.

Several dozen entire classes of issues and blockers evaporated with the realization that not only will I have a clean source text, but that I can actually modify it to better suit my purposes as I am cleaning and prepping it for processing. (See the previous devlog for the kind of details I’m thinking of here. See the first Roget22 devlog for a discussion of the root issue itself.)

The relief was palpable.

So here’s what things are looking like right now:

There is a tiny bit of tooling that needs to happen right up front (reworking existing headword metadata) to support the kind of edits I want to be able to make in the raw source files
The plan as of the last time I worked on the processing toolchain was for the canonical corpus to be machine-generated, human-readable JSON
- The plan now is to do a one-time conversion from “raw” plaintext (simple, cleaned-up OCR files) to machine-generated “cooked” plaintext (fully processed, with all linkages and richer metadata and formatting)
- The cooked plaintext will now be the canonical corpus.
Taken together, this removes the dependency between completing the OCR + cleanup process and having portions of the corpus move into final toolchain processing
- …and also means that OCR + cleanup is no longer a blocker on development of the toolchain
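
As a rough illustration of the raw-to-cooked step described above, here is a minimal Swift sketch. Everything concrete in it is an assumption made up for the example: the `#<number>. <name>` headword convention for raw lines, the `@head id=… name=…` cooked form, and the names `Headword`, `parseHeadword`, and `cook`. None of these are the project's actual formats. The point is only that each raw file can be cooked on its own, without waiting for the rest of the corpus to be finished.

```swift
/// A headword parsed from a raw line.
/// ASSUMPTION: the "#<number>. <name>" shape is a hypothetical raw-file
/// convention invented for this sketch.
struct Headword: Equatable {
    let number: Int
    let name: String
}

/// Returns a Headword if `line` looks like "#12. Name", otherwise nil.
func parseHeadword(_ line: String) -> Headword? {
    guard line.hasPrefix("#") else { return nil }
    let body = line.dropFirst()                       // text after the "#"
    guard let dot = body.firstIndex(of: ".") else { return nil }
    guard let number = Int(body[..<dot]) else { return nil }
    // Everything after the dot, minus leading spaces, is the name.
    let name = String(body[body.index(after: dot)...].drop(while: { $0 == " " }))
    guard !name.isEmpty else { return nil }
    return Headword(number: number, name: name)
}

/// Left-pads a number with zeros to at least four digits.
func paddedID(_ number: Int) -> String {
    let digits = String(number)
    return String(repeating: "0", count: max(0, 4 - digits.count)) + digits
}

/// Converts the lines of one raw file into "cooked" lines.
/// Headword lines are rewritten with explicit metadata (hypothetical
/// "@head" form); all other lines pass through unchanged.
func cook(_ rawLines: [String]) -> [String] {
    return rawLines.map { (line: String) -> String in
        guard let head = parseHeadword(line) else { return line }
        return "@head id=\(paddedID(head.number)) name=\(head.name)"
    }
}
```

With these assumptions, `cook(["#1. Existence", "being, entity"])` returns `["@head id=0001 name=Existence", "being, entity"]`. A real version would also need to add the linkages and richer formatting mentioned above, which this sketch leaves out.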

The previous iteration of the toolchain (such as it was) was a collection of Python scripts. There’s nothing wrong with them, but I have changed the requirements so much that it makes more sense to do a rewrite rather than to try to hack the existing scripts into a different shape.

Which leads to the final interesting bit: I’m going to write the new tooling in Swift. I’ve been thinking about using it (and SwiftUI) for the final product for a while now, but I’ve more recently become aware that there’s a subset of Swift users who are using it for general-purpose and systems work – and it ports cleanly to Linux now.

This seems like a good task for doing some learning about the basics of the language (and the Swift Package Manager, which is like Go’s go.mod system but rather more strict and opinionated and less opt-out-able, from my reading so far). It also feels like the language is a good fit for the project, since I want to deliver the finished thing to Mac and iOS (at least initially).
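
For a sense of what the Package Manager expects, a minimal `Package.swift` manifest for a command-line tool might look like the sketch below. The package and target names (`Roget22Tools`, `roget22-cook`) are placeholders invented for this example, not names from the project. By default the sources for the target are expected under `Sources/roget22-cook/`, which is part of the "opinionated" layout.

```swift
// swift-tools-version:5.9
import PackageDescription

// Hypothetical manifest: one executable target, no external dependencies.
let package = Package(
    name: "Roget22Tools",
    targets: [
        // Sources are looked up in Sources/roget22-cook/ by convention.
        .executableTarget(name: "roget22-cook")
    ]
)
```

With a manifest like this in place, `swift build` compiles the tool and `swift run roget22-cook` runs it.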

Should be fun!