Roget22 Dev Diary 0002

I almost started scanning today, but I didn’t. Instead I sat down and started working on the hairiest problem not caused by PG’s edits over the years: references.

In a modern paper edition of Roget’s, a reference to another headword looks something like TERM 584.2, which means: for other terms similar in meaning to TERM, see sense 2 of headword 584.

This is largely inherited from Roget’s original design of argument &c 476, read as “for more terms like this meaning of ‘argument’, see headword 476”. But there are also variants like manifest &c (be evidence of) 467 where the reader is pointed into a specific sense of a headword by use of another term, because early editions did not number the senses within a headword.

Then there are terrible things like this: experimentum crucis &c (test) 463 [Lat.]

Terrible to parse, that is. Term, reference marker (&c), the term for the reader to go find, the headword to find it in, and then the marker that this term is (or is a loanword from) Latin… which isn’t part of the reference at all.
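To make the parsing problem concrete, here is a minimal Python sketch that pulls the three examples above apart with a single regular expression. It is only an illustration, not the project's actual parser: the names REF_PATTERN, Reference, and parse_reference are hypothetical, and it assumes each reference sits on its own, that the marker is written exactly as &c, and that the headword is a plain integer.

```python
import re
from dataclasses import dataclass
from typing import Optional

# Hypothetical pattern for old-style references such as
#   experimentum crucis &c (test) 463 [Lat.]
REF_PATTERN = re.compile(
    r"""
    ^\s*
    (?P<term>.+?)\s+                   # the term itself
    &c\.?\s+                           # the reference marker
    (?:\((?P<target>[^)]*)\)\s+)?      # optional term to look for, in parentheses
    (?P<headword>\d+)                  # headword number
    (?:\s+\[(?P<language>[^\]]+)\])?   # optional language tag, not part of the reference
    \s*$
    """,
    re.VERBOSE,
)


@dataclass
class Reference:
    term: str
    target: Optional[str]    # None when the reference names no specific term
    headword: int
    language: Optional[str]  # None when there is no language tag


def parse_reference(text: str) -> Optional[Reference]:
    """Return the parsed reference, or None if the text does not match."""
    match = REF_PATTERN.match(text)
    if match is None:
        return None
    return Reference(
        term=match.group("term"),
        target=match.group("target"),
        headword=int(match.group("headword")),
        language=match.group("language"),
    )


print(parse_reference("experimentum crucis &c (test) 463 [Lat.]"))
# Reference(term='experimentum crucis', target='test', headword=463, language='Lat.')
```

Even this small pattern has to treat the language tag as a separate optional piece, and it would still miss any entry that deviates from these three shapes.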

So the busyness and parsing ambiguity of all this is one layer of the problem, but it is absolutely not the whole problem. There’s at least one more layer: when you’re operating solely on the text at this level, every reference depends on the ordering of headwords, because a headword’s number is just its position in the sequence. (And, in the case of modern editions, the ordering of senses. If I were using one of those, which I am not.)

What I needed was an internal format which did not rely on the order of the headwords or their senses.

What I’ve settled on (as a first pass, at least) is to add some automatically generated metadata to the source text files. Every sense of every headword, if it doesn’t already have one, will be assigned a v4 UUID. And while the references presented to the user might look a lot like the ones in modern editions of Roget’s, internally they will look more like:

# for a hypothetical sense with UUID 79eb4e5a-a858-4d27-bb6e-3e13882ca4b1

[HWORD:79eb4e5a]      # points to a sense
[HWORD:79eb4e5a:TERM] # points to term within a sense

This makes references stable under editing. Headwords and/or senses can be reordered and references will still resolve. This isn’t a complex solution, or an original one, but it’s something that I never got around to working on because of the overwhelming problem of simply trying to fix my source text.
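As a rough illustration of why this works, the Python sketch below assigns v4 UUIDs to senses that lack one and resolves an internal reference by ID rather than by position. Everything in it is an assumption made for the example: the Sense class, assign_ids, and resolve are hypothetical names, and matching the short form 79eb4e5a as a prefix of the full UUID is a guess at how the abbreviated IDs might be handled.

```python
import re
import uuid
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Sense:
    terms: List[str]
    sense_id: str = ""  # empty until an ID has been assigned


def assign_ids(senses: List[Sense]) -> None:
    """Give every sense without an ID a fresh v4 UUID; keep existing IDs."""
    for sense in senses:
        if not sense.sense_id:
            sense.sense_id = str(uuid.uuid4())


# Matches [HWORD:<id>] and [HWORD:<id>:<term>]
INTERNAL_REF = re.compile(r"\[HWORD:(?P<id>[0-9a-f-]+)(?::(?P<term>[^\]]+))?\]")


def resolve(ref: str, senses: List[Sense]) -> Tuple[int, Sense, Optional[str]]:
    """Return (current position, sense, term) for an internal reference."""
    match = INTERNAL_REF.fullmatch(ref)
    if match is None:
        raise ValueError(f"not an internal reference: {ref!r}")
    # Assumption: a shortened ID is treated as a prefix of the full UUID.
    hits = [(i, s) for i, s in enumerate(senses)
            if s.sense_id.startswith(match.group("id"))]
    if len(hits) != 1:
        raise LookupError(f"{ref!r} matched {len(hits)} senses, expected 1")
    position, sense = hits[0]
    term = match.group("term")
    if term is not None and term not in sense.terms:
        raise LookupError(f"term {term!r} not found in the referenced sense")
    return position, sense, term


senses = [
    Sense(["argument"]),
    Sense(["test", "experimentum crucis"],
          sense_id="79eb4e5a-a858-4d27-bb6e-3e13882ca4b1"),
]
assign_ids(senses)  # only the first sense receives a new ID

before = resolve("[HWORD:79eb4e5a:test]", senses)
senses.reverse()    # reorder the senses
after = resolve("[HWORD:79eb4e5a:test]", senses)

# The position changes, but the reference still lands on the same sense.
print(before[0], after[0], before[1] is after[1])
```

One consequence of prefix matching is that a shortened ID must stay unique across the whole text, which is why the sketch refuses to resolve a reference that matches more than one sense.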

I certainly have a lot of work to do, but now I don’t have either of those problems anymore.

Next time, probably: an overview of the processing chain and changes on the horizon.