Published: 2021-03-08

This is the story of trying to fix one problem, in one headword entry in the thesaurus corpus.

Having gotten my 1911 Roget’s out of storage to use as a backstop when judging errors reported by rlint, I decided it was time to start work. The first entry flagged was:

Copy : tests ✘ : Invalid char '-' in sense 3 (N sense 3) term "transcript",
                 attr "copy into a non-visual form"

The message makes it clear what the technical issue here is: currently, attributes/notations on terms are not allowed to contain hyphens. That might need to change, but for now I’m being very conservative in what characters are allowed to appear where. At this early stage I’d rather generate false positives than miss problems.
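
To make the "conservative about characters" idea concrete, here is a minimal sketch of a whitelist check on an attribute string. It is not rlint's actual code: the function name `invalid_attr_chars` and the exact allowed set (ASCII letters, digits, spaces, apostrophes) are assumptions made up for illustration.

```python
import re

# Hypothetical whitelist, NOT rlint's real rule set: ASCII letters,
# digits, spaces, and apostrophes. Anything else gets reported.
ALLOWED_ATTR_CHAR = re.compile(r"[A-Za-z0-9 ']")


def invalid_attr_chars(attr):
    """Return the distinct disallowed characters, in order of first appearance."""
    bad = []
    for ch in attr:
        if not ALLOWED_ATTR_CHAR.fullmatch(ch) and ch not in bad:
            bad.append(ch)
    return bad


print(invalid_attr_chars("copy into a non-visual form"))  # ['-']
```

A whitelist like this errs toward false positives: loosening it later means adding one character to the pattern, while anything unexpected is reported in the meantime.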

But something else jumps out at me: “copy into a non-visual form” doesn’t sound like wording from the 1911 Roget’s. Another moment’s thought leads me to realize, in quick succession, that:

  • “Transcript” is a noun, not a verb, therefore
  • Even if this is valid, it should be attached to “transcribe”
  • Except: to “copy into a non-visual form” would be the opposite of transcription
    • The Latin roots of transcribe mean “across write”
    • In English the word’s definition is “to make a copy in writing”
  • If anything deserves this notation, it’s something like “narration” or “audiobook”

Checking the 1911 shows that this notation does not exist there. Just to be sure, I also check my 1941 edition – which I do not use as a source, because it is still in copyright. I use it as a “second opinion” because it’s far more similar to the 1911 than modern editions are. The notation isn’t there either; it is a Project Gutenberg (PG) addition.

And just like that, the entire entry is now suspect. I’m going to present to you the 1911 and PG entries, in their entirety. First, the 1911 Roget’s:

21. [Result of imitation.] Copy.
N. copy, facsimile, counterpart, effigies, effigy, form, likeness, similitude,
semblance, cast, tracing, ectype; imitation &c. 19; model, representation,
adumbration, study; portrait &c. (representment) 554; resemblance.
    duplicate; transcript, transcription; reflex, reflexion; shadow, echo;
chip of the old block; reprint, reproduction; second edition &c. (repetition) 104;
rechauffe, apograph, fair copy, revise.
    parody, caricature, burlesque, travesty, travestie, paraphrase.
    servile, servile copy, servile imitation; counterfeit &c. (deception) 545;

Adj. faithful; lifelike &c. (similar) 17; close, conscientious.

Here’s the PG:

DESC::Result of imitation
N. copy, facsimile, counterpart, effigies, effigy, form, likeness.
     image, picture, photo, xerox, similitude, semblance, ectype^, photo offset,
electrotype; imitation &c 19; model, representation, adumbration, study;
portrait &c (representation) 554; resemblance.
     duplicate, reproduction; cast, tracing; reflex, reflexion [Brit.], reflection;
shadow, echo.
     transcript [copy into a non-visual form], transcription; recording, scan.
     chip off the old block; reprint, new printing; rechauffe [Fr.]; apograph^,
fair copy.
     parody, caricature, burlesque, travesty, travestie^, paraphrase.
     [copy with some differences] derivative, derivation, modification, expansion,
extension, revision; second edition &c (repetition) 104.
     servile copy, servile imitation; plagiarism, counterfeit, fake &c (deception) 545;
Adj. faithful; lifelike &c (similar) 17; close, conscientious.
     unoriginal, imitative, derivative.

In addition to the issue I talked about earlier, there are also these:

  • Haphazard additions of terms in PG; some worthwhile, some questionable
  • A general breaking-up of the entry in PG, turning what had been subsenses into full-fledged senses
  • “Effigies” is not an English plural here; in the 1911 it is italicized, marking it as a foreign word (Latin, not French), meaning “effigy”
  • Bizarrely, PG did flag rechauffe as French – and this is a delightful usage, with a denotation of “reheated leftovers”: a copy in the sense of the English idioms “warmed over” or “rehashed”
    • Also, it should be réchauffé, but adding Unicode everywhere it’s needed is a whole other nightmare
  • “Pasticcio” isn’t flagged as archaic in the original; it is also italicized, and it is an opera term meaning “pastiche”, with an added overtone of plagiarism
    • It should be flagged as Italian, and “pastiche” should be added
  • “Travestie” is French, not archaic as PG flags it, and I think it should simply be removed. Roget was fond of French cognates, in case you hadn’t noticed.
  • “Apograph” isn’t marked as archaic anywhere, but it definitely should be
    • Also, it means “transcription”; Roget has it in that sense. PG has split the sense at “chip of the old block” but not moved apograph, orphaning it from the English word that shares its meaning
  • If, like me, you were wondering what “revise” is doing there, it turns out to have a noun sense via printing jargon: a proof which includes corrections from an earlier proof
    • That should have a notation/attribute set on it
  • Why aren’t there verb senses listed in either edition?
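
Several of these observations come down to reading PG's inline markers and spotting terms that the 1911 doesn't have. Below is a rough sketch of how that could be mechanized. The marker syntax is inferred only from the PG entry quoted above (a bracketed notation after a term, a trailing caret, "&c" before a cross-reference), and the names `Term` and `parse_terms` are hypothetical; none of this is taken from rlint.

```python
import re
from typing import NamedTuple, Optional


class Term(NamedTuple):
    text: str
    note: Optional[str]  # bracketed notation following the term, if any
    caret: bool          # True if the term carries PG's trailing "^" flag


# Assumed shape: term text, optional "^", optional "[notation]".
TERM_RE = re.compile(
    r"^(?P<text>[^\[\]^]+?)(?P<caret>\^)?\s*(?:\[(?P<note>[^\]]*)\])?$"
)


def parse_terms(line):
    """Split one PG-style line into Term records.

    Simplifications: notations must not contain commas or semicolons,
    and a chunk that starts with a notation (e.g. "[copy with some
    differences] derivative") does not match the pattern and is skipped.
    """
    terms = []
    for chunk in re.split(r"[;,]", line.strip().rstrip(".")):
        chunk = chunk.split("&c")[0].strip()  # drop any cross-reference tail
        match = TERM_RE.match(chunk) if chunk else None
        if match:
            terms.append(
                Term(match["text"], match["note"], match["caret"] is not None)
            )
    return terms


pg = parse_terms(
    "transcript [copy into a non-visual form], transcription; recording, scan."
)
# Terms from the matching 1911 line, typed in by hand for this example.
roget_1911 = {"duplicate", "transcript", "transcription",
              "reflex", "reflexion", "shadow", "echo"}
print([t.text for t in pg if t.text not in roget_1911])  # ['recording', 'scan']
```

This only surfaces candidates. Whether an addition like "recording" is worthwhile, or a notation like the one on "transcript" is wrong, still takes a human with the 1911 open on the desk.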

And I’m sure there are others that I didn’t notice. What did you find?

This is a great illustration of why this work feels absolutely overwhelming at times. There’s just so much when you’re trying to be careful and do the right thing.