December 27th, 2009

digital

Self-misdocumenting code (a 4:47 AM rant, so you get what you get)

The biggest argument for "self-documenting code" I've heard tends to run that "all documentation becomes stale": not only are project changes not represented in the specifications, code is changed out from under comments that refer to that code, and then the comments themselves are wrong. An incorrect comment is, of course, worse than no comment at all, since a future programmer will read it, assume it is correct, and fail to audit the code the comment refers to, drawing dangerously incorrect assumptions about the code. Better to simply code clearly, and avoid relying on comments to make your code make sense.

After two years mired in a pile of crap that was obviously intended to be self-documenting and has just as obviously failed at it, I think I have some fairly strong arguments against this philosophy. Self-documenting code is no less prone to documentation decay than any other form, and once it has decayed, it is even harder to sort out what the code does than it is with bad documentation. I assert that inaccurate documentation is as easy to spot as decayed code.

Lack of architecture documentation leads to a mess of an architecture. An object that was supposed to represent connection settings for one machine became God: it's part static, part instance, with some singleton-related behavior invoked when particular instance methods are called. 93% of the 180+ code files call directly into it, and every variable or interface it exposes is consumed somewhere, if only in one place. It no longer has any coherent function; I can identify about three or four classes it should be split into, but it's going to be a hell of a task to disentangle them. And disentangle them I must, because two of those theoretical classes need a completely different implementation strategy, for reasons mostly related to what happened to the Windows security model. Some of these decisions were justifiable when the project was designed ten years ago, but they should have been refactored out when they became dangerously wrong.

Would documentation have helped this? Possibly. If the class had a defined scope, then maybe the cross-machine driver installation and connection functions would have been identified as not within it, and this wouldn't be nearly so nightmarish to untangle. If there had been real API documentation, and any intent to maintain it, maybe implementors wouldn't have been so cavalier about adding new functionality exposing data that really needed to remain private to the object for a variety of consistency reasons.
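To make "a defined scope" a bit more concrete, here is a rough sketch of the kind of split I have in mind. Python is used purely for illustration, and every name in it (ConnectionSettings, MachineConnection, DriverInstaller, the fields, the methods) is hypothetical; none of it is the real project's API, and the real split would almost certainly look different.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ConnectionSettings:
    """Connection settings for exactly one machine (hypothetical fields).

    In scope: the values needed to reach that machine.
    Out of scope: opening connections, installing drivers, global state.
    """
    host: str
    port: int
    timeout_seconds: float = 30.0


class MachineConnection:
    """Hypothetical: owns the connection lifecycle, not the settings."""

    def __init__(self, settings: ConnectionSettings) -> None:
        self._settings = settings
        self._is_open = False

    def open(self) -> None:
        # A real implementation would talk to the network here;
        # this sketch only flips a flag.
        self._is_open = True

    @property
    def is_open(self) -> bool:
        return self._is_open


class DriverInstaller:
    """Hypothetical: cross-machine driver installation lives here."""

    def __init__(self) -> None:
        self.installed = []

    def install(self, settings: ConnectionSettings, driver_name: str) -> None:
        # Records the request instead of touching a real machine.
        self.installed.append((settings.host, driver_name))
```

The docstring on ConnectionSettings is the part that matters: once "out of scope" is written down next to the class, a new driver-installation method at least has to argue with it. Freezing the dataclass is one cheap way to stop callers from poking at state that was supposed to stay put.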

But that's an assertion that documentation itself would have reined in the project, and I have little hope for that. Still, if documentation had at least been in the code, maybe it wouldn't have taken me several months to trace; the intent of some of these truly mysteriously named classes would have made a little bit of sense. And at least when documentation is wrong, it's usually wrong about something immediately below it; when the name of a function is wrong, I have to trace into the function to discover that it is horrifyingly wrong.

My best example: In a class that can be described as a Collection of Collection of Collection of DataTypeNamedPairEvenThoughItContainsNineProperties, there is an instance method named GetVarPairGroupCollection. One would infer that this is an accessor that returns part of the contents of the object.

The function did return a VarPairGroupCollection, and the one you would expect. But it was also an initialization function that called a context-sensitive mutator on every object enclosed by the class.

I, for one, do not think a function named "Get" should be an O(n²) operation that alters all the data within the class, for which the program will not work if it is not called at a specific time, and which can corrupt program state if it is called during another operation that can occur on a parallel thread and only hasn't because of the grace of the scheduler. (And, to be fair, because the scheduler would have to make some truly ludicrous decisions for it to ever show up.)
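For anyone who wants to see the shape of the problem, here is a stripped-down sketch in Python. It is not the real code: VarPair, resolve, the context argument, and both store classes are stand-ins I made up for illustration, and only the name GetVarPairGroupCollection is taken from the actual project. It also leaves out the quadratic cost and the threading hazard, and shows only the "accessor that mutates" part.

```python
class VarPair:
    """Hypothetical stand-in for the named pair described above."""

    def __init__(self, name, value):
        self.name = name
        self.value = value

    def resolve(self, context):
        # Context-sensitive mutator: rewrites the value in place.
        self.value = f"{context}:{self.value}"


class MisleadingStore:
    """The name promises a read; the body performs a write."""

    def __init__(self, groups):
        self._groups = groups

    def GetVarPairGroupCollection(self, context):
        # Looks like an accessor, but mutates every pair on every call.
        for group in self._groups:
            for pair in group:
                pair.resolve(context)
        return self._groups


class HonestStore:
    """Same behavior, split so that each name says what it does."""

    def __init__(self, groups):
        self._groups = groups
        self._initialized = False

    def initialize_pairs(self, context):
        """Mutates every pair. Call exactly once, before any reads."""
        if self._initialized:
            raise RuntimeError("pairs are already initialized")
        for group in self._groups:
            for pair in group:
                pair.resolve(context)
        self._initialized = True

    def get_var_pair_groups(self):
        """Pure accessor: returns the groups without changing them."""
        return self._groups
```

Call MisleadingStore.GetVarPairGroupCollection("ctx") twice on a pair whose value is "1" and the value ends up as "ctx:ctx:1", which nobody reading the call site would guess. With the split version, the second call to initialize_pairs fails loudly, and get_var_pair_groups can be called as often as you like.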

At least incorrect comments are honest. They are a red flag that the code is untrustworthy. The same is true for incorrect function names (the documentation in self-documenting code), but they are much, much harder to find, and even understanding the variable names in the first place requires some significant degree of the original programmer's mindset, which can be impossible to attain without some other documentation to help.