Documenting Code

July 7, 2016

I'm going to write about something a little different today. In this article, I want to discuss how I think code should be documented internally in corporate environments. That is, suppose you're an engineer working at Cogswell Cogs: how should documentation be done there?

There are three obvious choices for how this can be done:

You can maintain an internal wiki, and ask engineers to keep wiki pages on their projects up to date.
You can use a documentation tool like Sphinx or Godoc and ask engineers to keep those documents up to date. I call this approach "generated documentation", since typically an intermediate format such as Markdown files (or perhaps comments in the code itself) is used to generate HTML or PDF documentation.
You can ask engineers to check in README files in the root of their projects. Note that the file can have an extension, as in README.md or README.rst; it doesn't have to be a plain text file.

How should things actually be done?

Wikis

I believe that wikis are the worst option available for documentation.

The problem inherent to wikis is that they're typically edited in the browser, via an HTML textarea or a WYSIWYG editor. There's nothing wrong with textareas or WYSIWYG editors. What is wrong is that the source of truth for wiki pages is typically not checked into the source code repository. This means that code review is unlikely to include documentation review. It also means that when grepping the code for occurrences of a string, engineers aren't likely to see occurrences in the documentation.
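To make the grepping point concrete, here is a minimal sketch of a repository-wide search written in Python. The project layout, the file names (cogs.py, README.md), and the function name rotate_cog are all hypothetical and exist only for illustration. The point is simply that when documentation is checked in next to the code, a single search turns up both; a wiki page would never appear in these results.

```python
import os
import tempfile


def find_occurrences(root, needle):
    """Return sorted (relative_path, line_number) pairs for lines containing needle."""
    hits = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                with open(path, encoding="utf-8") as handle:
                    for number, line in enumerate(handle, start=1):
                        if needle in line:
                            hits.append((os.path.relpath(path, root), number))
            except (UnicodeDecodeError, OSError):
                continue  # skip binary or unreadable files
    return sorted(hits)


def build_demo_repo(root):
    """Create a tiny hypothetical project: one source file and one README."""
    with open(os.path.join(root, "cogs.py"), "w", encoding="utf-8") as handle:
        handle.write("def rotate_cog(speed):\n    return speed * 2\n")
    with open(os.path.join(root, "README.md"), "w", encoding="utf-8") as handle:
        handle.write("# Cogs\n\nCall `rotate_cog(speed)` to spin a cog.\n")


with tempfile.TemporaryDirectory() as demo_root:
    build_demo_repo(demo_root)
    # One search covers the code and the checked-in documentation alike.
    demo_hits = find_occurrences(demo_root, "rotate_cog")
    for relative_path, number in demo_hits:
        print(f"{relative_path}:{number}")
```

An engineer renaming rotate_cog would see the README hit in the same result list as the code hit, which is what makes it likely that the documentation gets updated in the same change.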

These reasons are why it's fundamentally so difficult to keep wikis up to date with code. It's hard enough to get engineers to document their work in the first place; it's harder still to get them to do it when documentation isn't a part of the normal development workflow.

Generated Documentation

Documentation generation tools like Sphinx, Godoc, Javadoc, Doxygen, etc. are great tools that produce superb documentation. A lot of them have "autodoc" capabilities, a term that's used to describe documentation that is automatically generated by tools that actually analyze the source code and metadata in the code (typically comments formatted in a certain way). This is a powerful feature. Most of the high quality documentation you see for polished software projects is generated using tools of this category. This is also how big companies like Intel, Apple, and Microsoft generate documentation for their external APIs.
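As a rough illustration of the autodoc idea, the following Python sketch builds reference text from a function's signature and its docstring using only the standard library's inspect module. This is not how Sphinx or any of the other tools above is actually implemented; the function gear_ratio and the helper render_docs are hypothetical names invented for this example.

```python
import inspect


def gear_ratio(driver_teeth, driven_teeth):
    """Return the gear ratio of a driven cog relative to its driver.

    Both arguments are tooth counts; driver_teeth must be nonzero.
    """
    return driven_teeth / driver_teeth


def render_docs(functions):
    """Build plain-text reference docs from signatures and docstrings."""
    sections = []
    for function in functions:
        signature = inspect.signature(function)
        # getdoc() strips the docstring's indentation for display.
        doc = inspect.getdoc(function) or "(undocumented)"
        sections.append(f"{function.__name__}{signature}\n{doc}")
    return "\n\n".join(sections)


reference = render_docs([gear_ratio])
print(reference)
```

Real tools do far more (cross-references, HTML output, type information), but the principle is the same: the documentation text lives in the source file, and a program extracts it.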

If you have the energy and wherewithal to maintain documentation in this format, I highly recommend it. However, I would also add that a lot of people don't have this energy. It's not uncommon for engineers to start with an initial burst of energy where they write a lot of great documentation, and then the documentation becomes out of date. Of course, the situation here is better than with a wiki, for the reasons I described earlier. But it's still a problem that has to be watched out for.

The main problem I see with these generated documentation tools is that they have a fairly steep learning curve. Once you're over the curve they work great, probably better than any other tool. But the curve is still there. The problem I've seen personally is that it's hard to maintain this documentation in a company that's growing quickly. You start with some engineers who are passionate about writing great documentation and doing the right thing, and who are willing to overcome this learning curve. A year later, when the team has grown and new engineers (or perhaps engineers from other teams) are contributing to the documented projects, the newcomers may not understand the documentation format and may not keep it up to date. That's why I think that if you use one of these tools, it's imperative to be rather strict about educating engineers on how to use them. If half your engineers don't understand how to use a tool like Sphinx, then half the code contributions won't include documentation updates, and this will lead to out-of-date documentation.

Another pitfall you can run into with these tools is that in some cases the way that documentation and code are mixed can be confusing. If you're using autodoc-style documentation (where documentation is generated from code metadata and comments), then the documentation is difficult to grep, since grepping the docs requires grepping code as well. If you're putting the docs outside of the code, in a dedicated directory (say, docs/), then you have the opposite problem: that directory is outside the regular code flow and is therefore easily missed by people navigating code in their text editors. For this reason, if you use generated documentation tools I think it's critical to have a good search tool set up internally. Engineers relying on command line tools like "grep" are going to miss docs whichever way you configure things, so if you don't have a good internal search engine set up, people are going to have difficulty finding docs.

The last issue here is that if you work at a company that maintains a lot of separate services or projects, it's likely that some of them will go undocumented (simply because there are so many of them). This can create a negative cycle where engineers go to the internal documentation portal, fail to find something documented, and then start assuming in the future that other projects will also be undocumented. This causes people to stop using the internal documentation portal, even in cases where there is documentation! In other words, any lack of consistency here can become a big trust issue. This is not an insurmountable problem, but it's one to be aware of. Again, good internal search tools can help here, since a good search tool will quickly become a source of truth for people.

README Files

The last and most primitive method you can use is a file checked into the top level of a project with a name like README, README.md, or README.rst. While this method is primitive, I'm actually a huge fan of it, and I think it's underappreciated.

The README file convention has been around since at least the early days of the internet; you'll see it, for instance, in old source code tarballs from the 1980s. However, in my mind it's really been popularized by GitHub, which also allows you to check in Markdown or reStructuredText files. On GitHub these Markdown or reStructuredText files are automatically turned into good-looking HTML when you browse a project page. This same convention has been copied elsewhere. For instance, if you use Phabricator it will automatically generate HTML documentation for projects based on a checked-in README.md or README.rst file at the root of a directory.

This convention used by GitHub and Phabricator makes it dead easy for engineers to get good-looking docs. There's literally no configuration necessary: just drop a file with the appropriate name into a directory. It's really easy to get engineers to create these files, because the cognitive overhead is so low. There's almost nothing to learn; certainly less, in any case, than learning a tool like Sphinx. Because this method is so simple, it's a lot easier to get people to use it.

Because the README file convention is to put the file at the root of a directory (typically one at the project root, and occasionally in subdirectories as well), it's nearly impossible to miss. Engineers will see it as they navigate the source code.

Typically the formatting and linking capabilities of a README file are not as extensive as what you'd have with a tool like Sphinx or Doxygen, but you can do a pretty good job. GitHub and Phabricator support features such as syntax highlighting, tables, inline images, and intra-document links, which means that you can actually do quite a lot with this simple documentation format.

If you use README files you don't really need to have a documentation search solution (although it doesn't hurt if you do have one). The reason is that engineers will already be in the habit of looking at code repositories for code they're using, and therefore will have the expectation that they will find documentation alongside the code, in a predictable location.
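Because the location is predictable, checking for documentation is also trivial to automate. The Python sketch below looks for a README at the top of each project directory and reports the projects that lack one. The workspace layout and the project names (billing, sprockets) are hypothetical, and the list of accepted file names is an assumption based on the names mentioned in this article.

```python
import os
import tempfile

# File names assumed to count as a README, per the convention described above.
README_NAMES = ("README", "README.md", "README.rst")


def find_readme(project_dir):
    """Return the path of the README at the top of project_dir, or None."""
    for name in README_NAMES:
        candidate = os.path.join(project_dir, name)
        if os.path.isfile(candidate):
            return candidate
    return None


def undocumented_projects(workspace):
    """List subdirectories of workspace that lack a top-level README."""
    missing = []
    for entry in sorted(os.listdir(workspace)):
        project_dir = os.path.join(workspace, entry)
        if os.path.isdir(project_dir) and find_readme(project_dir) is None:
            missing.append(entry)
    return missing


with tempfile.TemporaryDirectory() as workspace:
    # Two hypothetical projects; only "billing" has a README.
    for project in ("billing", "sprockets"):
        os.mkdir(os.path.join(workspace, project))
    with open(os.path.join(workspace, "billing", "README.md"), "w") as handle:
        handle.write("# Billing\n")
    missing_readmes = undocumented_projects(workspace)
    print(missing_readmes)
```

A check like this could run periodically or in code review tooling, though whether that is worthwhile depends on how your projects are laid out.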

My overarching argument here is that worse is better. README files are dead simple and rather limited, but that same simplicity makes it much easier to get engineers to actually use these files and keep them up to date.

Conclusions

Don't use wikis to document code. Wikis can work well for other things, but if you ask engineers to keep code documented on a wiki you'll find that the wiki quickly becomes out of date and misleading.

Documentation generation tools like Sphinx can produce beautiful, high quality documents. But beware of the steep learning curve: it can cause some engineers to not document their code! If you do use a documentation generation tool, make sure that you have strong internal training practices to get new engineers up to speed on how to use the documentation tools. Likewise, make sure you've thought of how engineers will search the documentation, and make sure the search works well.

README files can be a good, practical alternative to generated documentation. Because they're so simple, everyone will understand how they work, where they live, and how to edit them. Modern formats like Markdown and reStructuredText mean that you can use README files and still get beautiful generated HTML.