I’m going to write about something a little different today. In this article, I
want to discuss how I think code should be documented internally in corporate
environments. That is, suppose you’re an engineer working at Cogswell Cogs. I
want to talk about how I think documentation should be done at Cogswell Cogs.
There are three obvious choices for how this can be done:
- You can maintain an internal wiki, and ask engineers to keep wiki pages on
their projects up to date.
- You can use a documentation tool like Sphinx or Godoc and ask engineers to
keep those documents up to date. I call this approach “generated
documentation”, since typically an intermediate format such as Markdown files
(or perhaps comments in the code itself) is used to generate HTML or PDF
documentation.
- You can ask engineers to check in README files in the root of their projects.
Note that the file can have an extension, like README.md or README.rst; it
doesn’t have to be a plain text file.
How should things actually be done?
Wikis
I believe that wikis are the worst option available for documentation.
The problem inherent to wikis is that they’re typically edited in the browser,
via an HTML textarea or a WYSIWYG editor. There’s nothing wrong with textareas
or WYSIWYG editors. What is wrong is that the source of truth for wiki pages is
typically not checked into the source code repository. This means that code
review is unlikely to include documentation review. It also means that when
grepping the code for occurrences of a string, engineers aren’t likely to see
occurrences in the documentation.
These reasons are why it’s fundamentally so difficult to keep wikis up to date
with code. It’s hard enough to get engineers to document their work in the first
place; it’s harder still to get them to do it when documentation isn’t a part of
the normal development workflow.
Generated Documentation
Documentation generation tools like Sphinx, Godoc, Javadoc, Doxygen, etc. are
great tools that produce superb documentation. A lot of them have “autodoc”
capabilities, a term that’s used to describe documentation that is automatically
generated by tools that actually analyze the source code and metadata in the
code (typically comments formatted in a certain way). This is a powerful
feature. Most of the high quality documentation you see for polished software
projects is generated using tools of this category. This is also how big
companies like Intel, Apple, and Microsoft generate documentation for their
external APIs.
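To make the “autodoc” idea concrete, here is a minimal sketch in Python that builds a plain-text documentation entry from a function’s signature and docstring using the standard inspect module. The names cog_ratio and render_doc are hypothetical, made up for this illustration. This is not how Sphinx or any other tool is implemented; it only shows the kind of code metadata such tools read.

```python
import inspect


def cog_ratio(driver_teeth: int, driven_teeth: int) -> float:
    """Return the gear ratio between two cogs.

    The ratio is driven_teeth divided by driver_teeth.
    """
    return driven_teeth / driver_teeth


def render_doc(func) -> str:
    """Build a plain-text documentation entry from a function's metadata."""
    # The signature and the docstring both come straight from the code,
    # so the entry cannot drift from the function it describes.
    signature = inspect.signature(func)
    docstring = inspect.getdoc(func) or "(undocumented)"
    return f"{func.__name__}{signature}\n\n{docstring}\n"


if __name__ == "__main__":
    print(render_doc(cog_ratio))
```

Real documentation generators do far more than this (cross-references, themes, multiple output formats), but the principle is the same: the documentation lives next to the code it describes.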
If you have the energy and wherewithal to maintain documentation in this format,
I highly recommend it. However, I would also add that a lot of people don’t
have this energy. It’s not uncommon for engineers to start with an initial burst
of energy where they write a lot of great documentation, and then the
documentation becomes out of date. Of course, the situation here is better than
with a wiki, for the reasons I described earlier. But it’s still a problem that
has to be watched out for.
The main problem I see with these generated documentation tools is that they
have a somewhat steep learning curve. Once you’re over the curve they work
great, probably better than any other tool. But the curve is still there. The
problem I’ve seen personally is that it’s hard to maintain this documentation in
a company that’s growing quickly. You start with some engineers who are
passionate about writing great documentation and doing the right thing, who are
willing to overcome this learning curve. A year later, when the team has grown
and new engineers (or perhaps engineers from other teams) are contributing to
the documented projects, the newcomers may not understand the documentation
format and may not keep it up to date. That’s why I think if you use one of
these tools it’s imperative to be rather strict about educating engineers on how
to use these tools. If half your engineers don’t understand how to use a tool
like Sphinx then half the code contributions won’t include documentation
updates, and this will lead to out of date documentation.
Another pitfall you can run into with these tools is that in some cases the way
that documentation and code are mixed can be confusing. If you’re using autodoc
style documentation (where documentation is generated from code metadata and
comments) then the documentation is difficult to grep, since grepping the docs
requires grepping code as well. If you’re putting the docs outside of the code,
in a dedicated directory, then the opposite problem is the case: it’s easy for
engineers to miss that directory. The reason is that if you have your docs in a
dedicated directory (say, docs/), that directory is outside the regular code
flow and is therefore easily missed by people navigating code in their text
editors. For this reason, if you use generated documentation tools I think it’s
critical to have a good search tool set up internally. Engineers relying on
command line tools like “grep” are going to miss docs whichever way you
configure things, so if you don’t have a good internal search engine set up then
people are going to have difficulty finding docs.
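As a toy illustration of searching docs and code together, the sketch below walks a project tree and reports every line containing a term, labelling each hit as “doc” or “code”. The function name search_tree and the set of documentation extensions are assumptions made for this example. A real internal search engine would index content rather than scan files on every query.

```python
import os

# Assumed set of documentation file types; adjust for your own projects.
DOC_EXTENSIONS = {".md", ".rst", ".txt"}


def search_tree(root: str, term: str) -> list[tuple[str, str, int]]:
    """Return (relative_path, kind, line_number) for each line containing term."""
    hits = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in sorted(filenames):
            path = os.path.join(dirpath, name)
            extension = os.path.splitext(name)[1]
            is_doc = extension in DOC_EXTENSIONS or name.upper().startswith("README")
            kind = "doc" if is_doc else "code"
            try:
                with open(path, encoding="utf-8") as handle:
                    for line_number, line in enumerate(handle, start=1):
                        if term in line:
                            relative = os.path.relpath(path, root)
                            hits.append((relative, kind, line_number))
            except (UnicodeDecodeError, OSError):
                continue  # skip binary or unreadable files
    return hits
```

Calling search_tree(".", "max_cogs") from a project root would list matches in both the docs/ directory and the source files, so neither side gets overlooked.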
The last issue here is related to the fact that if you work at a company that
maintains a lot of separate services or projects it’s likely that some of those
services or projects will go undocumented (simply because there are so many of
them). This can create a negative cycle where engineers go to the internal
documentation portal, fail to find something documented, and then start assuming
in the future that other projects will also be undocumented. This causes people
to stop using the internal documentation portal—even in cases where there is
documentation! In other words, if there’s any lack of consistency here then it
can become a big trust issue. This is not an insurmountable problem, but it’s
one to be aware of. Again, good internal search tools can help here, since a
good search tool will quickly become a source of truth for people.
README Files
The last and most primitive method you can use is a file checked into the top
level of a project with a name like README, README.md, or README.rst.
While this method is primitive, I’m actually a huge fan of it, and I think it’s
underappreciated.
The README file convention has been around since at least the early days of the
internet; you’ll see it, for instance, in old source code tarballs from the
1980s. However, in my mind it’s really been popularized by GitHub, which also
allows you to check in Markdown or reStructuredText files. On GitHub these
Markdown or reStructuredText files are automatically turned into good looking
HTML when you browse a project page. This same convention has been copied
elsewhere. For instance, if you use Phabricator it will automatically generate
HTML documentation for projects based on a checked in README.md or README.rst
file at the root of a directory.
This convention used by GitHub and Phabricator makes it dead easy for
engineers to get good looking docs. There’s literally no configuration
necessary—just drop a file with the appropriate name into a directory. It’s really easy
to get engineers to create these files, because the semantic overhead is so low.
There’s almost nothing to learn; certainly less, in any case, than learning a
tool like Sphinx. Because this method is so simple, it’s a lot easier to get
people to use it.
Because the README file convention is to put the file at the root of directories
(typically one at the project root, and occasionally in subdirectories as well)
it’s impossible to not notice this file. Engineers will have to see it as they
navigate the source code.
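Because the location is so predictable, checking for it is easy to automate. The sketch below lists which project directories under a root lack a README. It assumes a layout with one project per immediate subdirectory, and the names find_readme and projects_missing_readme are hypothetical, chosen for this example.

```python
import os
from typing import Optional

# The file names mentioned in this article; extend the tuple if you use others.
README_NAMES = ("README", "README.md", "README.rst")


def find_readme(directory: str) -> Optional[str]:
    """Return the name of the first README found in directory, or None."""
    for name in README_NAMES:
        if os.path.isfile(os.path.join(directory, name)):
            return name
    return None


def projects_missing_readme(root: str) -> list[str]:
    """Return the immediate subdirectories of root that have no README."""
    missing = []
    for entry in sorted(os.listdir(root)):
        path = os.path.join(root, entry)
        if os.path.isdir(path) and find_readme(path) is None:
            missing.append(entry)
    return missing
```

A check like this could run as part of code review tooling, though whether that is worthwhile depends on how your repositories are organized.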
Typically the formatting and linking capabilities of a README file are not as
extensive as what you’d have with a tool like Sphinx or Doxygen, but you can do
a pretty good job. GitHub and Phabricator support features such as syntax
highlighting, tables, inline images, and intra-document links, which means that
you can actually do quite a lot with this simple documentation format.
If you use README files you don’t really need to have a documentation search
solution (although it doesn’t hurt if you do have one). The reason is that
engineers will already be in the habit of looking at code repositories for code
they’re using, and therefore will have the expectation that they will find
documentation alongside the code, in a predictable location.
The prevailing argument I have here is that worse is better. README files are
dead simple and rather limited, but that same simplicity makes these files much
easier to get engineers to actually use and keep up to date.
Conclusions
Don’t use wikis to document code. Wikis can work well for other things, but if
you ask engineers to keep code documented on a wiki you’ll find that the wiki
quickly becomes out of date and misleading.
Documentation generation tools like Sphinx can produce beautiful, high quality
documents. But beware of the steep learning curve: it can cause some engineers
to not document their code! If you do use a documentation generation tool, make
sure that you have strong internal training practices to get new engineers up to
speed on how to use the documentation tools. Likewise, make sure you’ve thought
of how engineers will search the documentation, and make sure the search works
well.
README files can be a good, practical alternative to generated documentation.
Because they’re so simple everyone will understand how they work, where they
live, and how to edit them. Modern formats like Markdown and reStructuredText
mean that you can use README files and still get beautiful generated HTML.