Today I wrote some unit tests for the code that does the static HTML/XML
generation for this blog. I was motivated to do this after my friend
James Brown pointed out some bugs he had noticed
on the site.
To add the tests, I had to significantly refactor things. Previously the whole
thing was a single 251-line Python script. I refactored it into an actual
Python module with separate components, created a console_scripts entry point,
created a requirements.txt file, set up pytest, and so on. The tests use lxml
and XPath queries to validate the blog content, checking which things should
and should not be present in the generated files. All in all, the refactored
code is a lot easier to test and reason about, but it's also a lot more
complicated, which is a bit unfortunate.
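To give a flavor of what those lxml/XPath checks look like, here is a minimal sketch of one such assertion. The function name, the file contents, and the URLs are hypothetical, not the actual test code:

```python
from io import BytesIO

from lxml import etree


def assert_not_linked(index_html: bytes, hidden_url: str) -> None:
    """Assert that the generated index page contains no link to a hidden page."""
    tree = etree.parse(BytesIO(index_html), etree.HTMLParser())
    # Collect the href attribute of every anchor tag in the document.
    hrefs = tree.xpath("//a/@href")
    assert hidden_url not in hrefs, f"{hidden_url} is linked from the index!"


# Example usage with an inline document standing in for the generated index:
page = b'<html><body><a href="/public-post.html">post</a></body></html>'
assert_not_linked(page, "/secret-page.html")
```

A test like this would fail loudly if a refactor ever caused a hidden page to leak back into the index.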
The reason I expended this considerable amount of work instead of just
dropping in the original one-line fix is what one might consider a security
or privacy bug. A while back I had written some content that I wanted to share
with friends, but that I didn’t want linked anywhere or crawled. I figured a
clever solution here would be to reuse the blog generation stuff I already have
here so I’d get the nice CSS and whatnot, but just add a mode that would cause
these pages to not be linked from the index page. I never created a
robots.txt file for these pages since by nature of its existence such a file
publicizes the secret URLs.
This all worked great, except for one little bug. When I generate the static
site content, I also generate a file /index.rss which is picked up
by RSS readers for feed syndication. The code generating the RSS file didn’t
know about the hidden page feature, so these hidden pages ended up in the RSS
feed. I didn’t notice this since I don’t subscribe to the RSS feed for my own
site. As a result of this bug, not only was the content visible to people
browsing via RSS, it was also actually indexed by Googlebot. I was able to
confirm this by doing a Google query with
my-specific-search-term. Interestingly, these pages were not indexed by Yahoo
or Bing which suggests to me that the crawling backend for Google is unified
with their RSS crawler, whereas the same is not true of Yahoo/Bing.
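The root fix amounts to making the RSS generator aware of the hidden-page feature, so hidden pages are filtered out before the feed is written. A minimal sketch of that idea; the Page class and the hidden flag are assumptions for illustration, not the actual code:

```python
from dataclasses import dataclass


@dataclass
class Page:
    title: str
    url: str
    hidden: bool = False  # True for pages that must not be linked or syndicated


def rss_items(pages: list[Page]) -> list[Page]:
    """Return only the pages that should appear in the generated RSS feed."""
    return [p for p in pages if not p.hidden]


pages = [
    Page("Public post", "/public.html"),
    Page("Secret page", "/secret.html", hidden=True),
]
# Only the public page makes it into the feed.
assert [p.url for p in rss_items(pages)] == ["/public.html"]
```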
Besides fixing the root bug, all pages I generate in this way (i.e. not linked
from the main index page) now specifically use the meta noindex feature,
just in case they are ever linked to again. This is functionally similar to a
robots.txt file but doesn’t publicize the URLs. I also registered my site with
the Google webmaster tools and explicitly requested that they take down the URLs
I didn’t want indexed.
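For reference, the tag in question is <meta name="robots" content="noindex">, placed in each hidden page's <head>. A hedged sketch of injecting it with lxml during generation; the function name is hypothetical:

```python
from io import BytesIO

from lxml import etree


def add_noindex(html: bytes) -> bytes:
    """Insert <meta name="robots" content="noindex"> into the page's <head>."""
    tree = etree.parse(BytesIO(html), etree.HTMLParser())
    head = tree.xpath("//head")[0]
    # SubElement appends the new meta tag as the last child of <head>.
    etree.SubElement(head, "meta", name="robots", content="noindex")
    return etree.tostring(tree, method="html")


out = add_noindex(b"<html><head><title>t</title></head><body></body></html>")
assert b'name="robots"' in out and b'content="noindex"' in out
```

Unlike robots.txt, this marks each page individually, so nothing has to enumerate the secret URLs in a public file.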
All is good now. I guess the moral of the story is that for any program that is
even remotely interesting, it’s worth spending a bit of time to write tests. And
hat tip to James for reporting the issue in the first place.