Whoops

Today I wrote some unit tests for the code that does the static HTML/XML generation for this blog. I was motivated to do this after my friend James Brown pointed out some bugs he had noticed on the site.

To add the tests, I had to significantly refactor things. Previously the whole thing was a single 251-line Python script. I refactored it into an actual Python module with separate components, created a setup.py with console_scripts entry points, added a requirements.txt, set up pytest, and so on. The tests validate a bunch of things, doing crazy things with lxml and XPath queries to check what should and should not be present in the generated files. All in all, the refactored code is a lot easier to test and reason about, but it's also a lot more complicated, which is a bit unfortunate.
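
To give a flavor of the lxml/XPath validation, here's a minimal sketch of one such test. It runs against a hard-coded bit of HTML so the example stands alone; the real tests run this same kind of XPath assertion against the actual generated files, and the specific page names here are made up.

    import lxml.html

    def test_hidden_page_not_linked_from_index():
        # In the real test suite this HTML would be the generated index page;
        # it's inlined here so the example is self-contained.
        index_html = """
        <html><body>
          <a href="/some-public-post.html">some public post</a>
        </body></html>
        """
        doc = lxml.html.fromstring(index_html)
        hrefs = doc.xpath("//a/@href")
        # Public posts should be linked from the index, hidden ones should not.
        assert "/some-public-post.html" in hrefs
        assert "/some-hidden-post.html" not in hrefs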

The reason I expended this considerable amount of work instead of just dropping in the original one-liner fix is what one might consider a security or privacy bug. A while back I had written some content that I wanted to share with friends, but that I didn't want linked anywhere or crawled. I figured a clever solution would be to reuse the blog generation stuff I already have so I'd get the nice CSS and whatnot, but add a mode that would cause these pages to not be linked from the index page. I never created a robots.txt file for these pages since, by its very existence, such a file publicizes the secret URLs.

This all worked great, except for one little bug. When I generate the static site content, I also generate a file /index.rss which is picked up by RSS readers for feed syndication. The code generating the RSS file didn't know about the hidden page feature, so the hidden pages ended up in the RSS feed. I didn't notice this since I don't subscribe to my own site's RSS feed. As a result of this bug, not only was the content visible to people browsing via RSS, it was also indexed by Googlebot. I was able to confirm this by doing a Google query for site:eklitzke.org my-specific-search-term. Interestingly, these pages were not indexed by Yahoo or Bing, which suggests to me that Google's crawling backend is unified with their RSS crawler, whereas the same is not true of Yahoo/Bing.
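
The fix itself is about as simple as bugs get: the RSS code path just has to apply the same "is this post hidden?" filter that the index page already applies. Roughly like this, where the Post type and the hidden flag are illustrative stand-ins rather than the generator's actual data model:

    import collections

    # Illustrative stand-in for however the generator actually models posts.
    Post = collections.namedtuple("Post", ["title", "url", "hidden"])

    def rss_posts(posts):
        # The index page generation already skipped hidden posts; the bug was
        # that the RSS generation didn't, so it needs the same filter.
        return [p for p in posts if not p.hidden]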

Besides fixing the root bug, all pages I generate in this way (i.e. not linked from the main index page) now specifically use the meta noindex feature, just in case they are ever linked to again. This is functionally similar to a robots.txt file but doesn't publicize the URLs. I also registered my site with Google Webmaster Tools and explicitly requested that they take down the URLs I didn't want indexed.
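
The tag in question is the standard robots meta directive, and checking for it is the same kind of lxml/XPath assertion as before. Here's a self-contained sketch of what such a check could look like, with the page HTML inlined rather than taken from real generator output:

    import lxml.html

    def test_hidden_page_has_noindex():
        # Inlined example of a hidden page; a real test would load the
        # generated file instead.
        hidden_html = """
        <html>
          <head><meta name="robots" content="noindex"></head>
          <body>secret stuff</body>
        </html>
        """
        doc = lxml.html.fromstring(hidden_html)
        robots = doc.xpath('//meta[@name="robots"]/@content')
        assert "noindex" in robots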

All is good now. I guess the moral of the story is that for any program that is even remotely interesting, it's worth spending a bit of time to write tests. And hat tip to James for reporting the issue in the first place.