Indexing Git Repos With Hound

I’m very religious about creating git repositories for code I write, even for small one-off scripts and projects. Storage is cheap, so why not? For things that might be useful to other people I tend to use GitHub, and for personal projects I host my own git repos. Of course, checking in code is just half the battle: committing code you write isn’t useful unless you can actually find it later. I have over 200 repos spread between GitHub and what I host locally, which means I often know that I’ve committed some useful code, but I can’t quite remember where I put it.

Today I set up Hound, a code indexing tool written by Etsy. It’s easy to set up and I’m happy with it so far. I configured mine to index all of my private git repos (that I host myself), plus all my GitHub repos. This means I’m effectively indexing all of the code I write, no matter where it is. I’m going to explain how I set up my Hound instance, which will hopefully help anyone else who’s interested in indexing their code this way.

Setting Up The Hound Indexer

Configuring Hound is pretty easy: you just create a config.json file that lists all of your repositories. Since I have a lot of repos (200+) I wanted to do this programmatically. GitHub has an API that lets you list your own repos, and my private repos exist locally on my filesystem, which makes them easy to enumerate. This is a little bit specific to how I have my private repos set up, but here’s the script I wrote:
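Before automating it, it helps to see what Hound expects: config.json is a few top-level settings plus a repos map from repo name to URL. A minimal hand-written version (the repo names and URLs here are placeholders) looks something like this:

```json
{
    "max-concurrent-indexers": 1,
    "dbpath": "data",
    "repos": {
        "dotfiles": {
            "url": "git@localhost:dotfiles.git"
        },
        "some-project": {
            "url": "https://github.com/yourname/some-project"
        }
    }
}
```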

#!/usr/bin/python3
#
# Generate hound config.

import argparse
import json
import os
import re
import sys
import urllib.request
import urllib.parse
from typing import Dict, Any, TextIO

GIT_RE = re.compile(r'(\w.*)\.git$')

DEFAULT_SETTINGS = {
    'max-concurrent-indexers': 1,
    'dbpath': 'data',
}


def dump_file(settings: Dict[str, Any], fileobj: TextIO) -> None:
    json.dump(settings, fileobj, indent=4, sort_keys=True)
    fileobj.write('\n')


def git_uri(fullpath: str, basepath: str) -> str:
    shortpath = fullpath[len(basepath):].lstrip('/')
    return 'git@localhost:' + shortpath


def get_localrepos(basepath: str):
    for dirpath, dirnames, filenames in os.walk(basepath):
        for d in dirnames:
            m = GIT_RE.match(d)
            if not m:
                continue
            name, = m.groups()
            yield name, git_uri(os.path.join(dirpath, d), basepath)


def get_githubrepos(username: str, page=None, per_page=100):
    path = 'https://api.github.com/users/{}/repos'.format(username)
    kw = {'per_page': per_page}
    if page is not None:
        kw['page'] = page
        next_page = page + 1
    else:
        next_page = 2
    query = urllib.parse.urlencode(kw)
    with urllib.request.urlopen('{}?{}'.format(path, query)) as resp:
        data = json.loads(resp.read().decode('utf-8'))
        if data:
            for repo in data:
                yield repo['name'], repo['html_url']
            yield from get_githubrepos(username, next_page, per_page=per_page)


def main():
    parser = argparse.ArgumentParser()
    parser.add_argument(
        '-r',
        '--repo-dir',
        default='/home/git',
        help='Directory to search for local repositories')
    parser.add_argument('-o', '--output', help='Where to emit hound config')
    parser.add_argument(
        '-u', '--username', default='eklitzke', help='GitHub username')
    args = parser.parse_args()

    settings = DEFAULT_SETTINGS.copy()
    settings['repos'] = repos = {}

    def addrepos(iterable):
        for name, url in iterable:
            assert name not in repos
            repos[name] = {'url': url}

    addrepos(get_localrepos(os.path.abspath(args.repo_dir)))
    addrepos(get_githubrepos(args.username))

    if args.output:
        with open(args.output, 'w') as outfile:
            dump_file(settings, outfile)
    else:
        dump_file(settings, sys.stdout)


if __name__ == '__main__':
    main()
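As a sanity check, the local-discovery half of the script can be exercised against a throwaway directory tree. This standalone snippet mirrors the logic of get_localrepos and git_uri (the repo names here are made up):

```python
import json
import os
import re
import tempfile

GIT_RE = re.compile(r'(\w.*)\.git$')


def local_repos(basepath):
    # Mirror the script's walk: any directory named <name>.git becomes a
    # repo entry pointing at an SSH remote on localhost.
    for dirpath, dirnames, _ in os.walk(basepath):
        for d in dirnames:
            m = GIT_RE.match(d)
            if m:
                short = os.path.join(dirpath, d)[len(basepath):].lstrip('/')
                yield m.group(1), 'git@localhost:' + short


# Build a fake /home/git layout: one top-level repo, one nested repo.
base = tempfile.mkdtemp()
os.makedirs(os.path.join(base, 'scratch.git'))
os.makedirs(os.path.join(base, 'tools', 'dotfiles.git'))

repos = {name: {'url': url} for name, url in local_repos(base)}
print(json.dumps(repos, indent=4, sort_keys=True))
```

The nested repo gets its relative path preserved in the SSH URI (git@localhost:tools/dotfiles.git), which is what lets Hound clone it over localhost.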

Originally I was using file:// URIs for my local repos, since there’s really no reason to clone a repo over SSH on localhost. However, the Hound docs say that file:// URIs don’t work with the index refresher, hence the SSH remote URIs. For this purpose I created a passphrase-less SSH key with read-only access to my local git user.
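Setting up such a key might look like the following sketch; the key path, comment, and the authorized_keys restriction are my suggestions, not anything Hound requires (the restrict option needs OpenSSH 7.2 or later):

```shell
# Create a dedicated, passphrase-less key pair for Hound to fetch with.
ssh-keygen -t ed25519 -N '' -C hound-readonly -f ./hound_readonly

# In the git user's ~/.ssh/authorized_keys, the matching line can be
# restricted so the key only works for git operations, e.g.:
#   restrict,command="git-shell -c \"$SSH_ORIGINAL_COMMAND\"" ssh-ed25519 AAAA... hound-readonly

ls -l hound_readonly hound_readonly.pub
```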

After generating a config, run houndd from the directory that contains it. If you set everything up correctly, you should see it start indexing your repos. There are a lot of config options you can play with, so feel free to experiment on your own!

Setting Up Nginx

I don’t want everyone in the world to have access to my Hound instance, since it has code I consider private. Therefore I’m protecting access to Hound using Nginx with HTTP Basic Auth. That’s good enough for me, although in an “enterprise” deployment you’d probably want something a little more sophisticated.

Apache comes with a tool called htpasswd that can be used to create a simple basic auth config in a standard format that’s understood by Nginx. Use that to generate a config file with a single user and a password of your choosing (I usually create mine using pwgen). This file is typically named .htpasswd by convention, but feel free to name it whatever you want. Once you have that set up, the Nginx config should look something like this:
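If you’d rather script the password file than run htpasswd interactively, openssl can emit a compatible entry; Nginx understands the Apache MD5 (apr1) scheme used here. The user name, password, and output path are all placeholders:

```shell
# Write a single-user basic auth file. The apr1 hash is salted, so the
# exact output differs on every run.
printf 'hound:%s\n' "$(openssl passwd -apr1 's3cretpassword')" > .htpasswd
cat .htpasswd
```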

server {
    listen 80;
    server_name hound.yourdomain.com;

    charset utf-8;

    location / {
        auth_basic            "Hound Search";
        auth_basic_user_file  /path/to/.htpasswd;
        proxy_pass            http://127.0.0.1:6080;
    }
}

Make sure to replace hound.yourdomain.com with the domain you want to use, and update the path for the .htpasswd file. I’m proxying to port 6080, which is the default port used by Hound.

Setting Up SSL

Since HTTP Basic Auth sends passwords in cleartext, I would recommend setting up an HTTPS certificate using Let’s Encrypt if you have any remotely sensitive data. This is what I am doing for my Hound instance. Use certbot to generate an SSL certificate, and then your Nginx config will look something like this:

server {
    listen 80;
    server_name hound.yourdomain.com;
    return 301 https://$server_name$request_uri;
}

server {
    # SSL configuration
    #
    listen 443 ssl;
    server_name hound.yourdomain.com;

    ssl_certificate           /etc/letsencrypt/live/hound.yourdomain.com/fullchain.pem;
    ssl_certificate_key       /etc/letsencrypt/live/hound.yourdomain.com/privkey.pem;
    ssl_prefer_server_ciphers on;
    ssl_protocols             TLSv1.2 TLSv1.3;
    ssl_ciphers               'EECDH+AESGCM:EDH+AESGCM:AES256+EECDH:AES256+EDH';
    ssl_session_cache         shared:SSL:15m;
    ssl_session_timeout       10m;

    charset utf-8;

    location / {
        auth_basic           "Hound Search";
        auth_basic_user_file /path/to/.htpasswd;
        proxy_pass           http://127.0.0.1:6080;
    }
}

Wrapping Things Up

To fully complete this you should set up a cron job to regenerate the Hound config periodically (so it will pick up new repos), and a systemd service file for houndd. I won’t cover these topics in depth, since I’ve written about them elsewhere on my blog. I run houndd as a systemd “user” unit, so it doesn’t need to be installed by root or require special privileges. If you want to do the same, make sure you’ve run loginctl enable-linger username for the username Hound runs as.
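For concreteness, those two pieces might look something like this; the paths, script name, and timing are placeholders for whatever your setup uses:

```
# crontab entry: regenerate the config nightly, then restart houndd
17 3 * * * /home/hound/bin/gen_hound_config.py -o /home/hound/config.json && systemctl --user restart hound

# ~/.config/systemd/user/hound.service
[Unit]
Description=Hound code search daemon

[Service]
WorkingDirectory=/home/hound
ExecStart=/home/hound/bin/houndd -conf /home/hound/config.json
Restart=on-failure

[Install]
WantedBy=default.target
```

After dropping the unit file in place, systemctl --user enable --now hound starts it and makes it persist across reboots (given the enable-linger setting above).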