I'm very religious about creating git repositories for code I write, even for small one-off scripts and projects. Storage is cheap, so why not? For things that might be useful to other people I tend to use GitHub, and for personal projects I host my own git repos. Of course, checking in code is just half the battle: committing the code you write isn't useful unless you can actually find it later. I have over 200 repos spread between GitHub and what I host locally, which means I often know that I've committed some useful code, but I can't quite remember where I put it.
Today I set up Hound, a code indexing tool written by Etsy. It's easy to set up and I'm happy with it so far. I configured mine to index all of my private git repos (that I host myself), plus all my GitHub repos. This means I'm effectively indexing all of the code I write, no matter where it is. I'm going to explain how I set up my Hound instance, which will hopefully help anyone else who's interested in indexing their code this way.
Setting Up The Hound Indexer
Configuring Hound is pretty easy: you just create a `config.json` file that lists all of your repositories. Since I have a lot of repos (200+) I wanted to do this programmatically. GitHub has an API that lets you list your own repos, and my private repos exist locally on my filesystem, which makes them easy to enumerate. This is a little bit specific to how I have my private repos set up, but here's the script I wrote:
```python
#!/usr/bin/python3
#
# Generate hound config.

import argparse
import json
import os
import re
import sys
import urllib.request
import urllib.parse
from typing import Dict, Any, TextIO

GIT_RE = re.compile(r'(\w.*)\.git$')

DEFAULT_SETTINGS = {
    'max-concurrent-indexers': 1,
    'dbpath': 'data',
}


def dump_file(settings: Dict[str, Any], fileobj: TextIO) -> None:
    json.dump(settings, fileobj, indent=4, sort_keys=True)
    fileobj.write('\n')


def git_uri(fullpath: str, basepath: str) -> str:
    shortpath = fullpath[len(basepath):].lstrip('/')
    return 'git@localhost:' + shortpath


def get_localrepos(basepath: str):
    # Walk the base directory looking for bare repos named "<name>.git".
    for dirpath, dirnames, filenames in os.walk(basepath):
        for d in dirnames:
            m = GIT_RE.match(d)
            if not m:
                continue
            name, = m.groups()
            yield name, git_uri(os.path.join(dirpath, d), basepath)


def get_githubrepos(username: str, page=None, per_page=100):
    # Page through the GitHub API until it returns an empty result.
    path = 'https://api.github.com/users/{}/repos'.format(username)
    kw = {'per_page': per_page}
    if page is not None:
        kw['page'] = page
        next_page = page + 1
    else:
        next_page = 2
    query = urllib.parse.urlencode(kw)
    with urllib.request.urlopen('{}?{}'.format(path, query)) as resp:
        body = resp.read()
    if isinstance(body, bytes):
        body = body.decode('utf-8')
    data = json.loads(body)
    if data:
        for repo in data:
            yield repo['name'], repo['html_url']
        yield from get_githubrepos(username, next_page, per_page=per_page)


def main():
    parser = argparse.ArgumentParser()
    parser.add_argument(
        '-r',
        '--repo-dir',
        default='/home/git',
        help='Directory to search for local repositories')
    parser.add_argument('-o', '--output', help='Where to emit hound config')
    parser.add_argument(
        '-u', '--username', default='eklitzke', help='GitHub username')
    args = parser.parse_args()

    settings = DEFAULT_SETTINGS.copy()
    settings['repos'] = repos = {}

    def addrepos(iterable):
        for name, url in iterable:
            assert name not in repos
            repos[name] = {'url': url}

    addrepos(get_localrepos(os.path.abspath(args.repo_dir)))
    addrepos(get_githubrepos(args.username))

    if args.output:
        with open(args.output, 'w') as outfile:
            dump_file(settings, outfile)
    else:
        dump_file(settings, sys.stdout)


if __name__ == '__main__':
    main()
```
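For reference, the script emits a config along these lines; the repository names and URLs below are placeholders, not my actual repos:

```json
{
    "dbpath": "data",
    "max-concurrent-indexers": 1,
    "repos": {
        "dotfiles": {
            "url": "git@localhost:dotfiles.git"
        },
        "somerepo": {
            "url": "https://github.com/eklitzke/somerepo"
        }
    }
}
```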
Originally I was using `file://` URIs for my local repos, since there's really no reason to clone a repo using SSH over localhost. However, the Hound docs say that `file://` URIs don't work with the index refresher, hence the SSH remote URIs. I created an SSH key with no passphrase that has read-only access to my local `git` user for this purpose.
After generating a config you just need to run `houndd` in the directory with the config. You should see it indexing repos if you set everything up correctly. There are a lot of config options you can play with, so feel free to experiment on your own!
Setting Up Nginx
I don't want everyone in the world to have access to my Hound instance, since it has code I consider private. Therefore I'm protecting access to Hound using Nginx with HTTP Basic Auth. That's good enough for me, although in an "enterprise" deployment you'd probably want something a little more sophisticated.
Apache comes with a tool called `htpasswd` that can be used to create a simple basic auth config in a standard format that's understood by Nginx. Use that to generate a config file with a single user and a password of your choosing (I usually create mine using `pwgen`). This file is typically named `.htpasswd` by convention, but feel free to name it whatever you want. Once you have that set up, the Nginx config should look something like this:
```nginx
server {
    listen 80;
    server_name hound.yourdomain.com;
    charset utf-8;

    location / {
        auth_basic "Hound Search";
        auth_basic_user_file /path/to/.htpasswd;
        proxy_pass http://127.0.0.1:6080;
    }
}
```
Make sure to replace `hound.yourdomain.com` with the domain you want to use, and update the path to the `.htpasswd` file. I'm proxying to port 6080, which is the default port used by Hound.
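If you don't want to install Apache just to get `htpasswd`, an equivalent entry can be generated with a few lines of Python. This is a sketch using the `{SHA}` scheme, which is one of the password formats Nginx accepts in `auth_basic_user_file`; the username and password here are placeholders:

```python
import base64
import hashlib


def htpasswd_entry(user: str, password: str) -> str:
    # "{SHA}" entries are base64(sha1(password)), a format understood
    # by Nginx's auth_basic_user_file.
    digest = base64.b64encode(hashlib.sha1(password.encode()).digest())
    return '{}:{{SHA}}{}'.format(user, digest.decode())


# Append the printed line to your .htpasswd file.
print(htpasswd_entry('hound', 'example-password'))
```

Note that `{SHA}` is unsalted SHA-1, so if you do have `htpasswd` available, its default schemes are the stronger choice.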
Wrapping Things Up
To fully complete this you should set up a cron job to regenerate the Hound config periodically (so it will pick up new repos), and a systemd service file for `houndd`. I won't fully cover these topics, since I've written about them elsewhere on my blog. I run `houndd` as a systemd "user" unit, so it doesn't need to be installed by root or require special privileges. If you want to do this, make sure you've run `loginctl enable-linger username` for the username Hound runs as.
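For reference, a minimal user unit might look like the following; the paths are assumptions about where you've put the `houndd` binary and its config, so adjust them to your setup:

```ini
# ~/.config/systemd/user/hound.service
[Unit]
Description=Hound code search daemon

[Service]
# Assumes houndd and config.json both live in ~/hound.
WorkingDirectory=%h/hound
ExecStart=%h/hound/houndd
Restart=on-failure

[Install]
WantedBy=default.target
```

Enable and start it with `systemctl --user enable --now hound`.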