Down the SSH_AUTH_SOCK Rabbit Hole: A GNOME Adventure

For a couple of years now, a confluence of GNOME bugs related to SSH handling has been driving me crazy. I finally sat down and spent time digging around in the internals of various GNOME components, in an effort to restore my sanity. This is my story.

Backstory

A few years ago I read about ED25519 SSH keys. I decided they were pretty cool, and as it had been a while since I last rotated my SSH keys, I switched from RSA keys to ED25519 keys.

As it turns out, gnome-keyring-daemon (GKD) does not support ED25519 keys. Let me first explain what this means, since you might be confused about what the relation is between GNOME and SSH keys. Most people expect to be able to log into their desktop and have SSH "just work". To make that happen, GKD acts as the SSH agent for your desktop session.

But GKD can't load ED25519 keys at all, so you can't use it as your SSH agent if you have ED25519 keys.

I briefly entertained the idea of trying to fix the GKD code, but I quickly realized why no one has bothered. GKD has its own implementation of an SSH agent in it, including its own logic for reading RSA and DSA key files. OpenSSH doesn't ship a "libssh" that you can use to parse key files (or implement your own SSH agent), so there's a bunch of really scary code in GKD to reimplement RSA/DSA ssh-agent functionality. To add ED25519 support, you'd have to reimplement all of the ED25519 logic in GKD itself.

No problem, running a vanilla ssh-agent process is pretty easy. I put this code in my ~/.profile and called it a day:

eval $(ssh-agent)

With this solution I had to manually unlock my key on the first use, but that wasn't a huge deal, since I typically log into GNOME just once a day. For a while this worked great. Then one day Wayland came along.

Wayland

Since the earliest days of Unix, even before graphical environments existed, it's been the case that when you log into a session, your ~/.profile is sourced. If you're logging in at an actual TTY, your shell does this. When X11 was added to Unix in the 1980s, people decided to keep this convention, to avoid breaking everyone's configuration. There's a bunch of really hacky shell scripts that glue everything together to make this work for X11 sessions. The idea is that after you log in, your login manager execs a shell that sources your profile, and then the shell execs your window manager. You can think of it as a kind of shell/window manager ouroboros. Anything exported in your ~/.profile will thus be exported to all child processes of your window manager.

Linux systems today have evolved a lot, and now we have systemd and Wayland. The GNOME folks decided to get rid of these kludgey shell scripts with Wayland. The new vision is that you shouldn't be running any shell code at all to log in to a graphical session; instead, everything should just get configured with systemd. So they ripped out the code that sources ~/.profile. This is technically not related to Wayland, but the code path is only enabled for Wayland sessions, which is how I ran into it.

Instead of a ~/.profile, the idea is that you add systemd user units that do whatever you had previously been doing in bash. You can write a systemd unit file that launches an ssh-agent process. You can also update the environment that systemd uses for new processes, using systemctl set-environment. So I did all of this, by creating a systemd unit file called ~/.config/systemd/user/keychain.service with the following contents:

[Unit]
Description=Start keychain

[Service]
Type=oneshot
RemainAfterExit=yes
ExecStart=/usr/bin/keychain --quiet --systemd --agents ssh

[Install]
WantedBy=default.target

This uses funtoo/keychain to handle the logic of managing an ssh-agent process. Keychain knows how to update the systemd environment. I enabled my service using systemctl --user enable keychain.service, and I could see that it was running on login. I also confirmed that it was correctly updating my environment.
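One way to do that kind of confirmation, as a minimal sketch: systemctl --user show-environment prints the user manager's environment as NAME=value lines, so the variable of interest can be filtered out of it. The helper name manager_env_value is hypothetical, and the sketch assumes simple values that systemctl prints unquoted.

```shell
#!/bin/sh
# Hypothetical helper: read NAME=value lines on stdin and print the value
# of the variable named in $1 (first match only).
manager_env_value() {
    # Match only lines that start with "NAME=", and cut after the first
    # "=" so that values containing "=" survive intact.
    awk -v name="$1" '
        index($0, name "=") == 1 { print substr($0, length(name) + 2); exit }
    '
}

# Usage (needs a running systemd user manager):
#   systemctl --user show-environment | manager_env_value SSH_AUTH_SOCK
```

Running the usage line right after login and again from a terminal later shows whether the value the unit set is still the one the user manager holds.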

However, there was a big problem. Some other process would clobber SSH_AUTH_SOCK after my unit ran. When logging in, I would open a terminal and see SSH_AUTH_SOCK=/run/user/1000/keyring/ssh, which is pointing to the busted GKD socket file that doesn't work. If I manually re-ran the keychain command it would set SSH_AUTH_SOCK back to the right thing. But I simply could not figure out how to reorder my systemd services so that my service would run last and "win" the SSH_AUTH_SOCK battle.
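A quick way to spot the clobbering from any terminal is to classify the socket path. This is only a sketch: the function name classify_agent_socket is hypothetical, and the one pattern it knows is the GKD path quoted above. Anything else is reported as "other" rather than guessed at.

```shell
#!/bin/sh
# Hypothetical helper: guess which agent a socket path belongs to.
# The only pattern recognized is the GKD path seen in the text
# (/run/user/<uid>/keyring/ssh).
classify_agent_socket() {
    case "$1" in
        "")            echo "unset" ;;
        */keyring/ssh) echo "gnome-keyring" ;;
        *)             echo "other" ;;
    esac
}

# Usage: classify_agent_socket "$SSH_AUTH_SOCK"
```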

After wasting a lot of time trying to fix this the Right Way, I ended up just giving up and putting some dirty logic in my .bashrc:

eval $(keychain -q -Q --eval)

I felt bad though, since this is a huge hack. And it meant that my SSH_AUTH_SOCK wasn't getting propagated into Emacs. Since I launch Emacs graphically, it gets the SSH_AUTH_SOCK that's set in my systemd user session. Emacs needs to know about the value of SSH_AUTH_SOCK in order to edit remote files using TRAMP. Normally I don't need to edit remote files over SSH, but every once in a while it comes up, and when it comes up it's really helpful.

For a while I was actually avoiding Wayland precisely because I didn't want to deal with this issue. I don't care if my computer uses X11 or Wayland, I just want SSH to work the way it's always worked. I ran into some other bug a few weeks ago when using the Fedora 26 Alpha, and ended up switching to Wayland as a workaround. But I was disheartened that TRAMP no longer worked. This weekend I found a MELPA package called keychain-environment that teaches Emacs how to fix up the SSH_AUTH_SOCK specifically using keychain. This worked, and I was able to use TRAMP again. All's well that ends well, right?

Then I thought to myself: I'm a coward. This isn't honorable. What kind of engineer am I if I can't even figure out how to set SSH_AUTH_SOCK correctly?

Since this was an existential threat to my pride, I realized that the only thing to do was to go code spelunking and find the evil GNOME code that was hijacking SSH_AUTH_SOCK.

Code Spelunking

I cloned gnome-keyring from the GNOME git, and started hacking on it. There were a lot of dead ends here. Initially I thought my problem was caused by pam_gnome_keyring.so, which is a PAM module that causes GKD to be launched from GDM. The code in pam_gnome_keyring.so is hardcoded to launch GKD using the invocation /usr/bin/gnome-keyring-daemon --daemonize --login, and I thought I could fix this by giving it an option to suppress the SSH component of GKD. After many reboots and snafus, I actually got this all working, and was able to build a version of pam_gnome_keyring.so which has configurable components. If you're curious what that looks like, the diff is here. But it turns out that the GKD PAM stuff was a red herring: it's actually just used to get your login password into GKD, and isn't related to the propagation of SSH_AUTH_SOCK, so this didn't fix my problem.

After that, I commented out every line in GKD that sets SSH_AUTH_SOCK. Surely if SSH_AUTH_SOCK isn't set anywhere, it can't be updated in my environment. Yet still no dice, which was a real mystery. However, while I was working on this, I found the code used to inject environment variables into the user session. Here's that code:

static void
setenv_request (GDBusConnection *conn, const gchar *env)
{
        const gchar *value;
        gchar *name;

        /* Find the value part of the environment variable */
        value = strchr (env, '=');
        if (!value)
                return;

        name = g_strndup (env, value - env);
        ++value;

        g_dbus_connection_call (conn,
                                SERVICE_SESSION_MANAGER,
                                PATH_SESSION_MANAGER,
                                IFACE_SESSION_MANAGER,
                                "Setenv",
                                g_variant_new ("(ss)",
                                               name,
                                               value),
                                NULL, G_DBUS_CALL_FLAGS_NONE,
                                -1, NULL,
                                on_setenv_reply, NULL);

        g_free (name);
}

The way this works is that GKD makes a D-Bus method call named Setenv on the session manager. systemctl set-environment works along similar lines: it is also a D-Bus method call, in that case to the systemd user manager. Many moons ago I wrote some code that listened to D-Bus events, and from that work I knew about dbus-monitor, a tool that helps you inspect what's going on in D-Bus. I logged into a virtual console (using Ctrl+Alt+F3), and then set up a pipeline like dbus-monitor | tee dbus.log. I then switched back to my GDM console and logged into GNOME. In my tee log I looked for all occurrences of SSH_AUTH_SOCK. Near the start of the log I saw the call done by my keychain.service file, which set the correct value. Later I saw another call that was clobbering it with the wrong value:

method call time=1497910250.099756 sender=:1.15 -> destination=org.freedesktop.DBus serial=2 path=/org/freedesktop/DBus; interface=org.freedesktop.DBus; member=UpdateActivationEnvironment
   array [
      dict entry(
         string "SSH_AUTH_SOCK"
         string "/run/user/1000/keyring/ssh"
      )
   ]
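Searching a long dbus-monitor log by eye gets tedious, so here is a sketch of how the search could be automated. The helper name find_env_setters is hypothetical, and it assumes the layout shown in the entry above: a header line carrying sender= and member=, then the variable name and its value as two consecutive string lines. Calls that pass a single "NAME=value" string are not handled.

```shell
#!/bin/sh
# Hypothetical helper: scan dbus-monitor output on stdin and, for every
# message that mentions the variable named in $1, print
# "<sender> <member> <value>".
find_env_setters() {
    awk -v name="$1" '
        # Header lines start in column one and carry sender= and member=.
        /^[a-z]/ && / sender=/ {
            sender = ""; member = ""
            for (i = 1; i <= NF; i++) {
                if ($i ~ /^sender=/) sender = substr($i, 8)
                if ($i ~ /^member=/) member = substr($i, 8)
            }
            next
        }
        # The previous line named the variable, so this line holds its value.
        want {
            if (match($0, /"[^"]*"/))
                print sender, member, substr($0, RSTART + 1, RLENGTH - 2)
            want = 0
            next
        }
        # A body line of the form: string "NAME"
        index($0, "string \"" name "\"") { want = 1 }
    '
}

# Usage: find_env_setters SSH_AUTH_SOCK < dbus.log
```

Run against the log entry above, it reports the sender, the member, and the value that was set.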

This log entry says that the caller is :1.15. D-Bus has a way to get the PID of the sender (the bus driver's GetConnectionUnixProcessID method), which would allow me to correlate the call with the name of an actual process, but I didn't need to do that. Just by searching online for UpdateActivationEnvironment and SSH_AUTH_SOCK I was able to find some clues that there might be code in gnome-session that invokes this method. I cloned the gnome-session code, and sure enough, I found the following:

/* hack to fix keyring until we can reorder things in 3.20
 * https://bugzilla.gnome.org/show_bug.cgi?id=738205
 */
if (g_strcmp0 (g_getenv ("XDG_SESSION_TYPE"), "wayland") == 0 &&
    g_getenv ("GSM_SKIP_SSH_AGENT_WORKAROUND") == NULL) {
        char *ssh_socket;

        ssh_socket = g_build_filename (g_get_user_runtime_dir (), "keyring", "ssh", NULL);
        gsm_util_setenv ("SSH_AUTH_SOCK", ssh_socket);
        g_free (ssh_socket);
}

That definitely looks suspicious. Especially since I'm using GNOME 3.24, which implies that I don't need this code path anyway. I looked at Bug #738205 and it looks like back in the GNOME 3.14 days, the order of the initialization of various GNOME components under Wayland was incorrect. So this really hacky code was added to gnome-session as a stopgap to force it to propagate a value of SSH_AUTH_SOCK that points to GKD.
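To make the condition easier to reason about, here is a shell sketch of the same predicate. The function name gsm_forced_ssh_auth_sock is hypothetical, and the sketch assumes XDG_RUNTIME_DIR is set, in which case that is the directory g_get_user_runtime_dir() returns.

```shell
#!/bin/sh
# Sketch of the condition in the gnome-session snippet above. Prints the
# socket path gnome-session would force, or nothing if the hack is skipped.
gsm_forced_ssh_auth_sock() {
    # g_getenv(...) == NULL means "unset", so an empty value still counts
    # as set and still skips the workaround.
    if [ "${XDG_SESSION_TYPE:-}" = "wayland" ] &&
       [ -z "${GSM_SKIP_SSH_AGENT_WORKAROUND+set}" ]; then
        # Assumption: XDG_RUNTIME_DIR is set (e.g. /run/user/1000).
        echo "${XDG_RUNTIME_DIR}/keyring/ssh"
    fi
}

# Usage: gsm_forced_ssh_auth_sock
```

Read this way, the code offers an escape hatch that needs no patching: if GSM_SKIP_SSH_AGENT_WORKAROUND is present in gnome-session's own environment, the override is skipped. How to get it there before gnome-session starts depends on the login setup and is not covered by this sketch.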

I deleted these lines of code and... success! Everything works! I can log into GNOME without having my SSH_AUTH_SOCK clobbered.

When I went to the GNOME Bugzilla to report the issue and send a patch, I found that someone had already reported this in Bug #772919. If I had known this, I could have saved a lot of time. C'est la vie. I've attached my "patch" to this bug, but the patch isn't that interesting: it just deletes the entire "if" branch that munges SSH_AUTH_SOCK.

Was it worth X hours of my time to dive through all of this weird GNOME C code? Objectively, the answer is "no". I already had a hacky workaround for bash and Emacs, so I could have just lived my life as a coward. But that's no way to live life, and doing the Right Thing is much more spiritually fulfilling.