For a couple of years now, a confluence of GNOME bugs related to SSH handling has been driving me crazy. I finally sat down and spent time digging through the internals of various GNOME components in an effort to restore my sanity. This is my story.
Backstory
A few years ago I read about ED25519 SSH keys. I decided they were pretty cool, and as it had been a while since I last rotated my SSH keys, I switched from RSA keys to ED25519 keys.
As it turns out, gnome-keyring-daemon (GKD) does not support ED25519 keys. Let me first explain what this means, since you might be confused about what the relation is between GNOME and SSH keys. Most people expect to be able to log into their desktop and have SSH "just work". This entails the following:
- The session should have SSH_AUTH_SOCK set, pointing at the socket of a running SSH agent.
- If any SSH keys are protected with the same password as the user's login password, they should be automatically unlocked.
GKD can't load ED25519 keys at all. So you can't use it as your SSH agent if you have ED25519 keys.
I briefly entertained the idea of trying to fix the GKD code, but I quickly realized why no one has bothered. GKD has its own implementation of an SSH agent in it, including its own logic for reading RSA and DSA key files. OpenSSH doesn't ship a "libssh" that you can use to parse key files (or implement your own SSH agent), so there's a bunch of really scary code in GKD to reimplement RSA/DSA ssh-agent functionality. To add ED25519 support, you'd have to reimplement all of the ED25519 logic in GKD itself.
No problem: running a vanilla ssh-agent process is pretty easy. I put this code in my ~/.profile and called it a day:
eval $(ssh-agent)
With this solution I had to manually unlock my key on the first use, but that wasn't a huge deal, since I typically log into GNOME just once a day. For a while this worked great. Then one day Wayland came along.
Wayland
Since the earliest days of Unix, even before graphical environments existed,
it's been the case that when you log into a session, your ~/.profile is
sourced. If you're logging in at an actual TTY, your shell does this. When X11
was added to Unix in the 1980s, people decided to keep this convention, to avoid
breaking everyone's configuration. There's a bunch of really hacky shell scripts
that glue everything together to make this work for X11 sessions. The idea is
that after you log in, your login manager execs a shell that sources your
profile, and then the shell execs your window manager. You can think of it as a
kind of shell/window manager ouroboros. Anything exported in your ~/.profile
thus gets exported to all child processes of your window manager.
Linux systems today have evolved a lot, and now we have systemd and Wayland. The
GNOME folks decided to get rid of these kludgey shell scripts with Wayland. The
new vision is that you shouldn't be running any shell code at all to log in to a
graphical session; instead, everything should just get configured with systemd.
So they ripped out the code that sources ~/.profile. This is technically not
related to Wayland, but the code path is only enabled for Wayland sessions,
which is how I ran into it.
Instead of a ~/.profile, the idea is that you add systemd user units that do
whatever you had previously been doing in bash. You can write a systemd unit
file that launches an ssh-agent process. You can also update the environment
that systemd uses for new processes, using systemctl set-environment. So I did
all of this, by creating a systemd unit file called
~/.config/systemd/user/keychain.service with the following contents:
[Unit]
Description=Start keychain
[Service]
Type=oneshot
RemainAfterExit=yes
ExecStart=/usr/bin/keychain --quiet --systemd --agents ssh
[Install]
WantedBy=default.target
This uses funtoo/keychain to handle the logic of managing an ssh-agent
process. Keychain knows how to update the systemd environment. I enabled my
service using systemctl --user enable keychain.service, and I could see that
it was running on login. I also confirmed that it was correctly updating my
environment.
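One way to run that kind of check, as a rough sketch: systemctl --user show-environment prints the user manager's environment as NAME=VALUE lines, so a small filter can pull out a single variable. The helper name env_value is made up for illustration; it reads NAME=VALUE lines on stdin, so it works on any input of that shape.

```shell
#!/bin/sh
# env_value NAME: read NAME=VALUE lines on stdin and print the VALUE of the
# first line whose name is exactly NAME. Hypothetical helper, not part of
# systemd or keychain.
env_value() {
    awk -v name="$1" '
        # Match only lines that begin with "NAME=", then strip that prefix.
        index($0, name "=") == 1 {
            print substr($0, length(name) + 2)
            exit
        }
    '
}

# Typical use (needs a running systemd user session):
#   systemctl --user show-environment | env_value SSH_AUTH_SOCK
```

Comparing that output with echo "$SSH_AUTH_SOCK" in a fresh terminal shows whether the value the unit set is the one the session actually ends up with.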
However, there was a big problem. Some other process would clobber
SSH_AUTH_SOCK after my unit ran. When logging in, I would open a terminal and
see SSH_AUTH_SOCK=/run/user/1000/keyring/ssh, which points to the busted GKD
socket file that doesn't work. If I manually re-ran the keychain command, it
would set SSH_AUTH_SOCK back to the right thing. But I simply could not figure
out how to reorder my systemd services so that my service would run last and
"win" the SSH_AUTH_SOCK battle.
After wasting a lot of time trying to fix this the Right Way, I ended up just
giving up and putting some dirty logic in my .bashrc:
eval $(keychain -q -Q --eval)
I felt bad though, since this is a huge hack. And it meant that my
SSH_AUTH_SOCK wasn't getting propagated into Emacs. Since I launch Emacs
graphically, it gets the SSH_AUTH_SOCK that's set in my systemd user session.
Emacs needs to know the value of SSH_AUTH_SOCK in order to edit remote files
using TRAMP. Normally I don't need to edit remote files over SSH, but every once
in a while it comes up, and when it comes up it's really helpful.
For a while I was actually avoiding Wayland precisely because I didn't want to
deal with this issue. I don't care if my computer uses X11 or Wayland; I just
want SSH to work the way it's always worked. I ran into some other bug a few
weeks ago when using the Fedora 26 Alpha, and ended up switching to Wayland as a
workaround. But I was disheartened that TRAMP no longer worked. This weekend I
found a MELPA package called keychain-environment that teaches Emacs how to
fix up SSH_AUTH_SOCK specifically using keychain. This worked, and I was
able to use TRAMP again. All's well that ends well, right?
Then I thought to myself: I'm a coward. This isn't honorable. What kind of
engineer am I if I can't even figure out how to set SSH_AUTH_SOCK correctly?
Since this was an existential threat to my pride, I realized that the only thing
to do was to go code spelunking and find the evil GNOME code that was hijacking
SSH_AUTH_SOCK.
Code Spelunking
I cloned gnome-keyring from the GNOME git, and started hacking on it. There were
a lot of dead ends here. Initially I thought my problem was caused by
pam_gnome_keyring.so, which is a PAM module that causes GKD to be launched from
GDM. The code in pam_gnome_keyring.so is hardcoded to launch GKD using the
invocation /usr/bin/gnome-keyring-daemon --daemonize --login, and I thought I
could fix this by giving it an option to suppress the SSH component of GKD.
After many reboots and snafus, I actually got this all working, and was able to
build a version of pam_gnome_keyring.so which has configurable components. If
you're curious what that looks like, the diff is here. But it turns out that the
GKD PAM stuff was a red herring: it's actually just used to get your login
password into GKD, and isn't related to the propagation of SSH_AUTH_SOCK, so
this didn't fix my problem.
After that, I commented out every line in GKD that sets SSH_AUTH_SOCK. Surely
if SSH_AUTH_SOCK isn't set anywhere, it can't be updated in my environment.
Yet still no dice, which was a real mystery. However, while I was working on
this, I found the code used to inject environment variables into the user
session. Here's that code:
static void
setenv_request (GDBusConnection *conn, const gchar *env)
{
        const gchar *value;
        gchar *name;

        /* Find the value part of the environment variable */
        value = strchr (env, '=');
        if (!value)
                return;

        name = g_strndup (env, value - env);
        ++value;

        g_dbus_connection_call (conn,
                                SERVICE_SESSION_MANAGER,
                                PATH_SESSION_MANAGER,
                                IFACE_SESSION_MANAGER,
                                "Setenv",
                                g_variant_new ("(ss)",
                                               name,
                                               value),
                                NULL, G_DBUS_CALL_FLAGS_NONE,
                                -1, NULL,
                                on_setenv_reply, NULL);

        g_free (name);
}
The way this works is that GKD has a D-Bus component that calls a D-Bus method
named Setenv on the session manager. Under the covers, systemctl set-environment
works in a similar way, by making a D-Bus method call (in its case, to systemd
itself). Many moons ago I wrote some code that listened to D-Bus events, and
from that work I know about dbus-monitor, which is a tool that helps you inspect
what's going on in D-Bus. I logged into a virtual console (using Ctrl+Alt+F3),
and then set up a pipeline like dbus-monitor | tee dbus.log. I then switched
back to my GDM console and logged into GNOME. In my tee log I looked for all
occurrences of SSH_AUTH_SOCK. Near the start of the log I saw the call done by
my keychain.service file, which set the correct value. Later I saw another
call that was clobbering it with the wrong value:
method call time=1497910250.099756 sender=:1.15 -> destination=org.freedesktop.DBus serial=2 path=/org/freedesktop/DBus; interface=org.freedesktop.DBus; member=UpdateActivationEnvironment
   array [
      dict entry(
         string "SSH_AUTH_SOCK"
         string "/run/user/1000/keyring/ssh"
      )
   ]
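Hunting for entries like this one by eye gets tedious in a long log. As a rough sketch, an awk filter can list every method call in a dbus-monitor log that mentions a given variable, along with its sender, member, and value. The function name find_env_setters is made up for illustration, and the parsing assumes the log layout shown in the entry above: a "method call" header line followed by "string" lines. It handles a name string followed by a value string, and also a single "NAME=VALUE" string.

```shell
#!/bin/sh
# find_env_setters NAME < dbus.log
# Print "sender member value" for each method call in a dbus-monitor log
# that carries NAME. Hypothetical helper; assumes dbus-monitor's text layout.
find_env_setters() {
    awk -v name="$1" '
        # Header line of any message: remember sender/member for method calls,
        # and ignore the bodies of signals, returns, and errors.
        /^(method call|method return|signal|error) / {
            incall = ($1 == "method" && $2 == "call")
            sender = ""
            member = ""
            for (i = 1; i <= NF; i++) {
                if ($i ~ /^sender=/) sender = substr($i, 8)
                if ($i ~ /^member=/) member = substr($i, 8)
            }
            want = 0
            next
        }
        # A string argument inside a method call body.
        incall && /^[ \t]*string "/ {
            s = $0
            sub(/^[ \t]*string "/, "", s)
            sub(/"[ \t]*$/, "", s)
            if (want) {
                # Previous string was NAME, so this one is its value.
                print sender, member, s
                want = 0
            } else if (s == name) {
                want = 1
            } else if (index(s, name "=") == 1) {
                # Single "NAME=VALUE" string.
                print sender, member, substr(s, length(name) + 2)
            }
            next
        }
        # Any other line breaks a name/value pair.
        { want = 0 }
    '
}

# Typical use:
#   find_env_setters SSH_AUTH_SOCK < dbus.log
```

Each output line is one candidate culprit, in the order the calls happened, so the last line is the one that "wins".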
The log entry says that the caller is :1.15. I believe there's a way in D-Bus to
get the PID of the sender, which would allow me to correlate the call with the
name of an actual process, but I didn't need to do that. Just by searching
online for UpdateActivationEnvironment and SSH_AUTH_SOCK I was able to find some
clues that there might be code in gnome-session that invokes this method. I
cloned the gnome-session code, and sure enough, I found the following:
/* hack to fix keyring until we can reorder things in 3.20
 * https://bugzilla.gnome.org/show_bug.cgi?id=738205
 */
if (g_strcmp0 (g_getenv ("XDG_SESSION_TYPE"), "wayland") == 0 &&
    g_getenv ("GSM_SKIP_SSH_AGENT_WORKAROUND") == NULL) {
        char *ssh_socket;

        ssh_socket = g_build_filename (g_get_user_runtime_dir (), "keyring", "ssh", NULL);
        gsm_util_setenv ("SSH_AUTH_SOCK", ssh_socket);
        g_free (ssh_socket);
}
That definitely looks suspicious. Especially since I'm using GNOME 3.24, which
implies that I don't need this code path anyway. I looked at Bug #738205
and it looks like back in the GNOME 3.14 days, the order of the initialization
of various GNOME components under Wayland was incorrect. So this really hacky
code was added to gnome-session as a stopgap to force it to propagate a value of
SSH_AUTH_SOCK that points to GKD.
I deleted these lines of code and... success! Everything works! I can log into
GNOME without having my SSH_AUTH_SOCK clobbered.
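One loose end from the D-Bus log is the sender, :1.15. The message bus does offer a way to map a unique connection name to a process: the org.freedesktop.DBus interface has a GetConnectionUnixProcessID method. A minimal sketch using dbus-send follows. The function names reply_pid and sender_pid are made up for illustration, and the lookup only works while the connection is still alive, because unique names are handed out afresh for every bus session.

```shell
#!/bin/sh
# reply_pid: pull the uint32 out of a `dbus-send --print-reply` reply on stdin.
# Hypothetical helper.
reply_pid() {
    awk '$1 == "uint32" { print $2; exit }'
}

# sender_pid UNIQUE_NAME: ask the session bus which PID owns e.g. ":1.15".
# Hypothetical helper; must run inside the same session as the sender.
sender_pid() {
    dbus-send --session --print-reply \
        --dest=org.freedesktop.DBus /org/freedesktop/DBus \
        org.freedesktop.DBus.GetConnectionUnixProcessID "string:$1" | reply_pid
}

# Typical use:
#   pid=$(sender_pid :1.15) && ps -o comm= -p "$pid"
```

Since the name from a saved log is stale by the next login, this is mostly useful when run alongside dbus-monitor in the live session.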
When I went to the GNOME Bugzilla to report the issue and send a patch, I found
that someone had already reported this in Bug #772919. If I had known this,
I could have saved a lot of time. C'est la vie. I've attached my "patch" to this
bug, but the patch isn't that interesting: it just deletes the entire "if"
branch that munges SSH_AUTH_SOCK.
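Short of patching gnome-session, the condition in the snippet quoted earlier hints at an escape hatch: the override is skipped whenever GSM_SKIP_SSH_AGENT_WORKAROUND is present in gnome-session's environment. As a rough sketch, the same test can be mirrored in shell to predict whether a given environment would trigger the override. The function name would_override_ssh_auth_sock is made up for illustration, and the logic reflects only the snippet as quoted, not any other gnome-session version.

```shell
#!/bin/sh
# would_override_ssh_auth_sock: exit 0 when the quoted gnome-session condition
# would hold for the current environment, non-zero otherwise.
# Hypothetical helper that mirrors the C condition; it changes nothing.
would_override_ssh_auth_sock() {
    # g_strcmp0 (g_getenv ("XDG_SESSION_TYPE"), "wayland") == 0
    [ "${XDG_SESSION_TYPE-}" = "wayland" ] || return 1
    # g_getenv ("GSM_SKIP_SSH_AGENT_WORKAROUND") == NULL means "unset";
    # a variable that is set but empty still counts as set.
    [ -z "${GSM_SKIP_SSH_AGENT_WORKAROUND+set}" ]
}

# Typical use:
#   if would_override_ssh_auth_sock; then
#       echo "gnome-session would point SSH_AUTH_SOCK at the keyring socket"
#   fi
```

How to get the variable into gnome-session's environment before it starts depends on the login setup, and the quoted code does not answer that.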
Was it worth X hours of my time to dive through all of this weird GNOME C code? Objectively, the answer is "no". I already had a hacky workaround for bash and Emacs, so I could have just lived my life as a coward. But that's no way to live life, and doing the Right Thing is much more spiritually fulfilling.