eklitzke.org

assorted things i've written…

Being Self-Taught

2015-07-30

Something I've been thinking about a lot recently is this curious aspect of the profession of programming—what is it that makes some programmers better than others? Obviously this is a complicated question, and a full analysis is outside of the scope of a short blog post (not that I have all the answers anyway).

However, one thing I've been thinking about recently is this weird thing about programming which is that it is almost entirely self-taught. A lot of the people I work with, including many of the brightest ones working on the hardest problems, don't have a formal training in "computer science". Most of them have college degrees in other disciplines. But by the same token, a lot of them don't have college degrees at all.

From talking to people who actually did pursue a degree in "computer science" from a university, I've found that frequently they did not learn in a classroom about any of the following things:

To be successful in the field as a programmer you have to know how to do most or all of the above.

The thing that's really interesting to me is that despite this, it's hard for me to think of any profession that's easier to get into. In fact, it's probably more correct to say: because of this it's hard for me to think of any profession that's easier to get into. Everyone in this field is primarily self-taught.

There is more content on the internet about programming than almost any other field I can think of. If you browse the front page of Hacker News regularly you will find a constant stream of new, high-quality articles about interesting programming topics. Stack Overflow is full of what I can only guess must be millions of questions and answers, and if you have a new question you can ask it. For those who take it upon themselves to learn how to use an IRC client, there are hundreds or thousands of public programming-related IRC channels where you can ask questions and get feedback from people in real time.

I frequently get questions from people asking how I learned about X. How did I learn the difference between a character device and a block device? How did I learn how virtual memory systems work? How did I learn about databases? The answer is the same as how almost every programmer learned what they know: it's mostly self-taught. I spent a ton of time reading things on the internet, and I've spent a lot of time in front of a computer writing code and analyzing the runtime behavior of programs.

Another important and related observation to this is that while there is definitely a correlation between intelligence and programming ability and success, from what I have seen the correlation is a lot weaker than you might think. There are a ton of exceptionally brilliant people that I know who work in the field of computer programming who are poor programmers or who aren't successful in their career. Usually these are people who have decided to focus their interests on something else, who have settled down into a particular niche and haven't branched out of it, or who aren't interested in putting in the hard work that it would take to improve as a programmer-at-large.

It definitely helps a lot to work with smart people. It definitely helps a lot to talk and interact with other people, and to get code reviews and feedback from them. But in my estimation, the lion's share of what there is to learn comes from the amount of work that a person spends alone in front of a computer with a web browser, a text editor, and a terminal. I hope that more people who want to become computer programmers or who want to improve take this into consideration.

Resource Management in Garbage Collected Languages

2015-07-28

I was talking to a coworker recently about an issue where npm was opening too many files and was then failing with an error related to EMFILE. For quick reference, this is the error that system calls that create file descriptors (e.g. open(2) or socket(2)) return when the calling process has hit its file descriptor limit. I actually don't know the details of this problem (other than that it was "solved" by increasing a ulimit), but it reminded me of an interesting but not-well-understood topic related to the management of resources like file descriptors in garbage collected languages.

I plan to use this post to talk about the issue, and ways that different languages work around the issue.

The Problem

In garbage collected languages there's generally an idea of running a "finalizer" or "destructor" when an object is collected. In the simplest case, when the garbage collector collects an object it reclaims that object's memory somehow. When a finalizer or destructor is associated with an object, in addition to reclaiming the object's memory the GC will execute the code in the destructor. For instance, in Python you can do this (with a few important caveats about what happens when the process is exiting!) by implementing a __del__() method on your class. Not all languages natively support this. For instance, JavaScript does not have a native equivalent.
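In Python, the mechanism is easy to sketch. The class and list names below are made up for illustration, and the immediate finalization relies on CPython's reference counting (with the interpreter-exit caveats mentioned above):

```python
released = []

class Resource:
    """Toy object with a finalizer; names are illustrative only."""
    def __init__(self, name):
        self.name = name

    def __del__(self):
        # Runs when the object is collected (with important caveats
        # at interpreter shutdown, as noted above).
        released.append(self.name)

r = Resource("example")
del r  # refcount drops to zero, so CPython finalizes immediately
```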

While the JavaScript language doesn't let you write destructors, real-world JavaScript VMs do have a concept of destructors. Why? Because real JavaScript programs need to do things like perform I/O and interact with real-world resources like browser assets, images, files, and so forth. These constructs, which are provided by a context like the DOM or Node but are not part of the JavaScript language itself, are typically implemented within these VMs by binding against native C or C++ code.

Here's an example. Suppose that you have a Node.js program that opens a file and does some stuff with it. Under the hood, what is happening is that the Node.js program has created some file wrapper object that does the open(2) system call for you, manages the associated file descriptor, does async read/write system calls for you, and so on. When the file object is garbage collected, the file descriptor associated with the object needs to be released to the operating system using close(2). There's a mechanism in V8 to run a callback when an object is collected, and the C++ class implementing the file object type uses this to add a destructor callback that invokes the close(2) system call when the file object is GC'ed.

A similar thing happens in pretty much every garbage collected language. For instance, in Python if you have a file object you can choose to manually invoke the .close() method to close the object. But if you don't do that, that's OK too—if the garbage collector determines that the object should be collected, it will automatically close the file if necessary in addition to actually reclaiming the memory used by the object. This works in a similar way to Node.js, except that instead of this logic being implemented with a V8 C++ binding it's implemented in the Python C API.

So far so good. Here's the interesting issue that I want to discuss. Suppose you're opening and closing lots of files really fast in Node.js or Python or some other garbage collected language. This will generate a lot of objects that need to be GC'ed. Despite there being a lot of these objects, the objects themselves are probably pretty small—just the actual object overhead plus a few bytes for the file descriptor and maybe a file name.

The garbage collector determines when it should run and actually collect objects based on a bunch of magic heuristics, but these heuristics are all related to memory pressure—e.g. how much memory it thinks the program is using, how many objects it thinks are collectable, how long it's been since a collection, or some other metric along these lines. The garbage collector itself knows how to count objects and track memory usage, but it doesn't know about extraneous resources like "file descriptors". So what happens is you can easily have hundreds or thousands of file descriptors ready to be closed, but the GC thinks that the amount of reclaimable memory is very small and thinks it doesn't need to run yet. In other words, despite being close to running out of file descriptors, the GC doesn't realize that it can help the situation by reclaiming these file objects since it's only considering memory pressure.

This can lead to situations where you get errors like EMFILE when instantiating new file objects, because despite your program doing the "right thing", the GC is doing something weird.
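You can simulate this mismatch in CPython with a contrived sketch: a reference cycle defeats the reference counter, so only the cyclic collector can reclaim the object, and the descriptor stays open until a collection actually runs. The Handle class is made up for illustration:

```python
import gc
import os
import tempfile

closed_fds = []

class Handle:
    """Toy wrapper that owns a file descriptor (illustrative)."""
    def __init__(self, path):
        self.fd = os.open(path, os.O_RDONLY)

    def __del__(self):
        os.close(self.fd)
        closed_fds.append(self.fd)

tmp = tempfile.NamedTemporaryFile(delete=False)
tmp.close()

gc.disable()          # stand-in for "the GC hasn't decided to run yet"
h = Handle(tmp.name)
h.cycle = h           # reference cycle: refcounting alone can't free this
fd = h.fd
del h                 # the object is garbage now, but the fd stays open
before = list(closed_fds)
gc.enable()
gc.collect()          # only when a collection runs is the fd released
after = list(closed_fds)
os.unlink(tmp.name)
```

In a real program the cycles are accidental and the fds number in the hundreds, but the mechanism is the same.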

This gets a lot more insidious with other resources. Here's a classic example. Suppose you're writing a program in Python or Ruby or whatever else, and that program is using bindings to a fancy C library that does heavy processing for some task like linear algebra, computer vision, or machine learning. To be concrete, let's pretend it's using bindings to a C library that does really optimized linear algebra on huge matrices. The bindings will make some calls into the C library to allocate a matrix when an object is instantiated, and likewise will have a destructor callback to deallocate the matrix when the object is GC'ed. Well, since these are huge matrices, your matrices could easily be hundreds of megabytes, or even many gigabytes, in size, and all of that data will actually be faulted in and resident in memory. So what happens is the Python GC is humming along, and it sees this PyObject that it thinks is really small, e.g. it might think that the object is only 100 bytes. But the reality is that the object holds an opaque handle to a 500 MB matrix that was allocated by the C library, and the Python GC has no way of knowing that, or even of knowing that 500 MB was allocated anywhere at all! This happens because the C library is probably using malloc(3) or its own allocator, while the Python VM uses its own memory allocator.

So you can easily have a situation where the machine is low on memory, and the Python GC has gigabytes of these matrices ready to be garbage collected, but it thinks it's just managing a few small objects and doesn't GC them in a timely manner. This example is a bit counterintuitive because it can appear like the language is leaking memory when it's actually just an impedance mismatch between how the kernel tracks memory for your process and how the VM's GC tracks memory.
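The accounting gap is easy to demonstrate: a tiny Python object can own a large buffer that sys.getsizeof (and, by extension, the GC's memory heuristics) knows nothing about. Here's a toy stand-in for such a binding, using a ctypes buffer in place of a real C linear algebra library; the Matrix class is made up:

```python
import ctypes
import sys

class Matrix:
    """Stand-in for a binding that owns a big native allocation."""
    def __init__(self, nbytes):
        # The buffer lives outside the instance itself, the way a
        # C library's malloc'ed matrix would.
        self._buf = ctypes.create_string_buffer(nbytes)

m = Matrix(1024 * 1024)   # "1 MB matrix"
visible = sys.getsizeof(m)  # only the instance overhead is visible
```

sys.getsizeof(m) reports a few dozen bytes of object overhead, while the megabyte behind the handle is invisible to the shallow accounting.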

Again, I don't know if this was the exact problem that my coworker had with npm, but it's an interesting thought experiment—if npm is opening and closing too many files really quickly, it's totally possible that the issue isn't actually a resource leak, but actually just related to this GC impedance mismatch.

Generally there's not a way to magically make the GC for a language know about these issues, because there's no way that it can know about every type of foreign resource, how important that resource is to collect, and so on. Typically what you should do in a language like Python or Ruby or JavaScript is to make sure that objects have an explicit close() method or similar, and that method will finalize the resource. Then if the developer really cares about when the resources are released they can manually call that method. If the developer forgets to call the close method then you can opt to do it automatically in the destructor.
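The usual pattern, sketched in Python (the Resource name is illustrative), is an explicit, idempotent close() with the finalizer as a safety net:

```python
class Resource:
    """Explicit close() with a finalizer fallback (illustrative)."""
    def __init__(self):
        self.closed = False

    def close(self):
        if not self.closed:
            # release the underlying handle (fd, socket, ...) here
            self.closed = True

    def __del__(self):
        # Safety net if the caller forgot; timing is up to the GC.
        self.close()

r = Resource()
r.close()   # careful callers release the resource deterministically
r.close()   # idempotent: calling close() twice is safe
```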

C++ Solution

C++ has a really elegant solution to this problem that I want to talk about, since it inspired a similar solution in Python that I'm also going to talk about. The solution has the cryptic name RAII -- Resource Acquisition Is Initialization. In my opinion this is a really confusing name, since most people really care more about finalization than initialization, but that's what it's called.

Here's how it works. C++ has a special kind of scoping called block scoping. This is something that a lot of people think syntactically similar languages like JavaScript have, since those languages also have curly braces, but the block scoping concept is actually totally different. Block scoping means that variables are bound to the curly brace block they're declared in. When the closing curly brace is reached, the variable goes out of scope. This is different from JavaScript (or Python), because in JavaScript variables declared with var are scoped to the function they're in (the newer let and const keywords do use block scoping, but var does not). So a variable declared in a for loop inside of a function is out of scope when the loop exits in C++, but not until the function exits in JavaScript.

In addition to the block scoping concept, C++ also has a rule that says that when an object goes out of scope its destructor is called immediately. The language even specifies the order in which destructors are invoked: if multiple objects go out of scope at once, the destructors are called in reverse order of construction. So suppose you have a block like this:

hello();
{
  Foo a;
  Bar b;
}
goodbye();

The following will happen, in order:

1. hello() is called.
2. a is constructed.
3. b is constructed.
4. The closing brace is reached: b is destructed, then a is destructed (reverse order of construction).
5. goodbye() is called.

This is guaranteed by the language. So here's how it works in the context of managing things like file resources. Suppose I want to have a simple wrapper for a file. I might implement code like this:

#include <fcntl.h>
#include <unistd.h>

class File {
 public:
  File(const char *filename) : fd_(open(filename, O_RDONLY)) {}
  ~File() { close(fd_); }

 private:
  int fd_;
};

int foo() {
  File my_file("hello.txt");
  bar(my_file);
  return baz();
}

(Note: I'm glossing over some details like handling errors returned by open(2) and close(2), but that's not important for now.)

In this example, we open the file hello.txt, do some things with it, and then it automatically gets closed after the call to baz(). So what? This doesn't seem that much better than explicitly opening the file and closing it. In fact, it would certainly have been a lot less code to just call open(2) at the start of foo(), and then add one extra line at the end to close the file before calling baz().

Well, besides being error prone, there is another problem with that approach. What if bar() throws an exception? If we had an explicit call to close(2) at the end of the function, then an exception would mean that line of code would never be run. And that would leak the resource. The C++ RAII pattern ensures that the file is closed when the block scope exits, so it properly handles the case of the function ending normally, and also the case where some exception is thrown somewhere else to cause the function to exit without a return.

The C++ solution is elegant because once we've done the work to write the class wrapping the resource, we generally never need to explicitly close things, and we also get the guarantee that the resource is always finalized, and in a timely manner. And it's automatically exception safe in all cases. Of course, this only works with resources that can be scoped to the stack, but this is true a lot more often than you might suspect.

This pattern is particularly useful with mutexes and other process/thread exclusion constructs where failure to release the mutex won't just cause a resource leak but can cause your program to deadlock.

Python Context Managers

Python has a related concept called "context managers" and an associated syntax feature called a with statement.

I don't want to get too deep into the details of the context manager protocol, but the basic idea is that an object can be used in a with statement if it implements two magic methods called __enter__() and __exit__() which have a particular interface. When the with statement is entered the __enter__() method is invoked, and when the with statement is exited for any reason (an exception is thrown, a return statement is encountered, or the last line of code in the block is executed) the __exit__() method is invoked. Again, there are some details I'm eliding here related to exception handling, but for the purpose of resource management what's interesting is that this provides a similar solution to the C++ RAII pattern. When the with statement is used we can ensure that objects are automatically and safely finalized by making sure that finalization happens in the __exit__() method.
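A minimal context manager looks like this (ManagedFile is a made-up name for illustration; the built-in open() already supports the with statement directly):

```python
import os
import tempfile

class ManagedFile:
    """Minimal context manager around a file (illustrative)."""
    def __init__(self, path):
        self.path = path

    def __enter__(self):
        self.f = open(self.path)
        return self.f

    def __exit__(self, exc_type, exc, tb):
        # Runs on normal exit, return, or exception.
        self.f.close()
        return False  # don't suppress exceptions

with tempfile.NamedTemporaryFile("w", delete=False) as tmp:
    tmp.write("hello")
    path = tmp.name

with ManagedFile(path) as f:
    data = f.read()
was_closed = f.closed  # __exit__ has already closed the file

os.unlink(path)
```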

The most important difference here compared to the C++ RAII approach is that you must remember to use the with statement to get the context manager semantics. With C++, RAII typically implies that the object is allocated on the stack, which means it is automatically reclaimed and there's no chance for you to forget to release the resource.

Go Defer

Go has a syntax feature called defer that lets you ensure that some code is run when the current block is exited. This is rather similar to the Python context manager approach, although syntactically it works a lot differently.

The thing that is really nice about this feature is that it lets you run any code at the time the block is exited, i.e. you can pass any arbitrary code to a defer statement. This makes the feature incredibly flexible—in fact, it is a lot more flexible than the approach that Python and C++ have.

There are a few downsides to this approach in my opinion.

The first downside is that like with Python, you have to actually remember to do it. Unlike C++, it will never happen automatically.

The second downside is that because it's so flexible, it has more potential to be abused or used in a non-idiomatic way. In Python, if you see an object being used in a with statement you know that the semantics are that the object is going to be finalized when the with statement is exited. In Go the defer statement probably occurs close to object initialization, but doesn't necessarily have to.

The third downside is that a defer statement isn't run until the function it's defined in exits. This is less powerful than C++ (because C++ blocks don't have to be function-scoped) and also less powerful than Python (because a with statement can exit before the calling function does).

I don't necessarily think this construct is worse than C++ or Python, but it is important to understand how the semantics differ.

JavaScript

JavaScript doesn't really have a true analog of the C++/Python/Go approaches, as far as I know. What you can do in JavaScript is use a try statement with a finally clause. Then in the finally clause you can put your call to fileObj.close() or whatever the actual interface is. Actually, you can also use this approach in Python if you wish, since Python also has the try/finally construct.
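The shape is the same in either language; here it is in Python, using the built-in open():

```python
import os
import tempfile

# set up a temp file with known contents
with tempfile.NamedTemporaryFile("w", delete=False) as tmp:
    tmp.write("data")
    path = tmp.name

f = open(path)
try:
    contents = f.read()
finally:
    f.close()  # runs whether read() returned normally or raised

os.unlink(path)
```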

Like with Go defer statements, it is the caller's responsibility to remember to do this in every case, and if you forget to do it in one place you can have resource leaks. In a lot of ways this is less elegant than Go because the finalization semantics are separated from the initialization code, and this makes the code harder to follow in my opinion.

My Philosophy on "Dot Files"

2015-07-07

This is my philosophy on dot files, based on my 10+ years of being a Linux user and my professional career as a sysadmin and software engineer. It is also partially based on what I've seen in developers' dotfiles at the company I currently work for, which has a system for managing and installing the dotfiles of nearly 1000 engineers.

When I started using Linux, like every new Unix user I started cribbing dot files from various parts of the internet. Predictably, I ended up with a mess. By doing this, you get all kinds of cool stuff in your environment, but you also end up with a system that you don't understand, is totally nonstandard, and is almost always of questionable portability.

In my experience this is less of a problem for people who are software engineers who don't have to do a lot of ops/sysadmin work. A lot of software engineers only do development on their OS X based computer, and possibly a few Linux hosts that are all running the exact same distro. So what happens is if they have an unportable mess, they don't really know and it doesn't affect them. That's great for those people.

When you start doing ops work, you end up having to do all kinds of stuff in a really heterogeneous environment. It doesn't matter if you work at a small shop or a huge company: if you do any amount of ops work you're going to admin multiple Linux distros, probably various BSD flavors, and so on. Besides that (or even if you have a more homogeneous environment), you end up having to admin hosts that are in various states of disrepair (e.g. failed partway through provisioning) and therefore might as well be different distros.

Early on, the (incorrect) lesson I got out of this was that I needed to focus on portability. This is really hard to do if you actually have to admin a really heterogeneous environment, for a few reasons. For starters, even the basic question of "What kind of system am I on?" is surprisingly hard to answer. The "standard" way to do it is to use the lsb_release command... but as you would guess, this only works on Linux, and only on Linux systems recent enough to have an lsb_release command. If you work around this problem, you still have the problem that it's easy to end up with a huge unreadable soup of if statements that at best is hard to understand, and frequently is too specific to really be correct anyway. You might think that you could work around this by doing "feature testing", which is actually the right way to solve the problem, but this is notoriously hard to do in a shell environment and can again easily make the configuration unreadable or unmaintainable.

It gets even worse for things like terminal emulators. The feature set of different terminal emulators like xterm, aterm, rxvt, and so on varies widely. And it gets even more complicated if you're using a "terminal multiplexer" like screen or tmux. God forbid you try to run something in a vim shell or Emacs eshell/ansi-term. Trying to detect which terminal emulator you're under and what features it actually supports is basically impossible. Even if you could do this reliably (which you can't, because a lot of terminal emulators lie), the feature set of these terminal emulators has varied widely over the years, so simply knowing which terminal emulator you're using isn't necessarily enough to know what features it supports.

As I became a more seasoned Linux/Unix user, what I learned was that I should try to customize as little as possible. Forget those fancy prompts, forget the fancy aliases and functions, and forget the fancy 256-color terminal emulator support. The less you customize the less you rely on, and the easier it becomes to work on whatever $RANDOMSYSTEM you end up on. For a number of years the only customization I would do at all was setting PS1 to a basic colorized prompt that included the username, hostname, and current working directory—and nothing else.

Recently I've softened on this position a bit, and I now have a reasonable amount of configuration. In the oldest version of my .bashrc that I still track with version control (from 2011; sadly I don't have the older versions anymore), the file had just 46 lines. It had a complicated __git_ps1 function I cribbed from the internet to get my current git branch/state if applicable, set up a colorized PS1 using that function, and did nothing else. By 2012-01-01 this file had expanded to 64 lines, mostly to munge my PATH variable and set up a few basic aliases. On 2013-01-01 it was only one line longer at 65 lines (I added another alias). On 2014-01-01 it was still 65 lines. At the beginning of this year, on 2015-01-01, it was 85 lines due to the addition of a crazy function I wrote that had to wrap the arc command in a really strange way. Now as I write this in mid-2015, it's nearly twice the size, at a whopping 141 lines.

What changed here is that I learned to program a little more defensively, and I also got comfortable enough with my bash-fu and general Unix knowledge. I now know what things I need to test for, what things I don't need to test for, and how to write good, portable, defensive shell script. The most complicated part of my .bashrc file today is setting up my fairly weird SSH environment (I use envoy and have really specific requirements for how I use keys/agents with hosts in China, and also how I mark my shell as tainted when accessing China). Most of my other "dot files" are really simple, ideally with as little configuration as possible. Part of this trimming down of things has been aided by setting up an editor with sensible defaults: for real software engineering stuff I use Spacemacs with a short .spacemacs file and no other configuration, and for ops/sysadmin stuff I use a default uncustomized vi or vim environment.

Which brings me to the next part of this topic. As I mentioned before, the company I work at has nearly 1000 engineers. We also have a neat little system where people can have customized dot files installed on all of our production hosts. The way it works is there's a specific git repo that people can clone and then create or edit content in a directory that is the same as their Unix login. The files they create in that directory will be installed on all production hosts via a cron that runs once an hour. A server-side git hook prevents users from editing content in other user's directories. This system means that generally users have their dot files installed on all hosts (with a few exceptions not worth going into here), and also everyone can see everyone else's checked in dot files since they're all in the same repo.

People abuse this system like you would not believe. The main offenders are people who copy oh-my-zsh and a ton of plugins into their dot files directory. There are a few other workalike systems like Bashish (which I think predates oh-my-zsh), but they're all basically the same: you copy thousands of lines of shell code of questionable provenance into your terminal, cross your fingers and hope it works, and then have no idea how to fix it if you later encounter problems. Besides that, I see a ton of people with many-hundreds-of-lines of configuration in their bash/zsh/vim/emacs configuration that are clearly copied from questionable sources all over the internet.

This has given me a pretty great way to judge my coworkers' technical competency. On the lowest rung are the gormless people who have no dot files set up and therefore either don't give a shit at all or can't be bothered to read any documentation. Just above that are the people who have 10,000 lines of random shell script and/or vimscript checked into their dot files directory. At the higher levels are people who have a fairly minimal setup, which you can generally tell just by looking at the file sizes in their directory.

If you want to see what differentiates the really competent people, here are a few things I sometimes look for:

LaRouche PAC

2015-07-06

So the LaRouche PAC has been posted up the last few weeks at 3rd & Market, with a bunch of aggravating signs about Greece, Glass-Steagall, and so forth. In classic LaRouche fashion, they have these stupid posters with Obama sporting a Hitler moustache and other nonconstructive inflammatory things designed to provoke people and pull them into discussions/arguments so they can espouse their conspiracy theories.

I had a friend in college (UC Berkeley) that joined and left the LaRouchies multiple times, and dropped out of school as a result. They are a really scary cult. I want to write about that story, so people realize how fucked up they are.

Basically what would happen is my friend would start going to the LaRouche discussion meetings in the evenings that are open to the public, and would get really caught up in that. Then he'd start also going on their weekend retreats from time to time. The way these work, you go out to some cabin or something in the middle of nowhere, you don't have your cell phone or a way to contact the outside world, and they have this highly structured schedule where you talk about LaRouche politics and ideology all weekend for 16 hours a day. Then since he was spending all of this time in the evening meetings and weekend retreats, he'd stop going to classes and would spend time canvassing with the group. At one point, he had effectively moved out of the Berkeley co-op he was in and was living in some house that they had somewhere in Berkeley or Oakland that they let people stay in who are sufficiently dedicated to the cause (and who are spending some minimum amount of time canvassing, donating money, or doing who knows what else).

I remember the first time he joined the group. At first he was telling me about this cool group that had these interesting math/geometry discussions, and didn't mention it was the LaRouche movement. Maybe he didn't even know at first. Apparently there's some sort of shadow organization where they'll do these tracks focused on less political stuff like math/philosophy, and then they try to use that to get you to start going to the more "philosophy" focused discussions, and then that leads into the actual political arm of the organization which is their real interest. He'd be telling me about how they were doing these interesting Euclidean geometry discussions, and talking about logic, and somehow this math/geometry thing was related to the general concept of rationality, reasoning, and higher level thought and philosophy. Anyway, I was like "yeah that sounds cool maybe I'll check it out some time" and never went for whatever reason, I guess just because it sounded too weird to me. Then over the course of the semester, he started telling me about more of the stuff they were discussing, and started getting into the politics of it. At the time I was fairly up to date with what was going on with national and international politics, but not nearly as knowledgeable as someone who spends all day reading/talking about this stuff, so we'd get into these discussions where he'd be espousing these weird views about whatever was the topic of the moment and I would just be like "OK, whatever, clearly I don't know as much about this issue as you but I still disagree—I'm not going to debate you on this, let's talk about something else."

Then basically the last time I hung out with my friend that semester, we were walking around talking about stuff, and he started telling me this really crazy shit about how Lyndon LaRouche actually was controlling the Democratic Party, and somehow also had Congress and the GOP under his thumb, and all of this really out there stuff. Lyndon LaRouche would issue these internal memos where he'd be predicting various political/economic events, most of which either sounded to me very vague or were not substantiable. From my point of view, the things he predicted correctly would be used to "prove" that he was controlling the Democratic Party or whatever, and then for the stuff that didn't come to fruition there would be some excuse about how something had changed at the last minute and LaRouche had to steer in a different direction. My friend didn't realize how bullshit this was. I was just like WTF I don't even know how to explain how crazy this is, and didn't really see him for a few weeks after that.

The next time I saw him was during the finals week of that semester. I was going down into the UC Berkeley main stacks (read: huge campus library) to do some studying or whatever. So I randomly run into him, and we're talking and he asked me what I was doing down there. So I was like "Uh..... I'm studying for finals... why else would I be down in main stacks during finals week?" and during the ensuing discussion I realized that he didn't even know that it was finals week at school, and had completely stopped going to classes or following anything related to actually being a student at UC Berkeley.

What happened is he failed all of his classes that semester and was put on academic probation. His parents found out, because they were paying his tuition and rent and whatnot. They found out about the LaRouche stuff and freaked out, and they got him to take a semester off of school, live at home with them, and they got him out of the cult. He basically came to his senses, realized that the LaRouche thing was ruining his life, and decided to quit the movement and go back to school again.

The next semester that he was actually back in school we started hanging out again and he filled me in on what happened, how the LaRouche movement is a cult (duh, I had already figured it out by this point), and all of that stuff. But these people from the LaRouche movement kept calling him. We'd be hanging out and he'd get a phone call and be like "hold on, I need to take this", and then he'd spend an hour talking to the person about how he had left the group and wasn't interested in going to their meetings. I don't know why he didn't just stop taking the phone calls, or hang up immediately, but somehow he'd always get dragged into a long discussion.

Predictably what happened is at some point he ended up going to one of their meetings, didn't tell me or his parents about it, and got dragged right back into the cult. Then he stopped going to his classes again and cut off contact with me and the other few friends he had (although looking back, I think I might have been his only friend outside of the LaRouche movement at the time). Then he got kicked out of school since he had failed all of his classes one semester, took a semester off, and then failed all of his classes again. His parents found out and lost their shit again. My friend and his parents were from Bulgaria (I think he had come over to the United States in middle school), and they got him to somehow move to Bulgaria so that he could actually get the fuck out of the LaRouche cult and try to get a job or go to school there. I'm not really sure of the details because he had deleted his Facebook (or maybe that's when I had deleted mine), so I didn't really keep in touch. I did hear a few years later that he was still in Bulgaria, so I think things worked out.

tl;dr Fuck the LaRouche movement. It's a fucked up cult. Do not try to talk with them or engage with them, it's a waste of your time. The best thing you can do is ignore them, and if you see anyone reading their literature tell them they're a fucking cult. There's a bunch of this stuff documented on the internet. Usually the Wikipedia articles are informative on the matter, but the LaRouchies have been in a multi-year edit war with Wikipedia trying to remove any damaging facts about their organization, so what's on Wikipedia is not necessarily trustworthy at any given moment.

An Important Difference Between mysql(1) and MySQLdb

2015-05-27

I keep forgetting about this thing, and then every six to twelve months when I have to do it again, I waste a bunch of time rediscovering it. It's important enough that I'm going to blog it.

If you're used to using PostgreSQL, you'll know that with Postgres you can connect over the local AF_UNIX socket using peer authentication. This means that as the evan user I can automagically connect to the evan database without a password. Likewise, to become the Postgres superuser, I simply need to do sudo -u postgres psql. This works using some magic related to either SO_PEERCRED or SCM_CREDENTIALS, which lets you securely get the credentials of the other end of a connected AF_UNIX socket.
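As an illustration of the underlying mechanism (Linux-specific, and just a sketch—this isn't what Postgres literally runs), Python's socket module exposes SO_PEERCRED, and for a connected AF_UNIX socket you can read the peer's pid/uid/gid like this:

```python
import os
import socket
import struct

# Create a connected AF_UNIX socket pair; both ends live in this process.
parent, child = socket.socketpair(socket.AF_UNIX, socket.SOCK_STREAM)

# SO_PEERCRED returns a struct ucred: three native ints (pid, uid, gid).
creds = parent.getsockopt(socket.SOL_SOCKET, socket.SO_PEERCRED,
                          struct.calcsize("3i"))
pid, uid, gid = struct.unpack("3i", creds)

# The kernel vouches for these values, which is what makes it safe for a
# server to trust them for peer authentication the way Postgres does.
print(pid, uid, gid)
```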

MySQL also has a local AF_UNIX socket, and you can use this socket to make connections to MySQL. This is pretty handy, and for many reasons you may prefer to connect to MySQL over the local socket rather than using a TCP connection to localhost.

However, MySQL does not do the peer authentication thing. It doesn't matter if you're the root user connecting over a local socket. If the root user is configured to require a password (which is what I strongly recommend), then you must supply a password, even if you have sudo privileges on the host.

Fortunately, there's an easy workaround here that prevents you from having to type the root password all the time if you're doing a lot of MySQL administration. When you use the mysql CLI program, it will look for a file called ~/.my.cnf and use it to look up various connection settings. In particular, in this file you can set a default user and password. So let's say you've done this nice thing and made a file called /root/.my.cnf that has the root user's MySQL credentials, and you have the file set to mode 600 and all that and everything is great. You can type sudo mysql and you won't have to supply the root MySQL password (just possibly the root sudo password).
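For reference, a minimal /root/.my.cnf looks something like this (the password is obviously a placeholder):

```ini
[client]
user = root
password = your-root-password-here
socket = /run/mysqld/mysql.sock
```

Remember to set the file to mode 600 so that other users can't read it.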

Here is a really important thing to know: the behavior of reading ~/.my.cnf is something that the mysql CLI program implements, it is not something implemented by libmysqlclient.so!

What that means is that when you are writing some script to frob MySQL using Python and MySQLdb, this will not work:

conn = MySQLdb.connect(unix_socket='/run/mysqld/mysql.sock',
                       user='root')

You might think that if you ran this script as the root user, it could authenticate. Not so. Instead what you want is this:

conn = MySQLdb.connect(unix_socket='/run/mysqld/mysql.sock',
                       user='root',
                       read_default_file='/root/.my.cnf')

By the way, using the read_default_file option like this is definitely the best way to authenticate to MySQL from Python in general. You should not be putting database passwords in your Python projects---neither in your source code, nor in your project configs. By using a file in the filesystem like this you can move all of the database credentials into Puppet/Chef/whatever and secure the files so that most users can't read them. It may not seem like a big win today, but a few years later, when you're given the task of auditing everything for passwords, knowing that passwords have only lived in your configuration management software is going to help a lot.

How To Be An Urban Cyclist—Part 1

2014-11-02

This blog series will explain my advice on being an urban cyclist. The difficulty I've seen with other people is that while a lot of people know how to ride a bike, they may not feel comfortable riding in heavy traffic, on poorly paved roads, or in poorly lit areas. These posts are based on my experience over the last six or seven years of my life cycling mostly around Berkeley, Oakland, San Francisco, and Los Angeles.

The first post in the series will cover what kind of bicycle I recommend, and what kind of gear you need to ride.

First you should have a well maintained bicycle. If you're buying a new bicycle, I strongly recommend getting a road bike with drop bars rather than a cheapo mountain bike. Road bikes are simply a lot faster, and if you don't feel fast you're not going to want to bike. There's nothing more frustrating than seeing people whiz by you on their bikes while you're struggling on yours. Simply put: if you don't feel good on your bike, you're not going to use it.

You can get a decent used steel frame bicycle in the Bay Area for $500-$600 or cheaper, depending on exactly what size frame you need, what type of components you want, etc. If you live elsewhere, you can probably get one cheaper. A decent new road bike will be something like $1000 or more if you want to get really fancy. If you're buying a new bike, I'm a big fan of Surly Bikes, but there's nothing wrong with getting a used bike. If you get a used bike, make sure you ride it and test that it can shift smoothly and brake quickly.

You should get and wear a helmet. You'll easily exceed speeds of 20 mph on your bike, and even in dense urban areas cars frequently exceed speeds of 30 mph or more. For comparison, falling off the top of a two story building entails an impact of about 20 mph. At 20 mph, let alone at higher speeds, you can very easily die in a head-on collision.

Next, make absolutely sure that you have both a front and rear light if you're going to ride in any kind of low-light conditions. Riding in the dark without a light is incredibly dangerous, because you'll be moving quickly, be hard to see, and be making very little noise. I like the silicone lights that don't require any mounting gear, which you can put on and take off your bike really easily (mine are "Snugg" brand and cost $15 for a pair on Amazon). These are great for riding around and being seen. However, they're not going to illuminate the road in front of you. If you plan on biking in really dark areas you'll want a bigger/brighter clip-on light; I'd recommend the ones that are 1 watt or higher power output (most of the ones in the store will be 0.5 watts, which isn't ideal). Make sure you always remove your lights when locking your bike outdoors.

For locks, at the minimum you need a U-lock and cable lock.[1] The U-lock will lock your rear wheel and frame, the cable will lock your front wheel. Note that all of the cables you buy can be cut fairly easily (in a few minutes perhaps); the point of the cable is to deter someone from stealing the front wheel (which is fairly cheap), the U-lock is what will actually be securing your frame. I highly recommend the 5" Mini Kryptonite U-Lock. The 5" locks are not only the smallest ones, but they're also the most secure. U-locks can be easily broken by someone with a jack, if there's enough space to get the jack in between the bars of the lock to bend it. The 5" locks don't admit enough space for someone with a jack to get a hold on the lock. However, you'll really need an adequate rack to lock your bike with a 5" lock. For instance, it's generally not possible to lock your bike to a parking meter with a 5" lock whereas you can with a larger size. When you lock your bike, you need to place the U-lock so that it secures the rear wheel through the rear triangle of the bike. You generally should not directly lock the frame. By locking the rear wheel through the rear triangle, the U-lock is actually going through both the frame and the rear wheel (although it may not look like it!). The cable loops through the front-wheel and back around the U-lock.

In areas with high rates of bike theft, such as San Francisco, you'll need some way to secure your seat as well. I biked and locked my bike outdoors for years in Berkeley, Oakland, and Los Angeles and never had a problem with seat theft. As soon as I started biking in San Francisco, I got my seat stolen twice in the course of a month (both times having left the bike alone for less than an hour). So whether or not you need this really depends on where you live. Bike stores will sell special locks for seats. You can keep the lock on the seat all of the time because you'll only need to remove it in the rare situations when you need to adjust the seat height. If you don't like the look of a seat lock, or want to spend less money, you can also try securing the seat post bolt by using security bolts or hot gluing a BB into the bolt head.

If you're going to ride in the rain, I strongly recommend a detachable rear fender. Otherwise you're going to get a muddy butt. I've never found a front fender to be necessary; if it's rainy enough to need one, you're going to get drenched anyway.

[1] If you have security bolts for your front wheel, you can probably omit buying and carrying a cable lock.

Apollo Brown—Thirty Eight

2014-05-04

This album is sick.

On Not Having A LinkedIn Account

2014-04-30

I don't have a LinkedIn account, which some people find to be a bit strange. I'd like to talk a bit about that.

As a software engineer with an awesome job, I really do not need a constant barrage of recruiter spam. Here are the specifics:

My experience with LinkedIn was that I'd get a torrential inflow of recruiter spam (i.e. "Join our HOT VC-backed stealth startup!!!") that wasn't useful to me at all.

Worse, I found that some people would "stalk" me on LinkedIn before coming in for job interviews. As in, I'd go in to a job interview, and someone would mention something about my past that they had looked up on LinkedIn. This has happened once with my Twitter account too, which is even creepier.

Since LinkedIn provides no value to me and is yet-another-way-to-track-me, I don't have an account with them. EZPZ.

Final Conflict—Ashes To Ashes

2014-04-24

I was recently turned on to Final Conflict's seminal album Ashes To Ashes from this Pitchfork album review. The album review made the album sound awesome, and I'm pretty into some of the other acts from the 80's LA/OC hardcore scene (e.g. Black Flag, TSOL, Adolescents), so I had to check it out.

Put simply, this album is fucking great. I personally have a strong preference for the hardcore sound (i.e. compared to thrash/black/heavy metal) because that's the shit I grew up on, so even though the whole scene was a bit before my time I get nostalgic for it. That said, there are some pretty prominent metal influences in this album that clearly place the album in the late 80s. For instance, the track Abolish Police features some awesome wailing guitar sections not as common in the earlier hardcore stuff (but seen for example in the later Black Flag material). Some of the tracks like Shattered Mirror strongly evoke the sound of some other LA/OC acts like TSOL or Adolescents; in particular, this track reminds me of some of the tracks from the Adolescents' debut album. There are some awesome samples of Reagan-era political speeches on tracks like Political Glory and The Last Sunrise.

tl;dr if you're into hardcore stuff, check this album out.

RSS

2014-04-23

I added an RSS feed to this "blog", again using Python's excellent lxml module. This ended up being really convenient because of the way I was already using lxml to generate the articles from markdown. There's a method .text_content() on the lxml html nodes, so I can already take the marked up article content and extract the text content from it. Thus, the generator script (lovingly called generate.py) ends up being a thin wrapper that generates HTML from the markdown files, then does some element tree gymnastics, and magically out of this comes an HTML element tree that's rendered as the blog content itself, and an RSS element tree.
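As a rough sketch of the trick (the article content and element names here are made up for illustration, not lifted from the actual generate.py): the same lxml tree that renders the page body can yield the plain text for the RSS description via .text_content().

```python
from lxml import etree
from lxml.html import fromstring

# Pretend this is the marked-up article content produced from markdown.
article_html = "<div><h1>RSS</h1><p>I added an <em>RSS feed</em> to this blog.</p></div>"
node = fromstring(article_html)

# .text_content() strips the markup, leaving plain text suitable for
# an RSS <description>.
plain = node.text_content()

# Build the RSS tree with the same lxml API used for the page itself.
rss = etree.Element("rss", version="2.0")
channel = etree.SubElement(rss, "channel")
item = etree.SubElement(channel, "item")
etree.SubElement(item, "title").text = "RSS"
etree.SubElement(item, "description").text = plain
xml = etree.tostring(rss, pretty_print=True).decode()
```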

tl;dr magic happens and then blog.

Cloud Nothings—Attack On Memory

2014-04-23

Right now this album is my favorite thing. Especially the first two tracks, holy shit.

Deafheaven—Sunbather

2014-04-17

Those of you who know me well know that while my music interests are varied, lately (as in, the past few years) I've mostly been listening to hip hop music. I wanted to do a review of a new album I've been really into lately that isn't a hip hop album. That album is Sunbather by Deafheaven.

I found this album somehow by stumbling across links on Pitchfork. I think I was checking out some bands I had found on Vimeo, and a Deafheaven link came up at the bottom of one of the pages. Anyway, I saw the Pitchfork album review, saw that it was rated well and read the album description, and I decided to check out the album. It's an incredibly easy album to get into because the opening track, Dream House, is so powerful. It's very atmospheric with fast-paced guitars and percussion, and very emotive-but-subdued "screamo" vocals. The next track, Irresistible, blends in perfectly with the first track and provides a really nice contrast; it is a very melodic entirely instrumental track. The album generally follows this pattern of a long black metal/emo/screamo track usually followed by a shorter more melodic track.

I can't really do the full album review the same justice as the experts can, so I refer you to the already linked Pitchfork review, as well as The Needle Drop's album review.

What I really love about this album is how accessible and melodic it is, and yet how emotive and powerful a lot of the tracks are. I don't listen to a lot of black metal (which is I guess how the band labels themselves), and I think black metal is generally a somewhat inaccessible genre for outsiders. Yet I was able to pick this album up really easily. This may be because the album is non-traditional for the genre, but I like it.

I'm especially excited because I'm attending the Pitchfork Music Festival in July, and I found out (having already bought tickets) that Deafheaven will be performing there. I'm looking forward to seeing them live!

Hello World

2014-04-16

I made a simple static site generator for my new blog incarnation. It uses Markdown and lxml to generate the site. I am not using any normal templating tools like jinja or mustache.

Since I think it's kind of interesting, articles are structured as separate files in a directory, and an article itself looks like this:

metadata, e.g. the article date
more metadata

blog content starts here

In other words, there is a preamble section of metadata, a blank line, and then the actual markdown text. I parse the metadata, generate HTML using the Python markdown module, and then transform that into an lxml element tree. The lxml element tree is munged to insert the metadata (e.g. the article date).
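Parsing this format only takes a few lines. Here's an illustrative sketch (the real generate.py presumably differs, and the metadata keys here are invented):

```python
# A made-up article in the format described above: a metadata preamble,
# a blank line, then the markdown body.
raw = """\
date: 2014-04-16
title: Hello World

The actual *markdown* body starts here.
"""

# Split the preamble from the body at the first blank line.
header, _, body = raw.partition("\n\n")

# Each preamble line is treated as a "key: value" pair.
metadata = {}
for line in header.splitlines():
    key, _, value = line.partition(":")
    metadata[key.strip()] = value.strip()

# body now holds the markdown text to hand off to the markdown module.
```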

I decided on this format because

Mostly I intend on using this space to talk about music, bicycles, computers, life, work, and all of that good stuff.