I’m pleased to say that my network timeout patch has been accepted into the go-beanstalk package this week.
What I really like about beanstalkd is its simplicity. It compiles to a small single binary and provides a very fast in-memory queue service that also persists jobs to disk, allowing the server to be restarted without losing the jobs on the queue.
Over the last 8 years I’ve used it at my job as the central part of our event processing system, putting tens of millions of jobs through it each day.
It’s been rock solid, and even today, the load it generates is negligible.
When I originally started using beanstalkd, I was still writing in PHP and so used the pheanstalk client library. This worked well, but as I began to move my applications to Go, I also needed to change to a Go-based client library.
Luckily the author of beanstalkd also provides a Go client in the form of go-beanstalk.
Long-lived connections, but no timeouts
Unlike PHP processes, Go processes tend to be long-lived, and so they keep their connections to beanstalkd open for longer.
This has caused some problems in production when the network itself experiences a problem (such as packet loss or high latency due to equipment failure or maintenance).
Our Go processes connect to beanstalkd and begin a long poll, waiting for new jobs to be delivered, using the reserve-with-timeout command. This command returns within a certain period of time if no job is available.
However, this timeout is controlled by the server: the command tells the server how long to wait, and once that time has elapsed, the server tells the client there are no jobs available.
This approach causes problems when the network itself has an issue, as the client will never get the response from the server saying the timeout has elapsed. We found that if the TCP connection stalled for any reason, our consumer processes would hang indefinitely. Not good!
Additionally, if the server was unreachable, the Go beanstalk client would hang trying to make the initial connection.
I decided to fix both issues.
My initial approach to solving the read hang issue was to implement dynamic read timeouts in the go-beanstalk package. My assumption was that if the reserve job command failed, the user of the client would destroy it and reconnect to try again.
However after putting the code up for review, it was explained to me that we could not rely on all users doing this, and that my initial approach may lead to commands being processed out of order.
Instead it was suggested I add TCP Keepalive support to the package.
TCP keepalives are an OS-level feature that can be enabled on TCP connections, such that the OS sends “ping-like” packets back and forth over the connection (transparently to the application using the socket). If the connection is detected as faulty because response packets stop arriving, then any operation the application attempts on the socket (such as a read or write) will fail.
By taking this approach, the change set was actually smaller, and it worked even if the user tried to re-use a faulty connection after the initial timeout, as they would continue to get an error.
I am happy to say that this patch has now been merged (thanks to Keith Rarick), meaning that my own beanstalkworker package can now be modified to use the upstream go-beanstalk client.