Improving the Pi Searcher's speed by moving from C++ to Go

From the title, I hope you expect that there's a trick here, and, of course, there is.  I'll get to that in a second.  None of this is new - it's the same observation about the overhead of fork/exec in CGI that things like FastCGI were designed to address.  The reason I'm scribbling about it is the way in which moving to a "slower" language actually made it easier to use the right system architecture for better overall performance.

I created the Pi Searcher as a joke in 1996 or thereabouts.  In those days, we were still finding out about new and cool websites by word-of-mouth, and a guy named Arthur Bebak ran an email list called Netsurfer Digest.  In it, he spread news about cool new things happening on this "web" thing.  In one of his issues, he noted a ridiculously useless website, and asked, "What's next?  A site where you can search for your birthday in Pi?".

It had to happen.

Over the years, the Pi Searcher has grown, adopting a series of algorithmic improvements to its searching mechanism.  First, a basic linear search (strstr()-style) in an ASCII file of the digits of Pi.  That got too slow, so I stored the digits as binary-coded decimal, packed four bits per digit, and wrote a fast searcher that compared multiple digits at a time.  And then, watching my poor server strain under the load of Pi Day one year, I finally moved to a suffix array index so that searches would take logarithmic time instead of linear time.  I rewrote the code in C++ a few years ago to clean it up, but it's largely unchanged other than that.  As you might guess, it's pretty crufty.
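To make the packing concrete, here's a minimal sketch of storing decimal digits two per byte and decoding them on demand.  This is illustrative only - it is not the Pi Searcher's actual code, and the function names are mine:

```go
package main

import "fmt"

// packDigits stores decimal digits as binary-coded decimal, two digits
// per byte (four bits each), halving the memory needed for the file.
func packDigits(digits []byte) []byte {
	packed := make([]byte, (len(digits)+1)/2)
	for i, d := range digits {
		if i%2 == 0 {
			packed[i/2] = (d - '0') << 4 // high nibble
		} else {
			packed[i/2] |= d - '0' // low nibble
		}
	}
	return packed
}

// digitAt decodes the digit at position i from the packed form.
func digitAt(packed []byte, i int) byte {
	b := packed[i/2]
	if i%2 == 0 {
		return b >> 4
	}
	return b & 0x0f
}

func main() {
	packed := packDigits([]byte("314159"))
	fmt.Println(len(packed))        // 3 bytes for 6 digits
	fmt.Println(digitAt(packed, 4)) // 5
}
```

The nibble layout is also what makes multi-digit comparison fast: comparing whole bytes compares two digits at a time.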

I decided it would be fun to do a Google Instant-style version of the Pi Searcher (still kinda beta, with some JavaScript bugs), but I was scared of the load again:  The Pi Searcher runs as a CGI program invoked by Apache, with one fork/exec spawn for every search request.  It now runs on an EC2 small instance, which isn't a very hefty machine, and can only handle about 40 queries/second with the CGI model.  Generating a search request for every character someone types in the search box looked like it would destroy my machine on Pi Day.

Enter Go.  I rewrote the Pi Searcher from scratch using Go's net/http to be a persistent server.  I threw away my very fast, optimized search code and implemented something simpler.  I leaned on Go's standard libraries - the JSON encoder, sort.Search to handle the binary search on the suffix array, and so on.  The code to find the start of a search string in the suffix array now looks like this, basically taken from the way Go's own suffix array implementation does it (I had to write my own because I still store the digits packed two per byte and decode them on the fly, in order to save precious memory on EC2):

   i := sort.Search(int(pisearch.numDigits), func(i int) bool {
       return pisearch.Compare(pisearch.idxAt(i), searchkey) >= 0
   })

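For context, here's a hypothetical reconstruction of the helpers that snippet calls.  The names (Compare, idxAt, numDigits) come from the snippet above; the struct layout and method bodies are my assumptions, not the released code:

```go
package main

import (
	"encoding/binary"
	"fmt"
)

// Pisearch holds the mmapped data files.  The field layout here is an
// assumption for illustration.
type Pisearch struct {
	digits    []byte // digits of pi, packed two per byte
	idx       []byte // suffix array, one little-endian uint32 per entry
	numDigits int
}

// idxAt returns the digit position stored in suffix-array slot i.
func (p *Pisearch) idxAt(i int) int {
	return int(binary.LittleEndian.Uint32(p.idx[4*i:]))
}

// digitAt decodes a single digit from the packed file on the fly.
func (p *Pisearch) digitAt(pos int) byte {
	b := p.digits[pos/2]
	if pos%2 == 0 {
		return b >> 4
	}
	return b & 0x0f
}

// Compare compares the digits of pi starting at pos against searchkey
// (digit values 0-9), strcmp-style: it returns 0 when the suffix at
// pos begins with searchkey.
func (p *Pisearch) Compare(pos int, searchkey []byte) int {
	for j, want := range searchkey {
		if pos+j >= p.numDigits {
			return -1 // ran off the end of pi; shorter sorts first
		}
		if got := p.digitAt(pos + j); got != want {
			if got < want {
				return -1
			}
			return 1
		}
	}
	return 0
}

func main() {
	// Six digits "314159" packed into three bytes.
	p := &Pisearch{digits: []byte{0x31, 0x41, 0x59}, numDigits: 6}
	fmt.Println(p.Compare(0, []byte{3, 1, 4})) // 0: pi starts with 314
	fmt.Println(p.Compare(2, []byte{4, 2}))    // -1: 41... < 42
}
```

Because Compare stops once searchkey is exhausted, a return of 0 means "this suffix starts with the key," which is exactly what sort.Search needs.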
[Edit:  I've released the source code on github.  Happy pi searching.]

Let's not talk about how many bugs I'd introduced through various errors in my original C++ version of the packed suffix array search.   The results, comparing total lines of code (as counted by wc) between the hacked-up, JSON-outputting C++ version and the Go version:

Function | C++ Version LoC | Go Version LoC

And remember that the new one is a standalone web server, not just a forked program.  Both implement the same core functionality:
  1. mmap the pi file (200MB)
  2. mmap the index file (800MB)
  3. binary search in the index to locate the start and end of the result range
  4. locate the lowest-position entry in the range that satisfies the search criteria

I didn't want to ditch Apache, so I set things up with Apache simply forwarding the new searcher URL (linked with a sample query) on to the persistent Go program.  Proxying cuts the raw speed in half, but performance is still over an order of magnitude better than the fork/exec version, despite the Go version not implementing any of the more serious search optimizations (and, well, being written in Go. :-):

Apache bench results:
C++ fork/exec version:
Requests per second:    45.76 [#/sec] (mean)
Time per request:       21.851 [ms] (mean, across all concurrent requests)

Go persistent version going through Apache reverse proxy:
Requests per second:    607.67 [#/sec] (mean)
Time per request:       1.646 [ms] (mean, across all concurrent requests)
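
For reference, the Apache side of that setup is just a couple of mod_proxy directives.  This is a hypothetical sketch - the URL path and port are assumptions, not my actual config:

```apache
# Forward Pi Searcher queries to the persistent Go server.
# Requires mod_proxy and mod_proxy_http; path and port are made up.
ProxyPass        /piquery http://127.0.0.1:8080/piquery
ProxyPassReverse /piquery http://127.0.0.1:8080/piquery
```

Keeping Apache in front means the rest of the site stays where it was; the cost is the extra hop, which is the factor-of-two raw-speed hit mentioned above.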

Update May 21, 2013:  Go1.1:  I recompiled the Pi Searcher with the just-released go1.1 code, and things are slightly faster now:
Requests per second:    823.89 [#/sec] (mean)

Woo, free speedup!

Is this anything I couldn't have done in C++?  No, but writing persistent, efficient C++ servers that you really want to keep running forever takes work - more work than I wanted to throw at the Pi Searcher this year (it's a work of love and humor, not a business).  Doing it in Go was fun, reduced the software-engineering burden substantially, and I'm pretty sure it's not going to crash or run out of memory on me, which is more than I can say about the C++ version.  That difference, in turn, translated to a big-picture speedup of an order of magnitude, despite the actual low-level search code being substantially slower.
