This is a read-along for Hacker School’s paper of the week series, covering The Chubby Lock Service for Loosely-Coupled Distributed Systems.
This is one of my favorite papers! A lot of their decisions were made so that their service would be useful for humans, as opposed to for some technical reason. The entire reason they built a lock service instead of a library was so that it would be easy to use, and you can see that thinking reflected at multiple levels.
And when they did technically sound things that humans could screw up, humans screwed it up:
Originally, we did not appreciate the critical need to cache the absence of files, nor to reuse open file handles. Despite attempts at education, our developers regularly write loops that retry indefinitely when a file is not present, or poll a file by opening it and closing it repeatedly when one might expect they would open the file just once.
At first we countered these retry-loops by introducing exponentially-increasing delays when an application made many attempts to Open() the same file over a short period. In some cases this exposed bugs that developers acknowledged, but often it required us to spend yet more time on education. In the end it was easier to make repeated Open() calls cheap.
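To make "cache the absence of files" and "make repeated Open() calls cheap" concrete, here is a minimal sketch of a client-side cache that remembers both successful opens and misses. This is not Chubby's implementation; `FakeServer`, `CachingClient`, and `invalidate` are hypothetical names made up for illustration, and the real system keeps client caches consistent through its own invalidation protocol rather than a manual call.

```python
class FakeServer:
    """Hypothetical stand-in for a remote lock service; counts round trips."""

    def __init__(self, files):
        self.files = dict(files)
        self.rpc_count = 0

    def open(self, path):
        self.rpc_count += 1
        # None means "no such file".
        return self.files.get(path)


class CachingClient:
    """Caches handles *and* misses, so a retry loop hits the server once."""

    def __init__(self, server):
        self.server = server
        self.cache = {}  # path -> handle, or None for a cached miss

    def open(self, path):
        if path not in self.cache:
            # Only the first open of a given path pays for a round trip.
            self.cache[path] = self.server.open(path)
        handle = self.cache[path]
        if handle is None:
            raise FileNotFoundError(path)
        return handle

    def invalidate(self, path):
        # Assumed hook: called when the server reports that the path changed.
        self.cache.pop(path, None)


server = FakeServer({"/ls/cell/config": "handle-1"})
client = CachingClient(server)

# The "bad" developer loops from the quote above: poll a missing file,
# and reopen an existing file over and over.
for _ in range(1000):
    try:
        client.open("/ls/cell/missing")
    except FileNotFoundError:
        pass
    client.open("/ls/cell/config")

print(server.rpc_count)  # 2
```

With a cache like this, the sloppy loop costs the server two round trips instead of two thousand, which is the spirit of making the path of least resistance cheap instead of trying to educate it away.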
Designing APIs for humans and not machines is important! And they’re willing to accept that.
A co-worker of mine designed another internal doodad for Google, and in order to prevent developer misbehavior he prints out warning messages; in one case, a screen full of text explaining how to avoid the issue.
He sometimes tells a story about people who email him, asking for the warning to be removed, because it’s filling up their screen with so much stuff. He tells it with an “oh, developers” tone of voice. It’s a good story; really funny. But the reality is that if the path of least resistance is to do something technically “bad”, a significant percentage of people are going to do that “bad” thing.
There are other interesting bits, like the indications that it’s ok to do things on slow timescales. Lease durations are 12s, and can get extended to 60s under load. Failovers take, on average, 14s, etc. There are a lot of little gems about the practical aspects of implementing a distributed system, like the part about how Chubby makes a better internal name service than DNS because Chubby’s consistency-based model scales better than DNS’s timeout-based model (for some workloads), but those are things that people who build systems are forced to learn, one way or another.
The acknowledgment of the design of everyday APIs is something people can avoid indefinitely; I love that it’s at the forefront of this paper.
This whole thing was written in about 15 minutes, so there are probably typos and other errors. Feedback welcome!
Thanks to Jeshua Smith for catching a typo.