General musings on programming languages, and Java.

Monday, June 25, 2007

Book Review - Release It! Design and Deploy Production-Ready Software

Release It! Design and Deploy Production-Ready Software, by Michael Nygard (Pragmatic Programmers)

You know the drill. Here's a sample case of how something went wrong in production. The author needs to pad out the book, so let's make the case more convoluted. Maybe include some code. A book written in dry technical English, painful to write and worse to read. That's what I expected when I received this book, except for the 'Pragmatic Bookshelf' logo on the front.

I was pleasantly surprised; none of the above applies. The book took me quite some time to get through, not because it's padded out, but because it's dense. Although it is written in plain (and sometimes overly American) English, it presents a lot of information, and very readably. The author has clearly put a lot of effort into making sure that the book doesn't patronise, yet does explain its terms clearly.

Even though I'm not exactly the target market for this book, as it is aimed at enterprise developers (the author defines enterprise as systems that cost money whenever they go down), I enjoyed it, and I can apply quite a lot of it to my lowly non-enterprise project. There are a lot of references to Java in the book, but I don't think that should deter those who don't use Java, as the same points can be applied to other technologies. Rarely does it go deeply into Java-specific problems.

Stop Waffling, And Tell Me What The Book's About

Stability, capacity and operations.

The stability part covers ways of making sure that errors in software don't spiral out of control. Even where there are failover subsystems in place, and everything seems isolated from everything else, there may be a hidden route of chaos, such that when one system goes down in a certain way, all the rest do. The 'Circuit Breaker' pattern will help to stop that.

Capacity has a lot more human factors in it - how many sessions can your server support? Can you make sessions shorter in the event of a surge, saving RAM but annoying users? Does your testing environment match the production environment closely enough? Can you get away with static content for those users who are browsing, not buying? Do you really need pretty-printed HTML, adding 50K to data transfers?

Operations is about ensuring that the administrators have an easy time of it: that the logs are meaningful and useful, that you can inspect (and even tinker with) a running application, that you can throttle backups so that capacity isn't affected, and so on.

A lot of the book may seem intuitive, but it's good to see your intuitions formalised, and taken a few steps further. Some programmers reading this book will be screaming out things like "use Erlang and this problem disappears", and they'll often be right, but it's useful to understand a problem, even if your environment makes it impossible. Further, many problems will out in other ways even if an environment prevents the major cause.

So I'd recommend this book to any programmer - learn from Michael's mistakes, and the mistakes of those around him, then hopefully your own mistakes won't be so costly. I'm glad that I get to keep the review copy, as I will be looking back through it again. Plus, I read some of it in the bath, so it's got a few crinkled pages - I don't fancy posting it back to anyone!

The stories about what happened in production are always real, though the names are changed, and the author builds up a real sense of suspense that I've not often seen in technical books. The problems that he and his colleagues find are often hilarious, but the lessons drawn from them are important. I've often found, as I said at the start of this review, that stories in technical books are very dry, very boring and just padded out, but the author has enough writing skill, and great raw material, so that hasn't happened here.

To give you a flavour of how the book is written, here are a few randomish paragraphs:

Avoid fiddling
Human intervention leads to problems. Eliminate the need for recurring human intervention. Your system should run at least for a typical deployment cycle without manual disk cleanups or nightly restarts.
It seems like common sense, but there are plenty of systems that require routine restarts. Common sense just isn't that common, it seems.
For example, I've seen badly configured proxy servers start re-requesting a user's last URL over and over again. I was able to identify the user's session by its cookie and then trace the session back to the registered customer. Logs showed that the user was legitimate. For some reason, fifteen minutes after the user's last request, the request started reappearing in the logs. At first, these requests were coming in every thirty seconds. They kept accelerating, though. Ten minutes later, we were getting four or five requests every second. These requests had the user's identifying cookie but not his session cookie. So, each request was creating a new session. It strongly resembled a DDoS attack except that it came from one particular proxy server on one Navy base.
Would you have thought of that in advance? I wouldn't; I'd have just blocked the user automatically, assuming malice.
Amazon ran into trouble with the Xbox 360, too. In November 2006, Amazon decided to offer 1,000 units for just $100. News of the offer spread far and wide. Not surprisingly, the 1,000 units sold within five minutes. Unfortunately, nothing else sold during that time, because millions of visitors hammered on their Reload buttons, trying to load the special offer page and score a huge discount on the hot console.
The context here is attacks of self-denial, where systems are in effect attacked by themselves, for example by a special offer attracting an incredible surge of page requests.

Again, I'd recommend this to any programmer, and to any administrator who deals with bespoke solutions. Perhaps if all you do as an administrator is install and configure applications for which a support contract exists, and you can afford the time between a problem and the support engineers appearing, then you won't benefit from this book. But then surely you're only 5 minutes away from being replaced by a shell script.

The one missing part, for me, is a discussion of how to go about designing a system from scratch. All the case studies describe existing systems, and often only a small part of them. I'd like some suggestions on how to write a system that will be able to scale easily, and how to avoid writing systems that won't.
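
As an aside on the stability material: the 'Circuit Breaker' pattern mentioned earlier can be sketched in a few lines of Java. This is a minimal illustration under stated assumptions, not code from the book. The class name, the failure threshold, the reset timeout and the injectable clock are all hypothetical choices of mine. The idea is that after a run of consecutive failures the breaker 'opens' and returns a fallback straight away instead of calling the failing system, and once the timeout has passed it lets a single trial call through.

```java
import java.util.function.LongSupplier;
import java.util.function.Supplier;

/** Illustrative circuit breaker. All names and numbers here are hypothetical. */
final class CircuitBreaker {
    private final int failureThreshold;     // consecutive failures before opening
    private final long resetTimeoutMillis;  // how long to stay open before a trial call
    private final LongSupplier clock;       // injectable time source, handy for testing

    private int consecutiveFailures = 0;
    private long openedAt = -1;             // -1 means the circuit is closed

    CircuitBreaker(int failureThreshold, long resetTimeoutMillis, LongSupplier clock) {
        this.failureThreshold = failureThreshold;
        this.resetTimeoutMillis = resetTimeoutMillis;
        this.clock = clock;
    }

    /** True while the breaker is refusing calls. */
    synchronized boolean isOpen() {
        return openedAt >= 0 && clock.getAsLong() - openedAt < resetTimeoutMillis;
    }

    /**
     * Runs remoteCall unless the circuit is open, in which case the fallback
     * is returned without touching the remote system at all.
     * Simplification: the lock is held for the duration of the call.
     */
    synchronized <T> T call(Supplier<T> remoteCall, T fallback) {
        if (isOpen()) {
            return fallback;                // fail fast, protect the caller
        }
        try {
            T result = remoteCall.get();
            consecutiveFailures = 0;        // success closes the circuit again
            openedAt = -1;
            return result;
        } catch (RuntimeException e) {
            consecutiveFailures++;
            if (consecutiveFailures >= failureThreshold) {
                openedAt = clock.getAsLong();   // open (or re-open) the circuit
            }
            return fallback;
        }
    }
}
```

A caller might create one breaker per integration point, for example `new CircuitBreaker(3, 30_000, System::currentTimeMillis)`, and wrap each remote call in `call(...)` with a sensible fallback value. A production version would want finer-grained locking, logging and some way for operations staff to inspect or reset the state.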

About Me

A salsa dancing, DJing programmer from Manchester, England.