2009/03/09

That Russian who never sleeps

Many years ago, yours truly was remembered (fondly, I hope) by some people in Jakarta, Indonesia as "that Russian who never sleeps". This is how it happened.

The company I worked for was a vendor of telecom billing and customer care solutions. At that point in time, we had just produced a brand new version of the product, and our gig in Indonesia was basically the first attempt to implement it in production. As if that didn't sound scary enough already, one of the centerpieces of the solution was a prepaid pricing engine, newly minted, as was the rest of the system. The difference between postpaid and prepaid billing is terrifying. If something goes wrong with postpaid billing, you can always roll it back, fix the problem and rerun the file with the call records again. If prepaid billing misbehaves, within minutes several million people go incommunicado.

So, for several months before going live, we (the implementation project team) tested the heck out of the product, from all sorts of angles, placing special emphasis on the prepaid pricing component. By the end of it, everything looked ship-shape. We put the system into limited production, supporting just the telco employees, then added a few thousand new subscribers to it. Everything was still as fine and dandy as can be. And then we did the first round of data migration, transferring half a million subscribers from the legacy system to ours. And hell ensued. The prepaid billing part of the system worked fine. But most of the rest of it would go down with an OutOfMemoryError for an hour a day, during the peak hour, every workday, for several days.

To cut a very long story short, the root cause was as follows. The system was killed by an overload of the Balance Query service. Subscribers could dial some magic number on their phones, and immediately see how much money they had on their account. Scalability of this service was just good enough to support half a million subscribers during most of the day, except the peak hour of a workday. Yes, we had tested scalability rather thoroughly, including that service. It proved several times more scalable in tests than our customer wanted it to be. There was a slight problem though. Scalability targets were derived from an analysis of the legacy system logs. This analysis overlooked one thing. In the legacy system, balance queries were a paid service. In the new one, their marketing people made it free. Indonesia is not exactly a rich country, and a cellphone bill is a noticeable item in most family budgets there. So, within days of the launch, users formed a habit of making balance queries after every other phone call. As a result, the Balance Query service was called MUCH MORE often than projected. Whoops!

Lessons learned:

* Don't trust load forecasts based on the legacy system. Unless the new system is a straight port, the actual usage patterns will be different, perhaps drastically.

* Avoid closed-source middleware like the plague, especially from small vendors. In our case, the scalability bottleneck was caused by an ORM creating massive object churn (and as some of you might remember, the Java 1.3 garbage collector was not terribly concurrent). The ORM vendor sold their product to a much larger company and ceased to exist. The new owner refused to support old versions, and hadn't released their own version yet. In the days before JProfiler and SAP Heap Analyzer, tracking down a problem like this was bloody damn hard to begin with. Doing it without the source code of the piece actually causing the problem was a guessing game. By now, I must admit, my subconscious mind has a huge anti-middleware bias - this project was how it started to develop.

* First and second level support of large middleware and operating system vendors is useless. First level support just answers the phone. Second level support deals with issues that can be found in the knowledge base - and normally you can find the answer quicker yourself. If Google comes back with nothing, your only chance to resolve anything is talking to third level support. This is where the people who actually solve problems for a living happen to be. Getting there in less than two months normally requires escalation to senior management, if not the executive level. If you are important enough for that, fine. Otherwise, tough luck.

As an aside, I really loved working side by side with the locals on that gig. Some of them literally had their nice, well-paying jobs hinged on the success of the project. And still, they were able to cope with the crisis without even losing their sense of humor. Not to mention a surprising number of really competent people in that particular IT organization. Terima kasih :)

2009/03/06

The keyword in "premature optimization is evil" is "premature"

Some friends of mine are working in a small IT organization, trying to adopt agile principles and methods. Recently, I sort of sold them on the "make it work, then make it fast" principle. Next thing that happened: they wrote some code, put it into production, and it proved to be too slow. Apparently, significant rework is needed now. Here is what I probably should have told them about avoiding premature optimization (and didn't).

Performance needs to be tested. "Make it work, then make it fast" doesn't say "make it work *in production*, then make it fast". The right sequence generally is "make it work, then make it fast, then put it into production".

Also, performance testing should be continuous, or at least ongoing. "Make it work first, then make it fast" doesn't mean "build the entire application first". If you are working on a story, and that story has interesting performance requirements (in my friends' case, half of their stories are probably like that), it's better to figure out performance requirements for the story, and actually test your implementation for meeting those requirements *as part of the story itself*. If for some reason it cannot be done, at least plan to do it as early as possible.
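One way to make that concrete is a small timing check that lives next to the story's other tests. What follows is only a minimal sketch in plain Ruby, not something my friends actually use: `build_report`, the 10,000-row input and the 0.5 second budget are all made-up stand-ins for whatever the story really delivers and requires.

```ruby
require "benchmark"

# Hypothetical stand-in for the code delivered by one story.
def build_report(rows)
  rows.map { |row| row * 2 }.sum
end

# Runs the given block `runs` times and returns the fastest wall-clock
# time in seconds; taking the minimum reduces noise from other processes.
def best_time(runs: 5, &work)
  Array.new(runs) { Benchmark.realtime(&work) }.min
end

# True when the fastest run fits within the budget (in seconds).
def within_budget?(budget_seconds, runs: 5, &work)
  best_time(runs: runs, &work) <= budget_seconds
end

# Hypothetical requirement taken from the story: 10,000 rows in under 0.5 s.
rows = (1..10_000).to_a
unless within_budget?(0.5) { build_report(rows) }
  raise "build_report is over its performance budget"
end
```

A check like this is crude (one machine, synthetic data), but it fails loudly while the story is still open, which is the whole point.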

It's probably worth talking about the historical context in which it was originally observed that "premature optimization is the root of all evil". As far as I understand, it was said in the 1970s. CPU cycles were an expensive commodity back then. So expensive that they were a real show-stopper for many software projects. This taught people to be extremely careful about wasting CPU cycles. And as happens with any lesson of this kind, some took it to the other extreme and sacrificed a lot of readability for marginal gains in performance. That's what 90% of premature optimization stood for back then, and the advice behind the "root of all evil" observation was to create a readable and easy to maintain implementation first, then measure its performance, and then optimize it if that proves necessary.

These days, the situation with CPU cycles is very different (they are cheap and plentiful), and some of us may have learned the lesson of avoiding premature optimization a little too well. To the point of avoiding optimization altogether, until the need for faster code becomes painfully obvious.

To summarize, the idea behind the "make it work first" principle is to trade maintainability for speed only when you have some proof that such a trade-off is necessary. Usually, the only way to prove it is to build something and test it. So, make it work first, check if it is fast enough as early as possible, then make it faster if still needed.

Let's use real languages for builds

For the last few months, I've been away from the Ruby land, doing first a Java and now a .NET project. I also had to write a few hundred lines of Python. Hold off on the condolences - it wasn't a bad thing at all. I actually loved the opportunity to tinker with other mainstream platforms and see how things are there these days. Besides, JetBrains makes wonderful tools. Besides, next time someone tells me Rails has too much magic, I'll be laughing - compared to Spring/Hibernate? - give me a break! But I digress.

This is to report that recent Java/.NET experience has GREATLY improved my appreciation of the usefulness, maintainability and sheer beauty of a well-written Rake build. A build in a real programming language is AWESOME. It is my [now well-informed] opinion that this is one of the things the corporate IT sector should learn in the build/deployment area - replace Ant and MSBuild with tools that use real programming languages for expressing the build logic. I.e., Rake or similar.

Why would you want to do it? Here is an example. On the aforementioned Java project we needed to include a piece of logic into the build that would behave roughly as follows:

* Grab build artefacts for a specified build from Cruise
* Copy them to one or several servers (one for test environments, several for production)
* Unpack some of them on the target box(-es)
* Flip the symlinks
* Gracefully handle a failure of any step
* Be mindful about the fact that the target environment may or may not be a cluster.

This task was handled by a fellow ThoughtWorker, a very smart guy with substantial platform experience in the Java land. Since the client asked us to avoid introducing too many alien artifacts, he did it all in Ant. As far as I remember, it took something like two days to sort it all out. The last two bullet points proved particularly hard. Apparently (and what a surprise!), XML is not the best way to describe loops and conditional statements.

Certain parts of the solution were a great laugh (of a bitter/cynical variety). The Daily WTF doesn't have a Hall of Fame, but if it did, this stuff should be featured there. Once again, we are talking about a very smart guy - many programmers could not even find a way to do it, I suspect.

Had we not been constrained in the choice of tools, we would have gone for Rake/Vlad, and been done with it in under two hours, conservatively speaking - been there, done that. Overall, I think that a Rake build for that project (a fairly straightforward Java web app) would be four to five times smaller, an equivalent number of times cheaper to create, and noticeably easier to maintain, too, despite being an alien artifact for the support team. The benefits of using Ant (which supposedly makes it easier to compile a bunch of Java classes and make a war file out of them) proved imaginary as soon as we started addressing automated deployment requirements.
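To give a feel for why loops and conditionals matter here, below is a minimal plain-Ruby sketch of the deployment logic from the list above, the kind of thing that could sit inside a Rake task body. It is not the build we actually wrote, and it uses neither the Rake nor the Vlad API. Everything in it is an assumption made for illustration: the step names, the `runner` callable (which a real build would implement with scp and ssh), the `:build_server` label, and the two-phase "stage everywhere, then flip" approach to clusters.

```ruby
# Raised when any deployment step fails on any host.
class DeployError < StandardError; end

# Deploys one build to one host (test environment) or several (cluster).
# `runner` is a hypothetical callable taking (host, step) and returning
# true on success; a real build would have it shell out to scp and ssh.
def deploy(hosts, runner)
  unless runner.call(:build_server, :fetch_artifacts)
    raise DeployError, "fetch_artifacts failed"
  end

  # Phase 1: stage the new release on every host, leaving the live one alone.
  hosts.each do |host|
    %i[copy_artifacts unpack].each do |step|
      raise DeployError, "#{step} failed on #{host}" unless runner.call(host, step)
    end
  end

  # Phase 2: flip symlinks only once every host is staged. If a flip fails,
  # restore the hosts that already switched so the cluster stays consistent.
  flipped = []
  hosts.each do |host|
    if runner.call(host, :flip_symlinks)
      flipped << host
    else
      flipped.each { |done| runner.call(done, :restore_symlinks) }
      raise DeployError, "flip_symlinks failed on #{host}; restored #{flipped.inspect}"
    end
  end
  flipped
end

# Dry run with a runner that only logs, so the sketch works without servers.
log = []
dry_run = ->(host, step) { log << "#{host}: #{step}"; true }
deploy(%w[prod1 prod2], dry_run)
puts log
```

A single test box and a production cluster go through the same code path: the only difference is the length of the `hosts` array, which is exactly the kind of thing that is painful to say in XML.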

Two days of extra work is not such a big deal, if it is just an isolated incident - but it wasn't. A couple of days here, a day there - these costs do accumulate. By the way, before anyone asks - yes, we did look into Maven 2. Outcome: it's quite possible to coerce me into using Ant for another Java project, but I would flatly refuse to touch Maven with a ten-foot pole - consultant ethics demand it. And yes, I do have logical arguments to support this position. It just isn't the subject of this post.

If memory serves, it's been 5 or 6 years since the creator of Ant publicly stated that programming in angled brackets was a bad idea. And for at least five years we have had some great alternatives to it coming from the dynamic languages community. These things are neither new nor unproven anymore. They offer very tangible benefits - that's the point. Once you know about these alternatives, the only real reason to continue using Ant (or NAnt, or MSBuild) for any build that goes beyond "compile and run unit tests" is inertia. Which, up to some point, is a perfectly good reason - new shiny toys tend to have a nasty flipside. But the point for switching from things like Ant to things like Rake is behind us, in this pundit's opinion. The question is, when are we, as an industry, finally going to do it?