That Russian who never sleeps

Many years ago, yours truly was remembered (fondly, I hope) by some people in Jakarta, Indonesia as "that Russian who never sleeps". This is how it happened.

A company I worked for was a vendor of telecom billing and customer care solutions. At that point in time, we had just produced a brand new version of the product, and our gig in Indonesia was basically the first attempt to implement it in production. As if that didn't sound scary enough already, one of the centerpieces of the solution was a prepaid pricing engine. Newly minted, as was the rest of the system. The difference between postpaid and prepaid billing is terrifying. If something goes wrong with postpaid billing, you can always roll it back, fix the problem and rerun the file with the call records again. If prepaid billing misbehaves, within minutes several million people go incommunicado.

So, for several months before going live, we (the implementation project team) tested the heck out of the product, from all sorts of angles, placing special emphasis on the prepaid pricing component. By the end of it, everything looked ship-shape. We put the system into limited production, supporting just the telco's employees, then added a few thousand new subscribers to it. Everything was still as fine and dandy as could be. And then we did the first round of data migration, transferring half a million subscribers from the legacy system to ours. And hell ensued. The prepaid billing part of the system worked fine. But most of the rest of it would go down with an OutOfMemoryError for an hour a day, during the peak hour, every workday, for several days.

To cut a very long story short, the root cause was as follows. The system was killed by overload of the Balance Query service. Subscribers could dial some magic number on their phones and immediately see how much money they had on their account. Scalability of this service was just good enough to support half a million subscribers during most of the day, except the peak hour of a workday. Yes, we had tested scalability rather thoroughly, including that service. It proved several times more scalable in tests than our customer wanted it to be. There was a slight problem, though. Scalability targets were derived from an analysis of the legacy system logs, and this analysis overlooked one thing. In the legacy system, balance queries were a paid service. In the new one, the marketing people made them free. Indonesia is not exactly a rich country, and the cellphone bill is a noticeable item in most family budgets there. So, within days of the launch, users formed a habit of making balance queries after every other phone call. As a result, the balance query service was called MUCH MORE often than projected. Whoops!

Lessons learned:

* Don't trust load forecasts based on the legacy system. Unless the new system is a straight port, the actual usage patterns will be different, perhaps drastically.

* Avoid closed-source middleware like the plague, especially from small vendors. In our case, the scalability bottleneck was caused by an ORM creating massive object churn (and as some of you might remember, the Java 1.3 garbage collector was not terribly concurrent). The ORM vendor sold their product to a much larger company and ceased to exist. The new owner refused to support old versions and hadn't released their own version yet. In the days before JProfiler and SAP Heap Analyzer, tracking down a problem like this was bloody damn hard to begin with. Doing it without the source code of the piece actually causing the problem was a guessing game. By now, I must admit, my subconscious mind has a huge anti-middleware bias - this project was how it started to develop.

* First and second level support of large middleware and operating system vendors is useless. First level support just answers the phone. Second level support deals with issues that can be found in the knowledge base - and normally you can find those quicker yourself. If Google comes back with nothing, your only chance to resolve anything is talking to the third level support. This is where the people who actually solve problems for a living happen to be. Getting there in less than two months normally requires escalation to the senior management, if not executive level. If you are important enough for that, fine. Otherwise, tough luck.
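To make the object churn lesson a bit more concrete: here is a minimal, purely hypothetical Java sketch (all names and numbers are mine, not from the actual system or its ORM) of the allocation pattern that tends to hurt. Every query materializes a fresh result object, plus wrappers and a list, all of which become garbage the moment the call returns - harmless once, brutal at half a million subscribers querying during the peak hour.

```java
import java.util.ArrayList;
import java.util.List;

public class ChurnSketch {

    // Hypothetical DTO an ORM might materialize for every balance query.
    static class BalanceRow {
        final String msisdn;
        final long balanceCents;

        BalanceRow(String msisdn, long balanceCents) {
            this.msisdn = msisdn;
            this.balanceCents = balanceCents;
        }
    }

    // Simulates one balance query: a fresh row object and a fresh result
    // list are allocated per call, and both are garbage immediately after.
    static long handleBalanceQuery(String msisdn) {
        List<BalanceRow> result = new ArrayList<>();
        result.add(new BalanceRow(msisdn, 12_345L)); // value is made up
        return result.get(0).balanceCents;
    }

    public static void main(String[] args) {
        // A million queries, e.g. lots of subscribers each checking their
        // balance during the peak hour: millions of short-lived objects,
        // which an old, mostly stop-the-world collector handles poorly.
        long total = 0;
        for (int i = 0; i < 1_000_000; i++) {
            total += handleBalanceQuery("62" + i);
        }
        System.out.println(total);
    }
}
```

The per-call allocations are trivial by themselves; the trouble is the multiplication by call rate, which is exactly why the usage-pattern surprise above turned a passing load test into a daily OutOfMemoryError.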

As an aside, I really loved working side by side with the locals on that gig. Some of them literally had their nice, well-paying jobs riding on the success of the project. And still, they were able to cope with the crisis without even losing their sense of humor. Not to mention a surprising number of really competent people in that particular IT organization. Terima kasih :)

