Goodbye ThoughtWorks, hello Scribd

This is to inform the two and a half people reading this blog that I left my consulting career at ThoughtWorks for the Silicon Valley Internet startup scene, joining Scribd.

ThoughtWorks is by far the best place to be a software consultant, and I still love these folks. But commuting across the continent sucks, especially when one has teenage kids.

Scribd is a service that allows users to publish documents in a form that can be embedded directly into a web page. There is an interesting set of Internet-scale technical problems (that's Enterprise-scale * 10^3), and [hopefully] none of the non-technical problems that come with a consulting job. Definitely no travel (yay!)

Although Scribd is (naturally) in San Francisco, I am staying in Calgary, and my lunch schedule is completely free right now :) Feel free to give me a holler.


That Russian who never sleeps

Many years ago, yours truly was remembered (fondly, I hope) by some people in Jakarta, Indonesia as "that Russian who never sleeps". This is how it happened.

A company I worked for was a vendor of telecom billing and customer care solutions. At that point in time, we had just produced a brand new version of the product, and our gig in Indonesia was basically the first attempt to implement it in production. As if that didn't sound scary enough already, one of the centerpieces of the solution was a prepaid pricing engine. Newly minted, as was the rest of the system. The difference between postpaid and prepaid billing is terrifying. If something goes wrong with postpaid billing, you can always roll it back, fix the problem and rerun the file with the call records again. If prepaid billing misbehaves, within minutes several million people go incommunicado.

So, for several months before going live, we (the implementation project team) tested the heck out of the product, from all sorts of angles, placing special emphasis on the prepaid pricing component. By the end of it, everything looked ship-shape. We put the system into limited production, supporting just the telco employees, then adding a few thousand new subscribers to it. Everything was still as fine and dandy as can be. And then we did the first round of data migration, transferring half a million subscribers from the legacy system to ours. And hell ensued. The prepaid billing part of the system worked fine. But most of the rest of it would go down with an OutOfMemoryError for an hour a day, during the peak hour, every workday, for several days.

To cut a very long story short, the root cause was as follows. The system was killed by overload of the Balance Query service. Subscribers could dial some magic number on their phones, and immediately see how much money they had on their account. Scalability of this service was just good enough to support half a million subscribers during most of the day, except the peak hour of a workday. Yes, we had tested scalability rather thoroughly, including that service. It proved several times more scalable in tests than our customer wanted it to be. There was a slight problem though. Scalability targets were derived from an analysis of the legacy system logs. This analysis overlooked one thing. In the legacy system, balance queries were a paid service. In the new one, their marketing people made it free. Indonesia is not exactly a rich country, and a cellphone bill is a noticeable item in most family budgets there. So, within days of the launch, users formed a habit of making balance queries after every other phone call. As a result, the balance query service was called MUCH MORE often than projected. Whoops!

Lessons learned:

* Don't trust load forecasts based on the legacy system. Unless the new system is a straight port, the actual usage patterns will be different, perhaps drastically.

* Avoid closed-source middleware like the plague, especially from small vendors. In that case, the scalability bottleneck was caused by an ORM creating massive object churn (and as some of you might remember, the Java 1.3 garbage collector was not terribly concurrent). The ORM vendor sold their product to a much larger company and ceased to exist. The new owner refused to support old versions, and hadn't released their own version yet. In the days before JProfiler and SAP Heap Analyzer, tracking down a problem like this was bloody damn hard to begin with. Doing it without the source code of the piece actually causing the problem was a guessing game. By now, I must admit, my subconscious mind has a huge anti-middleware bias - this project was how it started to develop.

* First and second level support of large middleware and operating system vendors is useless. First level support just answers the phone. Second level support deals with issues that can be found in the knowledge base - and normally you can find it quicker yourself. If Google comes back with nothing, your only chance to resolve anything is talking to the third level support. This is where people who actually solve problems for a living happen to be. Getting there in less than two months normally requires escalation to the senior management, if not executive level. If you are important enough for that, fine. Otherwise, tough luck.

As an aside, I really loved working side by side with the locals on that gig. Some of them literally had their nice, well-paying jobs hinging on the success of the project. And still, they were able to cope with the crisis without even losing their sense of humor. Not to mention a surprising number of really competent people in that particular IT organization. Terima kasih :)


The keyword in "premature optimization is evil" is "premature"

Some friends of mine are working in a small IT organization, trying to adopt agile principles and methods. Recently, I sort of sold them on the "make it work, then make it fast" principle. Next thing that happened: they wrote some code, put it into production, and it proved to be too slow. Apparently, significant rework is needed now. Here is what I probably should have told them about avoiding premature optimization (and didn't).

Performance needs to be tested. "Make it work, then make it fast" doesn't say "make it work *in production*, then make it fast". The right sequence generally is "make it work, then make it fast, then put it into production".

Also, performance testing should be continuous, or at least ongoing. "Make it work first, then make it fast" doesn't mean "build the entire application first". If you are working on a story, and that story has interesting performance requirements (in my friends' case, half of their stories are probably like that), it's better to figure out performance requirements for the story, and actually test your implementation for meeting those requirements *as part of the story itself*. If for some reason it cannot be done, at least plan to do it as early as possible.
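To make "test performance as part of the story" a bit more concrete, here is a minimal sketch in Python. Everything in it is an assumption made up for illustration: the function find_matching_orders stands in for whatever the story implements, and the half-second budget stands in for whatever requirement you agreed on with the customer.

```python
import time


def find_matching_orders(orders, customer_id):
    """Hypothetical story implementation: pick out one customer's orders."""
    return [order for order in orders if order["customer_id"] == customer_id]


def measure_seconds(func, *args, repeats=5):
    """Return the best wall-clock time over several runs, to dampen noise."""
    best = float("inf")
    for _ in range(repeats):
        started = time.perf_counter()
        func(*args)
        best = min(best, time.perf_counter() - started)
    return best


def test_story_meets_performance_budget():
    # Assumed data volume and budget; take real ones from the story itself.
    orders = [{"id": i, "customer_id": i % 1000} for i in range(100_000)]
    budget_seconds = 0.5
    elapsed = measure_seconds(find_matching_orders, orders, 42)
    assert elapsed < budget_seconds, f"took {elapsed:.3f}s, budget is {budget_seconds}s"
```

A check like this runs with the rest of the story's tests, so a slow implementation shows up before production does the measuring for you.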

It's probably worth talking about the historical context in which it was originally observed that "premature optimization is the root of all evil". As far as I understand, it was said in the 1970s. CPU cycles were an expensive commodity back then. So expensive that they were a real show-stopper for many software projects. This taught people to be extremely careful about wasting CPU cycles. And as it happens with any lesson of this kind, some took it to the other extreme and sacrificed a lot of readability for marginal gains in performance. That's what 90% of premature optimization stood for back then, and the advice behind the "root of all evil" observation was to create a readable and easy to maintain implementation first, then measure its performance, then optimize it, if that proves necessary.

These days, the situation with CPU cycles is very different (they are cheap and plentiful), and some of us may have learned the lesson of avoiding premature optimization a little too well. To the point of avoiding optimization altogether, until the need for faster code becomes painfully obvious.

To summarize, the idea behind the "make it work first" principle is to trade maintainability for speed only when you have some proof that such a trade-off is necessary. Usually, the only way to prove it is to build something and test it. So, make it work first, check if it is fast enough as early as possible, then make it faster if still needed.

Let's use real languages for builds

For the last few months, I've been away from the Ruby land, doing first a Java and now a .NET project. I also had to write a few hundred lines of Python. Hold on with condolences - it wasn't a bad thing at all. I actually loved the opportunity to tinker with other mainstream platforms and see how things are there these days. Besides, JetBrains makes wonderful tools. Besides, next time someone tells me Rails has too much magic, I'll be laughing - compared to Spring/Hibernate? - give me a break! But I digress.

This is to report that recent Java/.NET experience has GREATLY improved my appreciation of the usefulness, maintainability and sheer beauty of a well-written Rake build. A build in a real programming language is AWESOME. It is my [now well-informed] opinion that this is one of the things the corporate IT sector should learn in the build/deployment area - replace Ant and MSBuild with tools that use real programming languages for expressing the build logic. I.e., Rake or similar.

Why would you want to do it? Here is an example. On the aforementioned Java project we needed to include a piece of logic into the build that would behave roughly as follows:

* Grab build artefacts for a specified build from Cruise
* Copy them to one or several servers (one for test environments, several for production)
* Unpack some of them on the target box(-es)
* Flip the symlinks
* Gracefully handle a failure of any step
* Be mindful about the fact that the target environment may or may not be a cluster.

This task was handled by a fellow ThoughtWorker, a very smart guy with substantial platform experience in the Java land. Since the client asked us to avoid introducing too many alien artifacts, he did it all in Ant. As far as I remember, it took something like two days to sort it all out. The last two bullet points proved particularly hard. Apparently (and what a surprise!), XML is not the best way to describe loops and conditional statements.

Certain parts of the solution were a great laugh (of a bitter/cynical variety). The Daily WTF doesn't have a Hall of Fame, but if it did, this stuff should be featured there. Once again, we are talking about a very smart guy - many programmers could not even find a way to do it, I suspect.

If we were not constrained in the choice of tools, we would go for Rake/Vlad, and be done with it in under two hours, conservatively speaking - been there, done that. Overall, I think that a Rake build for that project (fairly straightforward Java web app), would be four to five times smaller, an equivalent number of times cheaper to create, and noticeably easier to maintain, too, despite being an alien artifact for the support team. Benefits of using Ant (which supposedly makes it easier to compile a bunch of Java classes and make a war file out of them) proved imaginary as soon as we started addressing automated deployment requirements.
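To give a feel for why loops and failure handling are cheap in a general-purpose language, here is a rough sketch of the two hard bullet points. It is in Python rather than Rake, purely for illustration, and it is not the build we actually wrote. The names deploy and recording_step are made up, and the steps are fakes that record calls instead of touching servers.

```python
def deploy(build_id, hosts, steps, rollback):
    """Run each step on each host; undo every touched host if a step fails.

    hosts    -- one host for a test environment, several for a cluster
    steps    -- ordered (name, function) pairs; each function takes
                (host, build_id) and raises an exception on failure
    rollback -- function taking (host, build_id) that restores a host
    """
    touched = []
    for host in hosts:
        touched.append(host)
        for name, step in steps:
            try:
                step(host, build_id)
            except Exception as error:
                # Undo in reverse order: the last host touched is restored first.
                for done in reversed(touched):
                    rollback(done, build_id)
                return f"{name} failed on {host}: {error}"
    return "ok"


def recording_step(log, action, fail_on=None):
    """Build a fake step that records its call, or fails on one chosen host."""
    def step(host, build_id):
        if host == fail_on:
            raise RuntimeError("simulated failure")
        log.append((action, host))
    return step
```

The cluster-or-not question is just the length of the hosts list, and graceful failure handling is an ordinary try/except. Real steps would call out to scp, ssh and the like, which is the part Rake/Vlad already provides.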

Two days of extra work is not such a big deal, if it is just an isolated incident - but it wasn't. A couple of days here, a day there - these costs do accumulate. By the way, before anyone asks - yes, we did look into Maven 2. Outcome: it's quite possible to coerce me into using Ant for another Java project, but I would flatly refuse to touch Maven with a ten-foot pole - consultant ethics demand it. And yes, I do have logical arguments to support this position with. It just isn't the subject of my post.

If memory serves, it's been 5 or 6 years since the creator of Ant publicly stated that programming in angled brackets was a bad idea. And for at least five years we have had some great alternatives to it coming from the dynamic languages community. These things are neither new, nor unproven anymore. They offer very tangible benefits - that's the point. Once you know about these alternatives, the only real reason to continue using Ant (or NAnt, or MSBuild) for any build that goes beyond "compile and run unit tests" is inertia. Which, up to some point, is a perfectly good reason - new shiny toys tend to have a nasty flip side. But that point for switching from things like Ant to things like Rake is past us, in this pundit's opinion. Question is, when are we, as an industry, finally going to do it?


On testing tools

Here is a scenario that seems to play out all too often.

A young software development organization decides to address the lack of quality in the software they produce by adding some testers. They hire a bunch of people (usually with little or no development skills) to do some variation of manual software testing. Before long, the testers realize that critical regression defects are a particularly dangerous thing, because they can be introduced in parts of the application that already made it through the manual testing and aren't eyeballed often enough. Enter manual regression testing. This amounts to maintaining lists of things to test, and performing those tests by hand, time after time, after time, after time.

Everything fine and dandy so far. There are a couple of annoying problems with manual regression testing. It is a time-consuming activity, not particularly rewarding, and when you repeat the same sequence of tests the fifth time, simply boring. Management also doesn't like the fact that it takes their testers a week to bless a new release. This becomes painfully obvious when some urgent patch needs to be pushed out into production, and there is no way to do it other than just pushing the deploy button and praying. Therefore, someone (management or testers themselves) comes up with the idea to automate regression tests. This task, naturally, falls to the testing department. What does the testing department do? They start looking for a testing tool, of course!

The decision matrix for selecting a tool typically contains these two heavyweight factors: (1) should be able to drive our application frontend; (2) should be usable by a non-programmer (since there are no programmers in the testing department). Eventually, they come across a sales brochure for one of a breed of commercial products (hmm... let's call it Irrational Droid... or Pluto MacWalker). The brochure basically says: here is a tool. You point it to your app, click some buttons, the tool remembers what you did, spits it out as a script, and opens it in an editor. You sprinkle some assertions, maybe some parameterization, and - voila! - you have a test suite. It's all so simple, a monkey could do it. And it costs a mere $100,000.

The brochure may not mention a few minor things. The tool uses a home-grown scripting language, poorly designed and buggy. The standard library of the language consists of two and a half libraries, which is more than enough for a sales demo. Encapsulation is not supported. The text editor provided with the tool is worse than Notepad, which by the way cannot be used because some parts of the script are saved in a binary file. Version control cannot be used for the same reason. External libraries written in another language can be attached (through some sort of Rube Goldberg device), and may even work. Or some other heinous design compromises to appease capture/playback gods.

However, it does know how to drive the application front-end, and can be used by a non-programmer to do test automation. Since those two constraints have the heaviest weights in the decision matrix (and the latter, incidentally, excludes all the open source tools, written by programmers for programmers), the Droid wins the contest. Now this testing department invests some non-trivial amount of money in licenses and training, and starts using the tool. Eventually, they discover that doing anything significantly bigger than a sales demo is an exercise in anger management, and the resulting suite is broken within two weeks, because something changed in the app. None of the developers on the team are willing to touch the Droid with a ten-foot pole to fix it, and the licenses are too expensive to give to everyone on the team, anyway. So, the only practical recourse is to recapture the whole shebang. After several attempts to salvage the situation, the automation project is abandoned, and the lesson is learned: test automation is too expensive, and not worth it.

What went wrong here? In my experience, it always seems to start with the notion that some expensive tool would allow a non-programmer to do test automation. This doesn't work for the very same reason tools-driven approaches don't work in normal software development. In fact, test automation is not at all different from any other software development, so it's not even surprising.

How do you avoid this situation? First of all, accept that it takes 90% development skill and 10% testing skill to do test automation. A non-programming tester, with a few days of training, can certainly contribute tests to an existing suite, but don't expect him or her to design the underlying automation framework well on their own. Therefore, staff this exercise accordingly. And when evaluating technologies to aid your automation effort, look for something that people in the development team would be comfortable with. What you normally need is a library that can drive the front-end of your application, in a language that developers already know, or would at least not hate learning. Most of the time, particularly when your app is a web app, you should not spend money on tool licenses - there are perfectly adequate open-source alternatives.

My next post will be about performance/scalability testing strategies.


Functional test automation through UI is hard, part 2

In the previous post, I described a mismatch in the abstraction level between automated UI tests and business requirements that those tests purport to express. I also promised to tell something about how to deal with it. First, let me admit (with a due dose of regret) that I've no idea how to make the problem go away.

Automation is expensive. Automation costs can easily be so big that manual regression testing would be cheaper and more efficient. This is especially true for UI tests (they are easy to perform manually, and expensive to automate). So, the art of test automation is to be selective about what you automate, and learn how to cut down the costs of automation.

Avoid UI tests

I don't mean to avoid them altogether here. Recognize situations where test automation through UI is too expensive, and just don't do it then.

Limit the scope

It is often wise to use automated UI tests for smoke testing only. Rely on manual testing to make sure that all your buttons are aligned, dropdown lists have the right values, and so on. Use automated tests as a fast fail mechanism, protecting the development team against commits that completely break the application. If your application has 10 primary use cases, write 10 UI tests, each covering a success path through one of these cases, and stop there - until you really feel the need to add more UI tests.

Opt for service layer tests whenever possible

OK, we all know that manual regression testing is slow, mind-numbing, and costly. Luckily, there is another way to automate tests. The service layer should have an API that fits the business problem quite nicely. Therefore, it's a great medium to automate tests that are about some kind of transaction processing, or workflow, or business rules, and not really about the UI behavior per se. So, if your app has some complex back-end logic, and a dumb UI (which, for the sake of maintainability, is how business software should strive to be), write automated functional tests that talk to the service layer directly.

Functional testing through the service layer should not be confused with unit tests. The kind of tests I'm talking about here should be driving the entire application, minus the (dumb) UI, and the purpose is to cover integration bugs that may creep into the glue between your carefully unit-tested classes.
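For illustration only, this is roughly what a functional test against a service layer can look like. The sketch is in Python, and AccountService is a made-up, in-memory stand-in. In a real application the service would sit on top of the actual database and the rest of the back end, which is exactly what makes such a test an integration test rather than a unit test.

```python
class InsufficientFunds(Exception):
    """Raised when a transfer would overdraw the sender's account."""


class AccountService:
    """Hypothetical service layer with an API shaped like the business problem."""

    def __init__(self):
        self._balances = {}

    def open_account(self, owner, deposit):
        self._balances[owner] = deposit

    def transfer(self, sender, receiver, amount):
        if self._balances[sender] < amount:
            raise InsufficientFunds(sender)
        self._balances[sender] -= amount
        self._balances[receiver] += amount

    def balance(self, owner):
        return self._balances[owner]


def test_transfer_moves_money_between_accounts():
    service = AccountService()
    service.open_account("joe", 100)
    service.open_account("jane", 20)
    service.transfer("joe", "jane", 30)
    assert service.balance("joe") == 70
    assert service.balance("jane") == 50


def test_overdraft_is_rejected_and_changes_nothing():
    service = AccountService()
    service.open_account("joe", 10)
    service.open_account("jane", 0)
    try:
        service.transfer("joe", "jane", 30)
    except InsufficientFunds:
        pass
    else:
        raise AssertionError("expected the transfer to be rejected")
    assert service.balance("joe") == 10
    assert service.balance("jane") == 0
```

Note that the tests read in business terms (open an account, transfer money) and never mention a button or a text box.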

Postpone automating UI tests

In any new functionality, the UI has the most bugs and is most likely to change at the later stages of development, when customers put their hands on the app and realize what else they need to change. If you implement an automated UI test early on, it is virtually guaranteed to need significant rework, too. Incidentally, new functionality in active development is regularly eyeballed by humans. So regression bugs in it are caught early, anyway. It's new regression bugs in the old functionality that go unnoticed until someone in production support gets that proverbial 3am phone call.

Therefore, do not automate UI tests for new functionality until it's past an active test/change feedback loop or two. Typically, this means a couple of weeks or even a month after the first time a feature was submitted to testing.

Don't use UI level tests to drive development

Corollary to the above. As executable specifications, automated UI tests are grossly expensive and have to specify too many details too early in the process. It's just another form of the Big Design Up-Front anti-pattern.

If you can't avoid it, then do it right!

So far, I covered a few possible ways to avoid automating tests through UI. Now let's talk about another angle. If you can reduce the cost of UI test automation, you won't need to avoid it so much. :) And here are some suggestions about that.

Treat test automation as software development

Test automation is software development first, and testing second. Doing it takes as much sophistication in development tools and techniques, as writing production code.

Assign the task to a programmer

Most testing departments trying to introduce test automation make this mistake - they assign the task to a tester. The right mix of skills required to do automation is 10% testing and 90% software development. If you happen to have someone in your testing department who could do a convincing job as a senior developer, awesome. Otherwise, you are much better off recruiting someone from development to do it. You can train testers to do it, but the chances of someone with no development experience figuring it out on their own look slim (I have seen a few failures, and no success stories so far).

Use version control

Putting your test suite in the same version control repository as your production code is the most basic thing you can do to bring some order to the affair. As silly as it sounds (to a developer), most first-time test automation initiatives I've seen didn't use version control. In fact, I've seen at least two major commercial test automation tools that didn't even support version control (along the lines of "large binary file that has to be present, cannot be built from sources, and changes every time you run a test"). To me, that looks like reason enough to avoid using the tool. Once again: whenever you are developing software, YOU MUST use version control.

Have a design

So, there is this huge abstraction mismatch between functional requirements and UI constructs, which I mentioned before. To bridge abstraction mismatches, software is structured in layers. Since test automation is software development, the same principle applies. Typically, you want to have at least three layers - the test script, the UI map, and the UI driver. The UI driver is basically a library to manipulate the UI controls directly (e.g., Selenium or Watir). Test scripts should be written in terms that match analyst thinking as closely as possible. I think you should *literally* try to make them readable by a non-technical business analyst, or customer. The UI map is the layer that converts actions described by test scripts into UI driver calls. For example, a test script can read like "create user profiles for Joe and Jane, search for Joe, Joe's profile should be in the search results". The UI map would know that "search for Joe" means writing "Joe" in the text box with id of "search_query", clicking on a button with id of "search_button", and making sure that we landed on a page with the URL of /search?query=Joe and title "MyApp - Search Results for Joe", rather than "HTTP 500 - Internal Server Error".
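Here is a small sketch of the three layers, in Python. Treat all of it as an assumption for illustration: FakeDriver is a toy that stands in for a real UI driver such as Selenium or Watir and simulates the application in memory, and the element ids "profile_name" and "create_profile_button" are invented (only "search_query" and "search_button" come from the example above).

```python
class FakeDriver:
    """UI driver layer. A toy stand-in that simulates the app in memory."""

    def __init__(self):
        self.fields = {}
        self.profiles = []
        self.results = []
        self.url = "/"
        self.title = "MyApp"

    def type_into(self, element_id, text):
        self.fields[element_id] = text

    def click(self, element_id):
        if element_id == "create_profile_button":
            self.profiles.append(self.fields["profile_name"])
        elif element_id == "search_button":
            query = self.fields["search_query"]
            self.url = f"/search?query={query}"
            self.title = f"MyApp - Search Results for {query}"
            self.results = [name for name in self.profiles if query in name]


class UiMap:
    """UI map layer: turns business-level actions into driver calls."""

    def __init__(self, driver):
        self.driver = driver

    def create_profile(self, name):
        self.driver.type_into("profile_name", name)
        self.driver.click("create_profile_button")

    def search_for(self, name):
        self.driver.type_into("search_query", name)
        self.driver.click("search_button")
        # Make sure we landed on the results page, not on an error page.
        assert self.driver.url == f"/search?query={name}", self.driver.url
        assert self.driver.title == f"MyApp - Search Results for {name}", self.driver.title

    def search_results(self):
        return list(self.driver.results)


def test_search_finds_existing_profile():
    """Test script layer: should read like the analyst's sentence."""
    app = UiMap(FakeDriver())
    app.create_profile("Joe")
    app.create_profile("Jane")
    app.search_for("Joe")
    assert "Joe" in app.search_results()
```

When an element id or page title changes, only the UI map needs fixing. The test script stays as it is.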

Do not rely on capture/playback

In the light of "test automation is software development", capture/playback is code generation, and of a bad kind. It produces badly designed, repetitive, unreadable, and therefore unmaintainable code. If you use captured script as a way to discover exactly what is going on between the browser and the server (i.e., as some sort of a network sniffer aware of higher-level protocols), that's OK. If you are checking this generated stuff into your version control repository verbatim, you are probably making a big mistake.

Use continuous integration

It should be obvious that if you have some automated tests, you will be better off running them as often as practically possible. For the sake of discovering regression bugs as soon as possible, but also for the sake of keeping the tests themselves relevant. When some change breaks an automated test, it's much easier to figure out what has changed, who made the change, and deal with it right there and then, instead of running the entire suite once a month and discovering that half of the tests are broken for all sorts of strange reasons. Having said that, UI tests (unlike unit and service-level integration tests) are usually too cumbersome for the developers to run as part of the regular pre-commit process. Therefore, they should not be included in the main continuous integration loop. It's better to run them in a separate CI loop and tolerate broken builds there.

Remember about the truck number

Automated tests should be maintainable by more than one person. Ideally, any developer (note: developer, not tester!) on the team should be able to do it. By the way, this puts some interesting constraints on the choice of automation tools. And since this text is already too long for a blog, and the plane I'm on is about to land, I'll just use this observation as a segue. Next time I'm on the plane, I'll try writing down my thoughts about automation tools. Which is a somewhat painful subject...


Must you always use a rich domain?

Ask a Java developer in my neck of the woods to create a web application (of the usual "shovel some data from Oracle to HTML" variety) and s/he will probably come up with an instant architecture. Some sort of MVC => Spring => domain layer => Hibernate => database. Blueprint done, let's go write some code now.

What can possibly be wrong with this? The Spring => domain layer => Hibernate part. There are many simple web applications where a full-blown domain persisted by a full-blown object-relational mapper and wired together by a full-blown dependency injection framework with aspect-oriented programming features provides no value, but has a big price in terms of complexity, scalability and long-term maintenance. And every complex web application that I have seen so far had some areas where a rich domain backed by an ORM was not the best way to go, either.

A classical example is a web application that is all about searching and viewing some stuff, stored in a relational database. Usually, there is some data manipulation involved, but it's all CRUD. Whenever this application has a domain, it is inevitably an anemic one - in other words, there are getters and setters, maybe some data validation logic, and not much interesting behavior. Unfortunately, there seems to be a blind spot where every greenfield application must have a domain layer complete with DI and ORM. Surely, all this byte-code manipulation voodoo and angled-brackety declarative configuration goodness must have some cost to it? And it does!

First, there is always a performance penalty to pay. The moment you decide to go with a full-scale ORM, you probably increase production hardware costs by at least 50%. I don't have any hard numbers to back this with, so this is just a subjective opinion of someone in the trenches who is considered "a performance dude" in ThoughtWorks. In the world where developer time is expensive and hardware is not, this is a great tradeoff, as long as you save developer time.

But this is the catch. When there is no rich domain, no need for distributed transaction management, or advanced caching strategies or other things of this nature, you may not be saving anything - quite the contrary. You just end up writing more code to wire all those decoupled layers together, running longer builds (a much bigger productivity killer than most people realize), having to deal with more interesting problems, reading much larger and less informative exception stack traces, and generally working harder than necessary.

And then there is the maintenance cost. Conceptual complexity, all this cool voodoo, is hard on production support. It creates more situations that regular support people can't cope with on their own and have to escalate to platform experts.

One major selling point for Hibernate is that it eliminates a lot of boilerplate JDBC code to map data from rows to objects. One day, Sun will hopefully bake something like LINQ into Java and the boilerplate data mapping issue will be gone for good. Until then, there are libraries out there that do just this (iBATIS comes to mind) at a tiny fraction of Spring+Hibernate complexity cost.

Now, there are people who take this to the other extreme, and just mix SQL with markup. Although it's a great design for a "Hello, World"-type system, I'm not radical enough to advocate this for anything bigger.

So, next time you start writing an application that is 90% data display and 10% CRUD manipulation -- if something like Rails or Django is not an option -- please at least think about the MVC => service => hand-coded SQL option.
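To show how little code the MVC => service => hand-coded SQL option can take, here is a minimal sketch. It uses Python and the standard sqlite3 module as stand-ins for the Java and Oracle stack discussed above, and the documents table, DocumentService and render_results are all invented for the example.

```python
import sqlite3


def open_database():
    """Create an in-memory database with a few sample rows (assumed schema)."""
    connection = sqlite3.connect(":memory:")
    connection.row_factory = sqlite3.Row  # rows become addressable by column name
    connection.execute(
        "CREATE TABLE documents (id INTEGER PRIMARY KEY, title TEXT, author TEXT)"
    )
    connection.executemany(
        "INSERT INTO documents (title, author) VALUES (?, ?)",
        [
            ("Rake for Java teams", "joe"),
            ("Billing at scale", "jane"),
            ("Ant survival guide", "joe"),
        ],
    )
    return connection


class DocumentService:
    """Service layer: hand-coded SQL in, plain dictionaries out."""

    def __init__(self, connection):
        self.connection = connection

    def search_by_author(self, author):
        rows = self.connection.execute(
            "SELECT id, title, author FROM documents WHERE author = ? ORDER BY title",
            (author,),
        ).fetchall()
        return [dict(row) for row in rows]


def render_results(documents):
    """View: turn the service's output into lines of text for display."""
    return [f"{doc['title']} (by {doc['author']})" for doc in documents]
```

A controller would just call DocumentService(open_database()).search_by_author(...) and hand the result to the view. There is no domain layer, no mapper configuration, and nothing to wire together.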