Thursday, October 1, 2009

Queueing Theory and Small Batches

This blog post is about Software Development Process. It is the result of 6 months of reflection on the differences between 2 projects that I have been on, followed by 3 days at the UK Lean conference. I discuss batching of work items and the effect that batch size has. It is quite long, and at the end I vent on Agile a bit.

My personal experience is that smaller batch sizes are significantly more efficient and I have been on a quest to find out why.

Since the early 21st century...

Iterative development methodologies have demonstrated that small batches offer improved performance over large batches for two main reasons. The first is that you get feedback with each batch, so you get to adapt and adjust to changing situations. When you only have one large batch, as in a traditional waterfall software development process, you only have one chance to get it right. People are rarely that good, and even if they are, it is very unlikely that the problem they were trying to solve is still exactly as their customers described it. The second reason that small batches have improved performance is that the incremental delivery of software gives the users the chance to incrementally create value from it.

When XP first became popular, many people said that it wasn't possible. They said that you would just end up doing lots of 're-work'. Effectively what they were saying is that the 'transaction-cost' of a piece of software development work is too high. With high transaction costs, you have to use economies of scale to make doing stuff economical, which means batching work up and doing it all in one go. Kent Beck's hypothesis with XP was that he could lower the transaction costs enough so that you could do the work in smaller batches and derive the benefits mentioned above.

You can take almost any aspect of the software development process and show how XP lowered the transaction cost and broke it up into small batches. For example, XP took the costly ceremony and documentation out of requirements gathering and replaced it with a face-to-face discussion that happened every week (I'm of the XP generation where an iteration was 1 week, not 2 weeks). We didn't need the ceremony because we only talked about 1 week's worth of requirements, and if we needed to, we could always talk to the customer directly mid-iteration.

Changing code also came with a high transaction cost, and it was much harder to reduce than the transaction cost of requirements gathering. The solution was automated testing, refactoring and 'doing the simplest thing that could possibly work' (or YAGNI). By reducing the complexity of the code base, we made it easy to change and therefore reduced the transaction cost. By automating tests that could run and find bugs as soon as they were introduced, we significantly reduced the high cost of fixing bugs in production, and once again reduced the transaction cost.

So XP reduced transaction costs, enabling the work to be done in small batches, which enabled the incorporation of feedback and the incremental delivery of software, which hopefully meant the incremental creation of value by the users. It didn't necessarily mean that the software got built faster; it meant that it was better (more of what was needed) and that the users could use some of it sooner.

Or perhaps it even meant that it got built faster as well...

Queueing theory has been around since 1909, when a mathematician called Agner Krarup Erlang at the Copenhagen Telephone Company wrote a paper applying probability and statistics to telephone switching problems. Using his methods, he was able to accurately estimate the probability that a call would be blocked at various levels of capacity utilization. Erlang discovered that capacity utilization increases queues exponentially and that variability of the items in the queue increases the queues linearly. Anyone who has been on a highway at rush hour has experienced this. When the highway has a high level of utilization, the cars drive closer together and a small amount of variation in the speed of one car can bring the traffic to a standstill. When the highway is less utilized, the variation is easily accommodated and traffic flows freely.
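To put rough numbers on the utilization effect, here is a minimal sketch. It assumes the textbook M/M/1 queue (one server, random arrivals, random service times), which is a simplification chosen for illustration and not Erlang's original blocking formula; the function name is made up. In that model the average number of items in the system is utilization / (1 - utilization). Strictly speaking that curve is hyperbolic rather than exponential, but the practical message is the same: queues explode as utilization approaches 100%.

```python
def mm1_items_in_system(utilization: float) -> float:
    """Average number of items in an M/M/1 system (waiting plus in service).

    Assumption: textbook M/M/1 model, used here purely for illustration.
    """
    if not 0 <= utilization < 1:
        raise ValueError("utilization must be in the range [0, 1)")
    return utilization / (1 - utilization)


if __name__ == "__main__":
    for rho in (0.5, 0.8, 0.9, 0.95, 0.99):
        items = mm1_items_in_system(rho)
        print(f"utilization {rho:.0%}: about {items:.1f} items in the system")
```

In this model, moving from 80% to 95% utilization takes the average from 4 items to 19, nearly five times as many, for a 15 point gain in utilization.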

We see similar queues in our software development processes because transaction costs lead to batching of items, which leads to high utilization in queues.

Transaction Cost -> Batching -> High Utilization

Some examples of queues and batches:

(Queue)->[Batch]
(Features)->[QA Testing]
(Features)->[Release]
(Changes to code base)->[Running of the Tests]
(Bugs)->[Developer fixing them]
(Requirements)->[Planning Meeting]

We have queues and batching in our processes, but do we observe the same slowdown in the system that Erlang observed in his switches? Yes. For example, consider the following side effects of large batches:

Large batches reduce efficiency and create rework! If you have a large team, and each team member is working on a different feature, a lot of communication is required to prevent each developer from 'bumping into' another developer in the code base. Also, with many features happening at one time, common functionality doesn't get pulled out and handled separately in a single location in the code base unless there is a lot of costly communication. Compare testing a large number of features for a release with testing a small number of features. With the large list, each time you find one bug you need to check through the list of all the other bugs found to make sure it isn't a duplicate. When the tester describes the bug to the developer, the developer probably doesn't remember working on it, since there is a good chance that it was implemented a significant amount of time before it was found.
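The duplicate-checking cost is easy to sketch. The following is an illustration under a simplifying assumption that is not from the projects described here: every newly found bug is compared against every bug already found in the same batch, and bugs from earlier batches are already fixed and closed, so they don't need checking. The function names are hypothetical.

```python
def pairwise_checks(bugs_in_batch: int) -> int:
    """Comparisons needed if each new bug is checked against all earlier ones."""
    return bugs_in_batch * (bugs_in_batch - 1) // 2


def total_checks(total_bugs: int, batch_size: int) -> int:
    """Total comparisons when the same bugs are found in batches of batch_size.

    Assumption: duplicates only need to be checked within a batch.
    """
    full_batches, remainder = divmod(total_bugs, batch_size)
    return full_batches * pairwise_checks(batch_size) + pairwise_checks(remainder)


if __name__ == "__main__":
    for size in (100, 50, 10, 1):
        checks = total_checks(100, size)
        print(f"100 bugs in batches of {size}: {checks} duplicate checks")
```

Under these assumptions, 100 bugs found in one big batch need 4950 comparisons, while the same 100 bugs found in ten batches of 10 need 450. The same n(n-1)/2 count applies to the communication paths between n developers working on features in parallel.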

Large batches create even larger batches. Consider the rework example above, where each developer produces their own solution to a common problem. This complicates the code base making it harder to change and increasing the transaction cost for adding a new feature. Before long, the code base is such a mess that the developers will be saying 'Just give us all the features that you need, we have to go lock ourselves in the conference room for the next 3 months and rewrite the entire system'. (Trust me, I've been there).

Large batches lower motivation and the sense of urgency. This point is perhaps the simplest, and has the potential to be the biggest performance factor of a team. Consider a developer with a task that is due in a month, and a developer with a task that is due by the end of the day. Which one is going to feel a greater sense of urgency? Even if the month-long task is 40 times more work than the 1 day task, there will be many other developers with a task that is due in a month, and the chances are that one of those guys is going to fail first and integration on that date probably isn't going to happen anyway. A developer's world changes pretty quickly, and the chances are pretty good that in 1 month things will be different.

The evidence is strong that larger batch sizes:
# Give you less feedback, causing you to build the wrong thing.
# Delay the point at which users can create value with your software.
And...
# Are less efficient than small batches anyway.

Ultimate batch size

If small batches are good, does that mean that we should strive to remove any batching? Yes, but only to a point. We can make our batches smaller by continuously driving down transaction cost (just as XP does), but transaction cost is our constraint on batch size.

In his book, 'The Principles of Product Development FLOW' (where most of this has been shamelessly ripped from), Don Reinertsen suggests that batch size reduction often has an even greater impact than you would think and that our estimates of optimal batch size are often too high. One reason for this is that our assumptions about transaction costs are often wrong and that they can be lowered beyond our expectations. This is often discovered when teams start lowering transaction costs and start to see the economics of it. Teams also start to find extra transaction costs in areas where they didn't expect to find them.

Don offers the heuristic that on average, hidden costs of large batch sizes are twice what people expect them to be and that optimum batch size is 70% of their expectations. He also shows that optimal queue utilization follows a 'flat bottomed U' function, enabling teams to make massive improvements to performance without having to do detailed analysis and get their numbers 'exactly right'.
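The 'flat bottomed U' shape is easiest to see in the batch size trade-off from the previous section. The sketch below assumes the classic economic-order-quantity style model: a fixed transaction cost shared across the items in a batch, plus a holding (delay) cost that grows in proportion to batch size. The model, the function names and the numbers are illustrative assumptions, not figures from Don's book.

```python
import math


def cost_per_item(batch_size: float, transaction_cost: float, holding_cost: float) -> float:
    """Transaction cost is shared across the batch; holding cost grows with its size."""
    return transaction_cost / batch_size + holding_cost * batch_size


def optimal_batch_size(transaction_cost: float, holding_cost: float) -> float:
    """Batch size that minimises cost_per_item in this simple model."""
    return math.sqrt(transaction_cost / holding_cost)


if __name__ == "__main__":
    transaction_cost, holding_cost = 100.0, 1.0  # hypothetical numbers
    best = optimal_batch_size(transaction_cost, holding_cost)
    best_cost = cost_per_item(best, transaction_cost, holding_cost)
    for factor in (0.25, 0.5, 1.0, 2.0, 4.0):
        size = best * factor
        penalty = cost_per_item(size, transaction_cost, holding_cost) / best_cost - 1
        print(f"batch size {size:5.1f}: {penalty:.1%} above the minimum cost")
```

In this model, missing the optimum by a factor of two in either direction costs only 25% extra, which is why teams don't need to get their numbers exactly right. The optimum also scales with the square root of the transaction cost, so every reduction in transaction cost pulls the best batch size down with it.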

Personal Experience

The science that Don is demonstrating is something that I have experienced first hand (and at the time didn't know why it was working). On 2 different projects lasting a combined 6 years, I pushed batch sizes as low as I could. I ran teams that built software in fast moving and demanding financial companies and I knew that I didn't want to break anything when I released, so I released as little as possible. Of course this also meant as often as possible, which was pretty much at the end of each trading day.

We drove down the transaction cost to make this economical and got really good at releasing software that didn't break because we only released a single day's worth of development. We didn't use any QA people (transaction cost) and we had the minimal amount of automated tests (transaction cost). We didn't batch up requirements either. Whenever we had time to work on new stuff for our users, we went to them and asked for new work. They never asked us for status reports (transaction cost) or estimates (transaction cost) because they either got it the next day, or they got something that proved that we were making progress.

We built lots of software in a low stress, high motivation environment. We spent the majority of our time working on the value added activities of 'adding new functionality' and 'reducing transaction-costs'.

The project that I am on now is a different story. We rarely meet our customers except when there is a problem in production. Stories are scheduled into 2 week iterations by intermediaries and are often dependent on other teams working in similar sized batches. Developers work on the same tasks for days at a time, spending a large portion of their time circumventing infrastructure and code quality issues. Our checkout/modify/test/checkin cycle time is painfully slow. QA is done by a single manual testing resource. We work on multiple branches of source code and switching between them takes minutes. Until recently, deployments were done manually to Windows machines and often included a 'post deployment configuration' step. Several versions of our software can exist in production at any one time, forcing us to maintain backwards compatibility with each of them.

It is a long way from being the worst project in the world. We are Agile. But I would estimate that our transaction costs are probably between 5 and 10 times higher in the larger batch size project, and that our percentage of transaction-cost work in the small batch size project was 40%. Do the math and it starts to make sense why Software Engineering departments are considered cost centers.
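One way to do that math is sketched below. It rests on an assumption about how to read the estimates: the same amount of value-added work carries 5 to 10 times as much transaction-cost work as it did when transaction costs were 40% of the effort. The function name is hypothetical and the result is only as good as the estimates that go in.

```python
def value_added_share(transaction_share: float, cost_multiplier: float) -> float:
    """Share of total effort left for value-added work.

    transaction_share: fraction of effort spent on transaction costs in the
        baseline (small batch) project, e.g. 0.4.
    cost_multiplier: how many times higher the transaction cost is for the
        same value-added work (assumed interpretation of '5 to 10 times').
    """
    value_work = 1 - transaction_share
    transaction_work = transaction_share * cost_multiplier
    return value_work / (value_work + transaction_work)


if __name__ == "__main__":
    for multiplier in (1, 5, 10):
        share = value_added_share(0.4, multiplier)
        print(f"transaction costs x{multiplier}: {share:.0%} of effort adds value")
```

Read this way, a team that spent 60% of its effort on value-added work in the small batch project would be left with roughly 23% at 5 times the transaction cost and roughly 13% at 10 times.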

Conclusion

Reducing the batch sizes in your project is like 'draining the swamp'. When you do it, you will see the non value adding transaction costs that were lurking below the surface. Find ways to reduce those costs, and then drain the swamp some more.

P.S.

Most Agile teams today almost certainly have higher transaction costs than early XP teams. Agile/Scrum/XP is going backwards in my opinion and missing the point. To be successful, software teams need to generate value by doing value added things, not unnecessary transaction cost work. The Agile manifesto says that we value individuals and interactions over processes and tools. Ok, sounds good. But why? How does that generate more value? Agile lost its focus, and I like the idea of using the science behind Don's work to bring it back.

The ideas behind the original Agile Manifesto were hijacked by the people who sell transaction cost. Don't believe me? Just google Agile and see the list of tool developers and consultants that comes up. Go to Amazon and look at the books on the subject. How many of them are about how to remove a transaction cost? How many are about adding a transaction cost?

I'm starting to think that the Agile movement has had its time in the spotlight. So what's after Agile? I am going to keep an eye on Don Reinertsen and his friends over at the Lean Software Consortium (but I do hope that they change the name).

5 comments:

kevin Taylor said...

Gareth,

Nice post. I agree. There is a game that I was taught by Jeff Patton that helps demonstrate queueing theory. You can read more about it on my blog: The Coin Flip Game

Keep your eye on kanban and lean. It is the right direction for many teams in the software industry.

reevesy said...

Thanks Kevin. I met Jeff for the first time at the conference.

floehopper said...

I strongly identify with your experience of these different project types. To me your article sums up the difference between XP and Agile. In the same way that Agile lacks a rigorous definition, it also seems to encourage a lack of rigour in process e.g. not driving transaction costs down.

You ask: "What's after Agile"? I'd ask a different question: "What was so wrong with XP that we needed Agile?".

Dave Hoover said...

Gareth, it's a strong post. I'd love to see it condensed a bit more... but maybe that's just because I'm familiar with your contexts. We need more posts like this that talk about smaller, faster ways to add value for our customers, and less posts that ignore the fact that we even have customers.

Christian said...

We work in small batches to avoid having to make decisions too early in the face of uncertainties. For one, circumstances change, and that can render our work useless. For another, however, humans are just bad at making decisions over longer time frames with unknown and/or fluctuating probabilities. There's an irony here, because our civilization (and arguably any civilization) is built on better ways of planning beyond where next week's food is going to come from. Nevertheless, I think we humans are just more effective when we're working towards better-defined, shorter term goals.