Thursday, October 1, 2009

Queueing Theory and Small Batches

This blog post is about Software Development Process. It is the result of 6 months' reflection on the difference between 2 projects that I have been on, followed by 3 days at the UK Lean conference. I discuss batching of work items and the effect that batch sizes have. It is quite long, and at the end I vent on Agile a bit.

My personal experience is that smaller batch sizes are significantly more efficient and I have been on a quest to find out why.

Since the early 21st century...

Iterative development methodologies have demonstrated that small batches offer improved performance over large batches for two main reasons. The first is that you get feedback with each batch, so you get to adapt and adjust to changing situations. When you only have one large batch, as in a traditional waterfall software development process, you only have one chance to get it right. People are rarely that good, and even if they are, it is very unlikely that the problem they were trying to solve is still exactly as their customers described it. The second reason that small batches have improved performance is that the incremental delivery of software gives the users the chance to incrementally create value from it.

When XP first became popular, many people said that it wasn't possible. They said that you would just end up doing lots of 're-work'. Effectively what they were saying is that the 'transaction-cost' of a piece of software development work is too high. With high transaction costs, you have to use economies of scale to make doing stuff economical, which means batching work up and doing it all in one go. Kent Beck's hypothesis with XP was that he could lower the transaction costs enough so that you could do the work in smaller batches and derive the benefits mentioned above.

You can take almost any aspect of the software development process and show how XP lowered the transaction cost and broke it up into small batches. For example, XP took the process of requirements gathering, took all of the costly ceremony and documentation out of it, and replaced it with a face-to-face discussion that happened every week (I'm of the XP generation where an iteration was 1 week, not 2 weeks). We didn't need the ceremony because we only talked about 1 week's worth of requirements, and if we needed to, we could always talk to the customer directly mid-iteration.

Changing code also came with a high transaction cost; it was much harder to reduce than the transaction cost of requirements gathering. The solution was automated testing, refactoring and 'doing the simplest thing that could possibly work' (or YAGNI). By reducing the complexity of the code base, we made it easy to change and therefore reduced the transaction cost. By automating tests that could run and find bugs as soon as they were introduced, we significantly reduced the high cost of fixing bugs in production, and once again reduced the transaction cost.

So XP reduced transaction costs, enabling the work to be done in small batches, which enabled the incorporation of feedback and the incremental delivery of software, which hopefully meant the incremental creation of value by the users. It didn't necessarily mean that the software got built faster; it meant that it was better (more of what was needed) and that the users could use some of it sooner.

Or perhaps it even meant that it got built faster as well...

Queueing theory has been around since 1909, when a mathematician called Agner Krarup Erlang at the Copenhagen Telephone Company wrote a paper applying probability and statistics to telephone switching problems. Using his methods, he was able to accurately estimate the probability that a call would be blocked at various levels of capacity utilization. Erlang discovered that capacity utilization increases queues exponentially and that variability of the items in the queue increases the queues linearly. Anyone who has been on a highway at rush hour has experienced this. When the highway has a high level of utilization, the cars drive closer together and a small amount of variation in the speed of one car causes the traffic to come to a standstill. When the highway is less utilized, the variation is easily accommodated and traffic flows freely.
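To make the utilization effect concrete, here is a minimal sketch that assumes the simplest textbook queue (a single server with random arrivals and service times, usually written M/M/1). The post does not name a model, so both the model and the method name average_queue_length are assumptions for illustration. In this model the average number of waiting items is utilization squared divided by (1 - utilization), which grows sharply and nonlinearly as utilization approaches 100%.

```ruby
# Average number of items waiting in an M/M/1 queue (assumed model).
# utilization is the fraction of capacity in use, 0 <= utilization < 1.
def average_queue_length(utilization)
  unless (0...1).cover?(utilization)
    raise ArgumentError, "utilization must be at least 0 and below 1"
  end
  utilization**2 / (1.0 - utilization)
end

# Print the expected backlog at a few utilization levels.
[0.5, 0.7, 0.8, 0.9, 0.95].each do |u|
  printf("utilization %.0f%% -> about %.1f items waiting\n",
         u * 100, average_queue_length(u))
end
```

Under this assumption, going from 80% to 90% utilization more than doubles the backlog, and going from 90% to 95% more than doubles it again.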

We see similar queues in our software development processes because transaction costs lead to batching of items, which leads to high utilization in queues.

Transaction Cost -> Batching -> High Utilization

Some examples of queues and batches:

(Queue)->[Batch]
(Features)->[QA Testing]
(Features)->[Release]
(Changes to code base)->[Running of the Tests]
(Bugs)->[Developer fixing them]
(Requirements)->[Planning Meeting]

We have queues and batching in our processes, but do we observe the same slowdown in the system that Erlang observed in his switches? Yes. For example, consider the following side effects of large batches:

Large batches reduce efficiency and create rework! If you have a large team, and each team member is working on a different feature, a lot of communication is required to prevent each developer from 'bumping into' another developer in the code base. Also, with many features happening at one time, common functionality doesn't get pulled out and handled separately in a single location in the code base unless there is a lot of costly communication. Compare testing a large number of features for a release with testing a small number of features. With the large list, each time you find one bug you need to check through the list of all the other bugs found to make sure it isn't a duplicate. When the tester describes the bug to the developer, the developer probably doesn't remember working on it, since there is a good chance that it was implemented a significant amount of time before it was found.

Large batches create even larger batches. Consider the rework example above, where each developer produces their own solution to a common problem. This complicates the code base making it harder to change and increasing the transaction cost for adding a new feature. Before long, the code base is such a mess that the developers will be saying 'Just give us all the features that you need, we have to go lock ourselves in the conference room for the next 3 months and rewrite the entire system'. (Trust me, I've been there).

Large batches lower motivation and the sense of urgency. This point is perhaps the simplest, and has the potential to be the biggest performance factor of a team. Consider a developer with a task that is due in a month, and a developer with a task that is due by the end of the day. Which one is going to feel a greater sense of urgency? Even if the month-long task is 40 times more work than the 1-day task, there will be many other developers with a task that is due in a month, and the chances are that one of those guys is going to fail first and integration on that date probably isn't going to happen anyway. A developer's world changes pretty quickly, and the chances are pretty good that in 1 month things will be different.

The evidence is strong that larger batch sizes:
# Give you less feedback, causing you to build the wrong thing.
# Delay the opportunity that users can create value with your software.
And...
# Are less efficient than small batches anyway.

Ultimate batch size

If small batches are good, does that mean that we should strive to remove any batching? Yes, but only to a point. We can make our batches smaller by continuously driving down transaction cost (just as XP does), but transaction cost is our constraint on batch size.

In his book, 'The Principles of Product Development FLOW' (where most of this has been shamelessly ripped from), Don Reinertsen suggests that batch size reduction often has an even greater impact than you would think and that our estimates of optimal batch size are often too high. One reason for this is that our assumptions about transaction costs are often wrong and that they can be lowered beyond our expectations. This is often discovered when teams start lowering transaction costs and start to see the economies of it. Teams also start to find extra transaction costs in areas where they didn't expect to find them.

Don offers the heuristic that on average, hidden costs of large batch sizes are twice what people expect them to be and that optimum batch size is 70% of their expectations. He also shows that optimal queue utilization follows a 'flat bottomed U' function, enabling teams to make massive improvements to performance without having to do detailed analysis and get their numbers 'exactly right'.
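The trade-off between transaction cost and batch size can be sketched with the classic economic lot size model. This is an assumption chosen for illustration, not a formula taken from the post: the fixed transaction cost is spread across the batch while the cost of holding (delaying) work grows with batch size. The method names and all of the numbers below are hypothetical.

```ruby
# Cost per item for a given batch size under the assumed model:
# transaction cost is shared by the batch, holding cost grows with it.
def cost_per_item(batch_size, transaction_cost:, holding_cost:)
  transaction_cost.to_f / batch_size + holding_cost * batch_size
end

# Batch size that minimises cost_per_item for these two costs.
def optimal_batch_size(transaction_cost:, holding_cost:)
  Math.sqrt(transaction_cost.to_f / holding_cost)
end

costs = { transaction_cost: 100.0, holding_cost: 1.0 } # hypothetical units
best = optimal_batch_size(**costs)

# Compare half the optimum, the optimum, and double the optimum.
[0.5, 1.0, 2.0].each do |factor|
  size = best * factor
  printf("batch %5.1f -> cost per item %.1f\n",
         size, cost_per_item(size, **costs))
end
```

In this sketch, missing the optimum by a factor of two in either direction raises the cost by only a quarter, which is the flat bottom of the U. Cutting the transaction cost to a quarter halves the optimal batch size.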

Personal Experience

The science that Don is demonstrating is something that I have experienced first hand (and at the time didn't know why it was working). On 2 different projects lasting a combined 6 years, I pushed batch sizes as low as I could. I ran teams that built software in fast moving and demanding financial companies and I knew that I didn't want to break anything when I released, so I released as little as possible. Of course this also meant as often as possible, which was pretty much at the end of each trading day.

We drove down the transaction cost to make this economical and got really good at releasing software that didn't break, because we only released a single day's worth of development. We didn't use any QA people (transaction cost) and we had the minimal amount of automated tests (transaction cost). We didn't batch up requirements either. Whenever we had time to work on new stuff for our users, we went to them and asked for new work. They never asked us for status reports (transaction cost) or estimates (transaction cost) because they either got it the next day, or they got something that proved that we were making progress.

We built lots of software in a low stress, high motivation environment. We spent the majority of our time working on the value added activities of 'adding new functionality' and 'reducing transaction-costs'.

The project that I am on now is a different story. We rarely meet our customers except when there is a problem in production. Stories are scheduled into 2-week iterations by intermediaries and are often dependent on other teams working in similar sized batches. Developers work on the same tasks for days at a time, spending a large portion of their time circumventing infrastructure and code quality issues. Our checkout/modify/test/checkin cycle time is painfully slow. QA is done by a single manual testing resource. We work on multiple branches of source code and switching between them takes minutes. Until recently, deployments were done manually to Windows machines and often included a 'post deployment configuration' step. Several versions of our software can exist in production at any one time, forcing us to maintain backwards compatibility with each of them.

It is a long way from being the worst project in the world. We are Agile. But I would estimate that our transaction costs are probably between 5 and 10 times higher in the larger batch size project, and that our percentage of transaction-cost work in the small batch size project was 40%. Do the math and it starts to make sense why Software Engineering departments are considered cost centers.
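One way to do that math, as a rough sketch: assume the amount of value-adding work stays the same and only the transaction-cost work is multiplied. That assumption is a reading of the numbers above rather than something the figures spell out, and the method name value_share is hypothetical.

```ruby
# Share of total effort left for value-adding work when the
# transaction-cost work from a baseline project is multiplied.
# baseline_transaction_share: fraction of effort spent on transaction
# costs in the baseline (0.4 for the small batch project above).
def value_share(baseline_transaction_share, multiplier)
  value_work = 1.0 - baseline_transaction_share
  transaction_work = baseline_transaction_share * multiplier
  value_work / (value_work + transaction_work)
end

[1, 5, 10].each do |multiplier|
  printf("transaction costs x%-2d -> %.0f%% of effort adds value\n",
         multiplier, value_share(0.4, multiplier) * 100)
end
```

Under this reading, the share of effort that adds value drops from 60% to roughly 23% at 5 times the transaction cost, and to roughly 13% at 10 times.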

Conclusion

Reducing the batch sizes in your project is like 'draining the swamp'. When you do it, you will see the non value adding transaction costs that were lurking below the surface. Find ways to reduce those costs, and then drain the swamp some more.

P.S.

Most Agile teams today almost certainly have higher transaction costs than early XP teams. Agile/SCRUM/XP is going backwards in my opinion and missing the point. To be successful, software teams need to generate value by doing value-added things, not unnecessary transaction-cost work. The Agile Manifesto says that we value individuals and interactions over processes and tools. Ok, sounds good. But why? How does that generate more value? Agile lost its focus, and I like the idea of using the science behind Don's work to bring it back.

The ideas behind the original Agile Manifesto were hijacked by the people who sell transaction cost. Don't believe me? Just google Agile and see the list of tool developers and consultants that come up. Go to Amazon and look at the books on the subject. How many of them are about how to remove a transaction cost? How many are about adding a transaction cost?

I'm starting to think that the Agile movement has had its time in the spotlight. So what's after Agile? I am going to keep an eye on Don Reinertsen and his friends over at the Lean Software Consortium (but I do hope that they change the name).

Tuesday, June 23, 2009

My Software Principles

Everything
# Focus on simplicity and the ability to change.

Shipping
# Software that isn't being used, whether not yet deployed or no longer in use, is a negative contributor to your goals.
# Users know more about what they want when they have software 'in their hands'.
# Measure success by the willingness of the users/stakeholders to spend time with you explaining what they want.
# Get really good at deploying and monitoring, and do it as much as possible.

Design
# Use building blocks to get where you need to go.
# Wrap building blocks with different access layers to make them useful to more things.
# Don't expose the implementation.
# Group together things that change for the same reason, separate things that don't.

Coding
# Make your code, build, test cycle as short as possible.
# Don't Repeat Yourself.
# Simple is better. Less is more.

Testing
# Unit Test (White Box) the important parts as you are building them.
# Regression Test (Black Box) those parts, and everything else.
# Consider the efficiency of your tests. How useful are they at catching problems? How much do they get in your way?

Tuesday, February 3, 2009

Professionalism

I have a new way to think about professionalism in Software Development. I like to contrast it with something that I am not a professional at. The best examples that I can think of are the various 'projects' that I undertake on my house that I could alternatively hire someone from a particular trade to do. This might be a general contractor, plumber, electrician or even a painter. I like working on my house myself, so generally I try not to hire anyone else. On average I am pretty good at this stuff, but I am definitely not a professional.

I see the differences as follows:


1. Depending on what it is, the end product usually isn't quite as good
2. It almost always takes me longer
3. I make more of a mess while doing it

I attribute these to:

1. I don't do this full time, so I don't have as much experience or practice
2. I don't have the tools or the process dialed in
3. I don't have the organization
4. I don't have a client


So in conclusion, I see the differences between someone who is good at something and someone who is a professional as follows: the professional has experience and is able to continually practice their craft, has dialed in tools and process, and is well organized towards the task at hand. Having a caring, passionate client is also very important, and it helps if that person is not you. I find that sometimes I cut corners when I am the client that I wouldn't if I was working for someone else.

Tuesday, December 2, 2008

Generators with Continuations

I ran into the Generator concept recently and saw that they could be implemented with continuations. Ruby supports continuations but I have never really understood them or found a use for them. There is a Prime number example in Python on the Generators wiki page so I thought I would take a crack at implementing it in Ruby.

I couldn't quite get my head around it so I first built a linear (non continuation) version. Here it is.

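class Integer
  # True when nothing between 2 and self-1 divides self.
  # Note: 1 has no such divisors, so 1.prime? is true here.
  def prime?
    divisible_by.empty?
  end

  # Divisors of self; 1 and self are left out unless requested.
  def divisible_by( include_one_and_self=false )
    range = include_one_and_self ? (1..self) : (2..self-1)
    range.select{ |x| self % x == 0 }
  end
end

class LinearPrime
  def initialize
    @last = 0
  end

  # Advance to the next number that passes prime? and return it.
  def next
    @last += 1
    @last += 1 while !@last.prime?
    @last
  end
end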

>> p = LinearPrime.new
=> #
>> (0..10).collect{ p.next }
=> [1, 2, 3, 5, 7, 11, 13, 17, 19, 23, 29]

Simple stuff, but how would it be done with continuations? Some googling got me to a chapter in the book 'The Ruby Way' which included a generator implementation from Jim Weirich. Here is the Prime number generator that I came up with based on Jim's generator class.
 
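require 'continuation' # needed for callcc on Ruby 1.9 and later; omit on 1.8

class Generator
  def initialize
    do_generation
  end

  # Save the caller's position, then jump into the generator.
  def next
    callcc do |here|
      @main_context = here
      @generator_context.call
    end
  end

  private

  # Capture the starting point of the generator and return at once;
  # generating_loop only runs when next first resumes this continuation.
  def do_generation
    callcc do |context|
      @generator_context = context
      return
    end
    generating_loop
  end

  # Save the generator's position, then hand value back to next.
  def generate(value)
    callcc do |context|
      @generator_context = context
      @main_context.call(value)
    end
  end
end

class ContinuationPrime < Generator
  # Runs forever, handing each number that passes prime? to generate.
  def generating_loop
    number = 1
    loop do
      number += 1 while !number.prime?
      generate( number )
      number += 1
    end
  end
end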


Much more complicated than the LinearPrime example, but it did finally get me to understand continuations. Here is the usage for ContinuationPrime.


>> p = ContinuationPrime.new
=> #<ContinuationPrime:0x3376d8 @generator_context=#<Continuation:0x33769c>>
>> (0..10).collect{ p.next }
=> [1, 2, 3, 5, 7, 11, 13, 17, 19, 23, 29]

For those new to continuations, here is how it works.

  1. Initializer calls do_generation which first creates a new continuation (callcc) and assigns it to @generator_context. It then exits the method because it hits a return from inside the block, bypassing the call to generating_loop.

  2. Calling next on the new instance creates a new continuation and assigns it to @main_context. It then calls the @generator_context that was captured in the previous step.

  3. The do_generation method continues after the block that created the @generator_context, invoking the generating_loop method (implemented in the ContinuationPrime subclass).

  4. The generating_loop method only gets called once. It initializes the candidate number to 1 and then goes into a forever loop which increments the number until it is a prime number. When it has a prime, it passes it to the generate method in the superclass.

  5. The generate method creates a new continuation and assigns it to @generator_context and then resumes the @main_context that was created in the next method, passing to it the prime number that was generated.

  6. The next method continues after the block that created the @main_context, but first the call to callcc returns with the value that was passed to it when the continuation was resumed. The call to callcc is the last expression in the next method, so its return value is also the return value of next.



Subsequent calls to next take a slightly different path because the @generator_context is now a continuation from generate instead of do_generation.
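For comparison only, the same stream of numbers can be produced with Ruby's built-in Enumerator, which suspends and resumes the block for you when used through its next method. This swaps in a different technique from the continuation-based Generator above. The helper names no_proper_divisors? and first_primes are hypothetical, and the divisor check mirrors the Integer#prime? definition above, so it also lets 1 through.

```ruby
# Same check as Integer#prime? above: no divisors strictly between 1 and n.
def no_proper_divisors?(n)
  (2..n - 1).none? { |x| (n % x).zero? }
end

# Builds an Enumerator and pulls values from it one at a time with next.
def first_primes(count)
  primes = Enumerator.new do |yielder|
    number = 1
    loop do
      number += 1 until no_proper_divisors?(number)
      yielder << number # hands the value to the caller of next
      number += 1
    end
  end
  Array.new(count) { primes.next }
end

p first_primes(11)
```

This prints the same eleven values as the irb session above: 1, 2, 3, 5, 7, 11, 13, 17, 19, 23 and 29.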

Still confused?

There were two things that were tripping up my understanding that might be useful to reiterate. The first is the exact point of continuation, which seems obvious now but was confusing initially. To be clear, the entire block passed to callcc is executed during the initial invocation, and the continuation starts directly after the call to callcc.
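A small sketch of that first point, assuming Ruby 1.9 or later (hence the require). The method name trace_continuation and the state hash are hypothetical and exist only to record the order of events.

```ruby
require 'continuation' # needed on Ruby 1.9 and later; omit on 1.8

# Records what runs on the initial pass and what runs again on resume.
# State is kept in a Hash so that changes are visible after the resume.
def trace_continuation
  state = { events: [], resumed: false, continuation: nil }
  callcc do |c|
    state[:continuation] = c
    state[:events] << "inside block" # runs on the initial invocation only
  end
  state[:events] << "after callcc"   # runs initially and again after the resume
  unless state[:resumed]
    state[:resumed] = true
    state[:continuation].call        # jump back to just after callcc
  end
  state[:events]
end

p trace_continuation
```

The result is ["inside block", "after callcc", "after callcc"]: the block body appears once, while the code after callcc appears twice. Recent Ruby versions may print a warning that callcc is obsolete.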

The other thing that was confusing was the parameter passed to the call method on the continuation when invoking it. The parameter becomes the return value of callcc when the continuation happens. This is partly confusing because the return type is different depending on whether callcc is returning initially or resuming from a continuation. On the initial call (to capture the continuation), the callcc method returns the last value of the block. On the continuation call, callcc returns the value that was passed to the continuation's call method. The code below illustrates this difference.

>> def foo
>> @v = callcc{ |c| c }
>> puts "foo has #{ @v }"
>> end
=> nil
>> foo
foo has #<Continuation:0x34bb10>
=> nil
>> @v.call( "value on continuation" )
foo has value on continuation
=> nil

For this reason, I would suggest assigning the continuation to a variable only from within the callcc block, and using the return value of callcc only for values that are meaningful when the continuation is resumed. For example:

>> def foo
>> value = callcc{ |c| @continuation = c; "initial value" }
>> puts "foo has #{ value }"
>> end
=> nil
>> foo
foo has initial value
=> nil
>> @continuation.call( "resuming value" )
foo has resuming value

Sunday, October 5, 2008

Me Meme



1. Take a picture of yourself right now.
2. Don’t change your clothes, don’t fix your hair…just take a picture. (should be super-easy with Photobooth)
3. Post that picture with NO editing.
4. Post these instructions with your picture

Tuesday, July 22, 2008

Agile Measure

How long does it take to upgrade a fundamental part of your system?


For example, say you are on Java 1.5. How long would it take to go to Java 1.6?

Specifically, are there special cases that you need to concern yourself with? Can you upgrade this but not that? If you change this will you have to change that?

How long will it take you to roll out this change? Is it a fairly automated process or will it take a lot of manual work?

How long will it take you to test the entire system (each environment) with the change?

Technical debt and resistance to change are the enemies of software projects. Use this case, whether real today or hypothetical for the future, to slim down your project and make it run faster.

Friday, May 9, 2008

Lollapalooza 2008




The lineup this year is amazing!

Here is a list of bands that are playing that I wouldn't hesitate to buy tickets for if they came into town:

Radiohead
The Raconteurs
Gnarls Barkley
Bloc Party
Mark Ronson
Booka Shade
Louis XIV
The Ting Tings
Foals

And here are a bunch more that I can't wait to see:

Kanye West
Nine Inch Nails
Lupe Fiasco
The National
G. Love & Special Sauce
The Go! Team
The Whigs
The Kills

Can't wait...