Friday, April 29, 2016

Understanding Cost of Delay (Part 4): WSJF - the "divisor"

Note: terms in boldface are defined in the Glossary of Essential Kanban Condensed, which is available here. For the background to this piece, check out these previous posts:
Part 1: Understanding Cost of Delay and its Use in Kanban
Part 2: Delay Cost and Urgency Profiles
Part 3: How to Calculate WSJF
Part 4:
In Part 3 we established why the factor used for prioritising work items is urgency divided by the development delay (U/D). The item to be done first should have the highest value for this term (sometimes referred to as the "wisjif" or CD3). Urgency is the rate of decay of the business value (the Delay Cost per week), and we must estimate both the business value and the Delay Cost Profile to derive it. In this post, however, we focus on the other variable. What is the appropriate value to use for D?

OK I'm going to tell you my conclusion before looking at why. It's a surprising conclusion (at least for me). My conclusion is that you should use "size", or a proxy for size like the estimated number of "user stories" in the work item, rather than the period of time before the item is released (Customer Lead Time). Mmm... if that's surprising to you (or if you've no idea why it might be surprising) read on!

Why use "size" rather than Customer Lead Time in WSJF?

To me the "first-glance" obvious answer to the question "What is D?" is Customer Lead Time. The business value is not realised until the item is delivered and "live". So the delay we are talking about is the time from the decision to implement (known as the commitment point in Kanban) to the release date; in other words, the Customer Lead Time. Some people have suggested that an estimate of the "size" of the item in some units (such as number of stories or story points) is an effective proxy for Lead Time. In fact it is a very poor proxy for this. (See for example Ian Carroll's blog [6] looking at correlation between size and Lead Time. The correlation is very weak, possibly non-existent.) The reason for this is low Flow Efficiency - the ratio of time working on an item to elapsed time. If Flow Efficiency is a single-figure percentage (typical for many teams), most of the elapsed time is spent waiting rather than working, so it is not surprising that size does not correlate well with Lead Time. Therefore we can't use size as a proxy for Lead Time. So why did I conclude that size is the correct divisor for wisjif?

Let's go back to the derivation of WSJF in the previous article (How to Calculate WSJF). The assumptions we used were that: the urgency was constant over the period of interest; and importantly, that the team's WiP limit was 1. Basically we assumed the second feature had to wait until the first feature had been delivered before we started on it. In these circumstances the delay is equal to Customer Lead Time - both for the wait until benefit occurs and for how long the previous item holds up the product team before it can start the next item. In reality these are two different wait times - provided that the WiP limit is allowed to be greater than one. The delay before benefit occurs is still the Customer Lead Time (let's call this T), but the team is held up by less than the Customer Lead Time - they can work on another work item while the first item is held up by a blocker or waiting for release. This is a much more realistic assumption than WiP=1 provided what we are talking about is a product feature, not a project.

This change in assumption changes the equation for the value realised by implementing item 1 followed by item 2. In the previous article we found this to be:
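Written out with the symbols from Part 3 (V for the value of an item, U for its urgency, D for its duration, subscripts 1 and 2 for the two items, and a WiP limit of 1), the value of doing item 1 and then item 2 can be expressed as below. This is reconstructed from the description in Part 3, so the exact layout is an assumption:

$$
\text{Value}_{1 \rightarrow 2} = (V_1 - U_1 D_1) + \bigl(V_2 - U_2\,(D_1 + D_2)\bigr)
$$

The first item is delayed only by its own duration, while the second is delayed by both durations.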

Now we are considering that the time during which the team is held up is different from, and shorter than, the time before the value is realised. Let's say the teams working on this product have capacity to deliver "stories" at an average rate of C stories per week, and that the estimated numbers of stories in the two work items are s1 and s2.

So the amount of time that the second item is held up by the first item is s1/C. The rest of the Customer Lead Time, T, is waiting time - let's call that w. So...
The value realised from item 1 followed by item 2 is now seen to be:
Again subtracting this same formula with the order of the items reversed (and seeing most of the terms cancel out), gives us the difference in value between the alternative orderings, as:
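As a sketch of the algebra under the assumptions above, write $T_i = s_i/C + w_i$ for the Customer Lead Time of item $i$. The per-item waiting times $w_1$ and $w_2$ are notation assumed here (the text simply calls the waiting time w), and each $w_i$ is taken to be the same whichever order is chosen.

$$
\begin{aligned}
\text{Value}_{1 \rightarrow 2} &= V_1 - U_1\left(\frac{s_1}{C} + w_1\right) + V_2 - U_2\left(\frac{s_1}{C} + \frac{s_2}{C} + w_2\right) \\
\text{Value}_{2 \rightarrow 1} &= V_2 - U_2\left(\frac{s_2}{C} + w_2\right) + V_1 - U_1\left(\frac{s_2}{C} + \frac{s_1}{C} + w_1\right) \\
\text{Value}_{1 \rightarrow 2} - \text{Value}_{2 \rightarrow 1} &= \frac{U_1 s_2 - U_2 s_1}{C} = \frac{s_1 s_2}{C}\left(\frac{U_1}{s_1} - \frac{U_2}{s_2}\right)
\end{aligned}
$$

The values and the waiting times cancel, and since $s_1 s_2 / C$ is positive, the difference is positive exactly when $U_1/s_1 > U_2/s_2$.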
We can see from this formula that it is the term urgency divided by size for the 2 items (U/s) that determines which order is best. We do not need estimates of lead times for the items to find the optimum order for the work items.
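The same conclusion can be checked numerically. The following is a minimal sketch, not the author's method: the item names, urgencies, sizes, waiting times and capacity are all hypothetical, and the model assumes constant urgency and a waiting time that does not depend on the order. It compares every possible order against the order given by urgency divided by size.

```python
from itertools import permutations

# Hypothetical work items: urgency U (value lost per week), size s (stories),
# and waiting time w (weeks of Customer Lead Time not spent being worked on).
ITEMS = {
    "A": {"U": 10.0, "s": 20, "w": 6.0},
    "B": {"U": 4.0, "s": 4, "w": 1.0},
    "C": {"U": 9.0, "s": 12, "w": 3.0},
}
CAPACITY = 4.0  # C: stories delivered per week (assumed)


def total_delay_cost(order, items=ITEMS, capacity=CAPACITY):
    """Sum of U * (weeks until release) when items are started in `order`."""
    elapsed = 0.0  # weeks of team capacity consumed so far
    cost = 0.0
    for name in order:
        item = items[name]
        elapsed += item["s"] / capacity            # team is held up for s / C
        cost += item["U"] * (elapsed + item["w"])  # value is delayed by s / C + w
    return cost


def order_by_urgency_over_size(items=ITEMS):
    """Names sorted by U / s, highest first."""
    return sorted(items, key=lambda n: items[n]["U"] / items[n]["s"], reverse=True)


# Minimising total delay cost is the same as maximising realised value,
# because the undelayed values V do not depend on the order.
best_by_search = min(permutations(ITEMS), key=total_delay_cost)
best_by_wsjf = order_by_urgency_over_size()
print(best_by_search, best_by_wsjf)
```

Changing the `w` values changes the total cost of every order by the same amount, so it does not change which order is best. That is the sense in which lead time estimates are not needed for the ordering.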

See important note on assumptions below.

What if the "urgency" is not a constant?

What about the other important assumption in the simple WSJF formula - that the Urgency (Delay Cost per week) is a constant? In general, Urgency is not constant for work items over the whole period that there is still value in implementing them. However this does not matter if the Urgency is constant during the period that the competing items to be ordered will be implemented. In this case we can just go ahead and use the formula.

For "Fixed Date" items the formula is not appropriate. The determinant for when Fixed Date items should be started is the "last responsible moment", taking into account uncertainty in Customer Lead Time, and the degree of risk that is acceptable to the customer. The determinant for whether Fixed Date items should be started is the total value of the item, compared with the loss of value that occurs by delaying the next highest item to be prioritised. Usually we can just start Expedite items immediately and Fixed Date items before the last responsible moment without the need for estimation or calculation, making WSJF important only for the ordering of Standard items.
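One way to read the "last responsible moment" for a Fixed Date item is as the deadline minus a Customer Lead Time long enough to meet the acceptable level of risk. The sketch below assumes (this is not from the article) that a sample of historical lead times is available and that risk is expressed as the acceptable fraction of cases in which the item would be late. The function name and all figures are hypothetical.

```python
import math
from datetime import date, timedelta


def last_responsible_start(deadline, lead_times_days, acceptable_risk):
    """Latest start date such that at most `acceptable_risk` of the sampled
    lead times would overrun the deadline (an assumed, empirical reading
    of the "last responsible moment")."""
    if not lead_times_days:
        raise ValueError("need at least one historical lead time")
    if not 0 <= acceptable_risk < 1:
        raise ValueError("acceptable_risk must be in [0, 1)")
    ordered = sorted(lead_times_days)
    # Number of sampled lead times that must still finish on time (at least one).
    needed = max(1, math.ceil((1 - acceptable_risk) * len(ordered)))
    return deadline - timedelta(days=ordered[needed - 1])


# Hypothetical history of Customer Lead Times, in days.
history = [20, 25, 30, 35, 40, 50, 60, 90]
start_by = last_responsible_start(date(2016, 9, 1), history, acceptable_risk=0.25)
print(start_by)
```

A lower acceptable risk pushes the start date earlier, which matches the point that uncertainty in Customer Lead Time has to be taken into account.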

Intangible items would not be selected at all if we only applied the WSJF formula, since their immediate urgency is low. Nevertheless it is helpful to always include some Intangible items in the schedule for flexibility (if customer SLAs are threatened), and for preparation for future events. Policies around the use of Intangible items can be tuned to the business context and strategy.

In the next blog in this series we will consider some conclusions from this analysis of Cost of Delay and WSJF. Why is it important? When is it applicable? What to do when it is not applicable or not calculable?

Important Note on Assumptions (continuous delivery or batch)

The algebra above assumes that the waiting time part of the delay (w) does not vary for a given work item, regardless of whether it is implemented before or after another item. This is probably a reasonable assumption if these are "features" which are released as soon as they are completed, in some kind of continuous delivery process. If a batch delivery process is used (e.g. a release every 2 months), the delay is identical for features in the same batch. It would be wasteful to use WSJF for features within the same batch. The issue is which batch the feature is put in and this analysis probably needs to be more sophisticated (or simply qualitative rather than quantitative - see for example [7]) - to ensure items are in the right batch.

Wednesday, April 27, 2016

Understanding Cost of Delay (Part 3): Calculating WSJF

In part one of this series of blogs on Understanding Cost of Delay and its Use in Kanban, we considered the meaning of, and the difference between, Delay Cost and Urgency (or Cost of Delay). In part two we looked at different Delay Cost and Urgency Profiles and the archetypes defined in Kanban for classifying work items by these profiles. Now we look at the prioritisation/ordering technique known as Weighted Shortest Job First (WSJF): the formula, the assumptions behind it and how the formula arises. WSJF brings the primacy of time into decision making about which item to implement and when.

Consider a product development team. They have many ideas for what to add or change in the product, and for improving the way they work. The question is, which of these many useful things should be done first? It turns out that the total business value of a proposal is not the deciding factor in maximising the business value a team can deliver in a given period; nor is it the urgency of the proposal (the Delay Cost per unit of time). The deciding factor is the urgency divided by the duration of implementation, a term sometimes referred to as the WSJF (or "wisjif") of the item.

To see why, let's consider 2 work items, each with a total cumulative value of V, a duration of D, and an urgency of U. Suffixes will indicate which of the 2 work items is being referred to. Assuming the WiP limit in our team is 1 (so the team does only 1 feature at a time), and assuming the urgency, U, is a constant over the period of interest, the estimated value realized by the 2 features will be:
 Total value arising from implementing Item 1 followed by Item 2 (for more information see Essential Kanban Condensed)
This is the net present value of the two items less each item's delay cost. In the case of the first item, the delay is just its own duration, but in the case of the second item, it must wait for the first item as well. If we want to know whether it is better to do item 1 first or item 2, we need to know which has the higher urgency (Delay Cost per time period). We can visualise the delay cost like this... it is the total area in these graphs.
 Feature 1 then Feature 2
 Feature 2 then Feature 1

Switching the terms over in the formula above, and subtracting, gives us the difference in value realised by changing the order. Most of the terms cancel out, but we are left with the following, for the additional benefit (cost if negative) of doing item 1 before item 2.
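Spelling the subtraction out, with the value expression taken from the description above (each item's value less its urgency multiplied by its delay), gives the following sketch:

$$
\begin{aligned}
\text{Value}_{1 \rightarrow 2} &= (V_1 - U_1 D_1) + \bigl(V_2 - U_2\,(D_1 + D_2)\bigr) \\
\text{Value}_{2 \rightarrow 1} &= (V_2 - U_2 D_2) + \bigl(V_1 - U_1\,(D_1 + D_2)\bigr) \\
\text{Value}_{1 \rightarrow 2} - \text{Value}_{2 \rightarrow 1} &= U_1 D_2 - U_2 D_1 = D_1 D_2\left(\frac{U_1}{D_1} - \frac{U_2}{D_2}\right)
\end{aligned}
$$

Since $D_1 D_2$ is positive, doing item 1 first is the better order exactly when $U_1/D_1 > U_2/D_2$.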
This gives us the basis of WSJF. To maximise business value delivered by the team, we should prioritise the items which have the highest value for urgency divided by duration. The "wisjif" term may thus be expressed as:

WSJF = U / D
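A small worked example may help here. It is a sketch under the same assumptions as the text (WiP limit of 1, constant urgency), and the item figures are hypothetical. Item 2 has both the higher value and the higher urgency, yet item 1 has the higher WSJF and so should go first.

```python
def realised_value(order):
    """Total value for items done one at a time (WiP limit of 1).

    Each item is a dict with V (value if delivered with no delay),
    U (urgency, value lost per week) and D (duration in weeks).
    """
    elapsed = 0.0
    total = 0.0
    for item in order:
        elapsed += item["D"]  # each item also waits for all earlier ones
        total += item["V"] - item["U"] * elapsed
    return total


def wsjf(item):
    """Urgency divided by duration."""
    return item["U"] / item["D"]


# Hypothetical numbers, for illustration only.
item1 = {"name": "item 1", "V": 100.0, "U": 5.0, "D": 4.0}
item2 = {"name": "item 2", "V": 300.0, "U": 6.0, "D": 8.0}

# Benefit of doing item 1 before item 2 (a cost if negative).
gain = realised_value([item1, item2]) - realised_value([item2, item1])
ranked = sorted([item1, item2], key=wsjf, reverse=True)
print(gain, [item["name"] for item in ranked])
```

The computed `gain` equals U1 × D2 − U2 × D1 for these figures, which is the quantity that decides the order.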

In the next article in this series we will look at whether the duration used in this formula should be Customer Lead Time, System Lead Time or something else. We will also consider the assumptions behind the WSJF formula. This will lead us to suggest how the formulae can be used in practice, in conjunction with delay cost profiles for different categories of items.

Read part 4 now: WSJF - Should you divide by Lead Time or by Size?

Back to part 1: Understanding Cost of Delay and its Use in Kanban

Monday, April 25, 2016

Understanding Cost of Delay (Part 2): Delay Cost and Urgency Profiles

In part one of this series of blogs on Understanding Cost of Delay and its Use in Kanban we explored how - from understanding the business benefit that is likely to occur following the decision to implement a proposal now or later - we can derive:
1. the value flow (net benefits, positive and negative) each week through the useful life of the proposal, for a given release date
2. the change in cumulative value (Net Present Value, NPV, or life-time profits) as a function of time, for a given release date
3. the Delay Cost Profile - how much business value is lost as a function of the delay
4. the Urgency Profile (the rate at which value is lost as a function of the delay)
 Value Flow, Cumulative Value, Delay Cost and Urgency for time-sensitive feature
Note: The terms Cost of Delay (CoD), Class of Service, Delay Cost, Lead Time, NPV, work item and Urgency, as well as over 60 commonly used terms in Kanban and Lean, are defined in the Kanban Glossary in Essential Kanban Condensed [2] (currently available as a free download).

For the type of work item that was considered in part 1 (a product feature in a time-limited competitive market), the four curves are shown above: value flow, cumulative value, delay cost (as a function of the delay), and urgency (as a function of the delay).

This feature shows a diminishing rate of cost of delay (urgency), due to the twin effects of a reduced peak in earnings and reduced period of earning, the longer the feature is delayed.

What if we were examining a different type of work item which was estimated to save a certain amount of work each week, work which is currently being contracted out to external staff? In other words the same savings would occur every week for the foreseeable life of the product. Here is an estimated projection for the 4 curves in this case ...
 Value Flow, Cumulative Value, Delay Cost and Urgency for a feature providing constant benefit for a period of time
In this case the cumulative NPV is more or less a straight line (bending downwards slightly due to the present value discount), and it results in a CoD profile which is also more or less a straight line with the same gradient (bending upwards slightly). Straight line CoD profiles result in constant urgency which we can see (approximately) in the final graph in the series.

Different again - what about an item that would save a penalty fine from a regulator if a certain issue is not addressed by a fixed date? Here are the curves ...
 Cash Flow, Cumulative NPV, Cost of Delay and Urgency for a feature providing step-function in benefit at a fixed date

This work item displays a sudden step-function in cumulative NPV at the point the fine would be applied, and a similar step-function in the CoD about 10 weeks before the date of the fine, since development Lead Time is estimated to be 10 weeks. The urgency profile is a spike - no urgency up to the "last responsible moment" when work must start, and no urgency after this point since you would then have passed the "first irresponsible moment"; there is no avoiding the fine after that point! In reality the CoD and Urgency profiles should be smoother since there is uncertainty in the estimate, and leaving it to the last moment increases the risk of incurring higher costs in order to hit the date, or indeed of missing the date due to unforeseen circumstances.

 Value Flow, Cumulative Value, Delay Cost and Urgency for a feature providing constant benefit for a period beginning at a fixed date
Finally consider the case where the savings of staff (similar to the second scenario above) would not start until a fixed date. See graphs to the right.

We can see this case effectively combines the previous two, with a period of low or negative Delay Cost, followed by approximately linear Delay Cost up to the end of the opportunity.

We have taken some time here to look at the 4 curves (Value Flow, Cumulative Value, Cost of Delay and Urgency) for these 4 different types of feature because it is easy to confuse them. In the case of the "constant benefit" item, the Cumulative Value and Delay Cost look almost identical, with the same units on both axes. This has caused some confusion and some inaccurate statements about the use of cost of delay. Take care!

One of the observations to make about the graphs shown so far is that to estimate and derive them accurately for real features is difficult and error-prone. While this is true, one should not conclude from it that we should therefore estimate a completely different entity, which is easier but not well correlated with the scheduling decisions we wish to make! (Sadly, you may also come across some advice like that.)

However it does suggest that looking at archetype profiles for different types of work item may be helpful. Picking from a menu of profiles, based on the type of feature proposed (e.g. differentiator in a competitive market, cost-saver in a monopoly market, fixed date cost-saver, and so on), is much simpler than trying to derive these curves from scratch. Some tools for Kanban are already offering such features. The profile can be combined with order of magnitude estimates of value and urgency (e.g. using a series such as 1, 2, 5, 10, 20, 50, 100, 200, 500, 1000, etc.).
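As an illustration of working with such a series, here is a minimal sketch of a helper that snaps a rough estimate to the nearest member of 1, 2, 5, 10, 20, 50, and so on. The function name is hypothetical, and measuring "nearest" on a logarithmic scale is an assumption made here because the series is roughly geometric.

```python
import math


def snap_to_1_2_5(estimate):
    """Return the member of the series 1, 2, 5, 10, 20, 50, ... that is
    closest to `estimate` on a logarithmic scale."""
    if estimate <= 0:
        raise ValueError("estimate must be positive")
    exponent = math.floor(math.log10(estimate))
    # Candidates from the decade containing the estimate and the one above it.
    candidates = [m * 10 ** e for e in (exponent, exponent + 1) for m in (1, 2, 5)]
    return min(candidates, key=lambda c: abs(math.log(c / estimate)))


for rough in (7, 40, 300, 1000):
    print(rough, snap_to_1_2_5(rough))
```

Estimates of value and urgency expressed this way are coarse, but they may be enough to separate items whose urgency differs by a factor of two or more.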

Kanban [3] defines 4 archetypes with different Delay Cost Profiles which are typically used to differentiate Classes of Service. Black Swan Farming [4] also suggests some typical profiles. The Kanban archetypes do not correspond exactly to the types of feature discussed above, though there is some overlap.
 Kanban's Delay Cost Archetypes, from Essential Kanban Condensed

The archetypes show 4 Delay Cost Profiles:
1. Expedite items have very high urgency (high CoD) and there is no end in sight to the cost - if you wait, the losses don't come to an end. It's a straightforward decision - do it now!
2. The Fixed Date items also have high impact but only if you miss the deadline. The scheduling imperative here is to make sure you start before the last responsible moment and deliver before the deadline.
3. The Standard profile is approximately linear to start with and tails off or cuts off as the opportunity loses value. Standard items should therefore be done as soon as possible and scheduled relative to each other according to the degree of urgency and the item's size (see later discussion of WSJF).
4. Finally, Intangible items have an apparently low urgency. One might ask why do them? Two reasons. The intangible profile does indicate a rise in urgency - possibly a steep rise - will happen in the future. It is useful to make some progress on these items even though the impact in the short term is likely to be low. In addition having some items in the schedule which are "interruptible" makes the system more resilient in the event of expedite items having to be handled, or events which threaten the service level agreement for standard items.
So how might a workshop gather the quantified information that we need for scheduling the work item options based on cost of delay, particularly if we do not currently have a menu of items to pick from? Here's a generalised profile of cost of delay and urgency that (roughly) covers all the profiles we have discussed within the precision we could reasonably expect from the workshop.

Using this profile we can ask for 3 parameters that give enough detail for us to schedule the items. There are 2 dates (t1 and t2) and the slope of the CoD line (or urgency). Before t1 there is low or zero CoD - it's the "CoD low until date" (CLUD). After t2 there is also low or zero CoD - it's the "CoD low after date" (CLAD).
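The three parameters can be turned into a simple piecewise model. The sketch below is one possible reading, not a definitive implementation: it treats "low or zero" CoD as exactly zero, measures t1 and t2 in weeks from a reference date, and uses hypothetical function names and figures.

```python
def urgency(t, t1, t2, slope):
    """Urgency (value lost per week) at delay t: `slope` from t1 up to t2, else 0."""
    return slope if t1 <= t < t2 else 0.0


def delay_cost(t, t1, t2, slope):
    """Accumulated Delay Cost at delay t for the generalised profile.

    No cost accrues before t1 (the CLUD), cost grows linearly at `slope`
    between t1 and t2, and nothing further accrues after t2 (the CLAD).
    """
    return slope * (min(max(t, t1), t2) - t1)


# Hypothetical item: no cost for the first 4 weeks, then 10 (in $K) lost
# per week until week 30, after which the opportunity has gone.
T1, T2, SLOPE = 4, 30, 10.0
for week in (0, 10, 30, 40):
    print(week, urgency(week, T1, T2, SLOPE), delay_cost(week, T1, T2, SLOPE))
```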

Armed with this information about delay cost and urgency profiles, we can now move forward to consider the WSJF method itself. To use it we need information about the urgency, the urgency profile and the duration taken by implementation of the work item.

Read part 3 now: How to calculate WSJF

Back to part 1: Understanding Cost of Delay and its Use in Kanban

Friday, April 15, 2016

Understanding Cost of Delay and its Use in Kanban

Cost of Delay (CoD) is a vital concept to understand in product development. It should be a guide to the ordering of work items, even if - as is often the case - estimating it quantitatively may be difficult or even impossible. Analysing Cost of Delay (even if done qualitatively) is important because it focuses on the business value of work items and how that value changes over time. An understanding of Cost of Delay is essential if you want to maximise the flow of value to your customers.

Don Reinertsen in his book Flow [1] has shown that, if you want to deliver the maximum business value with a given size team, you give the highest priority, not to the most valuable work items in your "pool of ideas," not even to the most urgent items (those whose business value decays at the fastest rate), nor to your smallest items. Rather you should prioritise those items with the highest value of urgency (or CoD) divided by the time taken to implement them. Reinertsen called this approach Weighted Shortest Job First or WSJF (sometimes pronounced wizjiff!). WSJF is a variation of the Shortest Job First scheduling algorithm used in computer operating systems. By adding urgency as a weighting, both time and value contribute to the decision-making.

In this series of articles, of which this is the first, we return to the topic of Cost of Delay (previously addressed 3 years ago in Selecting Backlog Items By Cost of Delay), and how CoD can be applied in Kanban. I'll explain the terminology used in the recently published book Essential Kanban Condensed [2] - including why this differs slightly from that used by some other authors - and how you can apply this knowledge in Kanban, potentially combining it with the use of Classes of Service. In particular we will look at the simple mathematics applied in WSJF and the assumptions that make it valid or invalid in different circumstances.

Here are the links to the articles in this series:

Part 1: Understanding Cost of Delay and its Use in Kanban (this article)
Part 2: Delay Cost and Urgency Profiles
Part 3: How to Calculate WSJF
Part 4: WSJF - Should you divide by Lead Time or by "Size"?
Part 5: Others may follow...
Let's start with some definitions, by looking at a particular work item, a proposal for a new feature in a software product. Let's assume that we've already carried out some analysis of this feature and the competitive market in which the product operates. As a result we can forecast the value flow - in this case the net profit each week - that will result from the implementation and exploitation of the feature.

Here's what the weekly business benefit graph looks like...

To know what the Cost of Delay is for this feature we need to estimate what the business benefits would be if we delayed starting this work and instead started in, say, 10 weeks' or 20 weeks' time. Here's a comparison of these 3 different cash flows, with no delay, 10 weeks' delay and 20 weeks' delay.

The analysis seems to be forecasting that not only will the peak revenue be lower by entering the market later, the time period for exploiting the feature profitably is also shorter. To see the effect of this on the overall value of the feature (calculated as a net present value or alternatively total life-time profits), it is useful to plot a cumulative value graph, see below...

Now we can see what the value of this feature is if it is implemented without delay - about \$420K. We can also see the loss of value - the Delay Cost - for a 10-week and a 20-week delay.

The next step is to plot the Delay Cost against the length of the delay. This graph is referred to as the Delay Cost Profile. There are a number of archetypes that different authors have identified [3, 4] that can help us identify the likely profile in given scenarios. We'll look at these in more detail in Part 2 of this series. Here's the Delay Cost Profile for our feature:
This shows our feature is losing value most rapidly right now! As value is lost so the rate at which value is lost is also diminishing. At a certain point the projected profit from the feature becomes less than the development cost so there is no value in implementing the feature and no further Delay Cost.

We refer to the rate at which value is lost as Urgency or Cost of Delay (the first derivative of Delay Cost). It is important when reviewing materials on CoD - particularly when looking at graphs plotted against time - to clarify whether the term referred to is Delay Cost measured in currency (e.g. \$) or Urgency/Cost of Delay (measured in currency per length of delay, e.g. \$ per week). It is unfortunate that the 2 terms commonly used here - Delay Cost (\$) and Cost of Delay (\$ per week) - are so close in natural meaning (translators have nightmares!). For this reason the Kanban glossary suggests an alternative term for the rate at which value and Delay Cost change - Urgency.
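To make the distinction between the two quantities concrete, here is a minimal sketch that derives a Delay Cost Profile and an approximate Urgency Profile from the value of a feature at different delays. Only the 420 (in $K) for no delay comes from the example above. The other values are hypothetical, chosen to give a diminishing profile, and the function names are illustrative.

```python
# Value of the feature (in $K) for different release delays (in weeks).
# Only the 420 for "no delay" is from the article; the rest is hypothetical.
VALUE_BY_DELAY = {0: 420.0, 10: 250.0, 20: 130.0, 30: 60.0, 40: 20.0}


def delay_cost_profile(value_by_delay):
    """Delay Cost ($K): value with no delay minus value at each delay."""
    baseline = value_by_delay[0]
    return {d: baseline - v for d, v in sorted(value_by_delay.items())}


def urgency_profile(cost_by_delay):
    """Average Urgency ($K per week) over each interval, keyed by the start
    of the interval: a finite-difference approximation of the first
    derivative of Delay Cost."""
    points = sorted(cost_by_delay.items())
    return {
        d0: (c1 - c0) / (d1 - d0)
        for (d0, c0), (d1, c1) in zip(points, points[1:])
    }


costs = delay_cost_profile(VALUE_BY_DELAY)
urgencies = urgency_profile(costs)
print(costs)
print(urgencies)
```

Note the units: `costs` is in $K, while `urgencies` is in $K per week.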

When Urgency is plotted against the date of availability (or the length of delay, if starting from a reference date), the graph is referred to as the Urgency Profile.

Here is the plot of the Urgency Profile for our example:

We can see from this graph that Urgency is diminishing in this case, as the market opportunity is also disappearing. Reinertsen and Preston Smith [5] noted that the sense of urgency in organisations often runs in the opposite direction to the market opportunity - they named it the Urgency Paradox, the "cruel tendency" for this sense of urgency in product development to be highest when the real urgency, as reflected by market opportunity, is lowest and vice versa.

We will see in future articles in this series how different kinds of work item have different Delay Cost and Urgency profiles, and how we can use this with WSJF to help the scheduling of work to maximise the delivery of business value. We'll also examine the degree to which a quantified approach (using numerical estimates of business value and its rate of decay) can be used in practice, and whether alternative approaches such as the use of profile archetypes with scaling factors can be as or more effective.
Note: This article has been updated since its first publication to be consistent with the latest version of the glossary in Essential Kanban Condensed [2].

Now read part 2: Delay Cost and Urgency Profiles

The Cost of Delay Series:

Part 1: Understanding Cost of Delay and its Use in Kanban (this article)
Part 2: Delay Cost and Urgency Profiles
Part 3: How to Calculate WSJF
Part 4: WSJF - Should you divide by Lead Time or Size?

References

[1] Donald G. Reinertsen. The Principles of Product Development Flow, (United States: Celeritas Publishing. 2009)

[2] David J. Anderson and Andy Carmichael, Essential Kanban Condensed. (United States: Lean Kanban University Press. 2016)

[3] David J. Anderson. Kanban: Successful Evolutionary Change for Your Technology Business (United States: Blue Hole Press, 2010)

[4] Joshua Arnold and Özlem Yüce. “Using Cost of Delay: Experience Report – Maersk Line.” Black Swan Farming. (2013)

[5] Preston G. Smith and Donald G. Reinertsen. Developing Products in Half the Time. (United States: John Wiley and Sons. 1998)

[6] Ian Carroll. “No Correlation Between Estimated Size and Actual Time Taken.” IanCarroll.com. (2016)

[7] Joshua Arnold. "Qualitative Cost of Delay." Black Swan Farming. (2016)

Thursday, March 17, 2016

Kanban's Survivability Agenda and Antifragility

A conversation on the kanbandev online forum has triggered this post. The discussion concerns how evolutionary change is applied, particularly when the fitness landscape is changing to such a degree that large rather than small steps are needed to survive in the new competitive environment. It got me thinking that we must consider evolutionary change on more than one level if we want to address what the Kanban Method calls its Survivability Agenda.

The first mystery to consider is how evolution jumps across valleys in the fitness landscape. Seems to me there are 3 possibilities. You could make large leaps in what you think are promising directions. Doesn't sound a great idea because you're doing reasonably well as you are (different if you know you face an imminent existential threat, but it has the same likely outcome). You could wait for the peak you're climbing to decline in the fitness landscape, to the stage when small steps will move you off it. That's probably going to be too late. Or you rely on diversity. Your peak may be declining and you may - if trends continue - be doomed, but others are in better spaces and they will grow.

The final option sounds like disaster. But I think it is the way evolution works. Processes and technologies evolve much faster than biological organisms (see Eric Beinhocker's Origin of Wealth for more discussion of this) because the cycles of copying with differences, selection, amplification/damping are much shorter. Not only that, they are accelerating, which is what is now so threatening to large organisations. Does this mean large organizations must sit back and let the inevitable happen? Of course not. The key is to have multiple fragile parts, so the organization itself is more antifragile.

In Antifragility N N Taleb discusses how hierarchies can gain antifragility by allowing fragility within them. And also how natural antifragility can be irresponsibly eroded, if higher structures in the hierarchies (like governments) absorb the fragility of structures within them (like banks). Back to the Dinosaurs - they were antifragile as a species (genus? - I don't know; biologists please excuse) to most changes in the fitness landscape less than massive climate change. But since that was the limit of their antifragility they died out. But the higher level in the hierarchy (life on earth) survived (just) because there were some funny rat-like creatures running around scratching a living beneath the dinosaurs' feet. They found themselves in the foothills of some pretty small peaks of the fitness landscape, and made the most of it.

So the levels in the hierarchy of Kanban (e.g. Personal, Team, Product, Portfolio) and its stress on the exploitation of real options, are keys to its "Survivability Agenda".* Portfolio management is key. It is where antifragility of the organisation can be built or lost. Portfolio Management must decide what level of investment different products and product ideas receive, and for how long before the return must be tangible. In a stable fitness landscape they might consider that the one successful product they have should receive all the investment. This builds a monoculture which is vulnerable to shifts in the landscape. Keeping options has a cost but preserves the antifragility at the higher scale. Diversity within the organization and a culture which encourages innovation, learning and experimenting will build greater survivability. Note that in part this is because it tolerates and encourages more fragile technologies and processes within it. They are limited in their ability to survive - indeed they need to maintain the differences from more successful instances, precisely so that diversity is preserved. Eric Beinhocker has an excellent account of Microsoft's use of options when developing Windows. They also had teams investing in OS/2, Apple and Unix. Clearly it would not have been helpful if the Unix team, say, thought the OS/2 option was better and started working on that instead of Unix.

In summary, I don't think Kanban provides any magic bullets here. Hopefully it exposes the issues in building resilient or antifragile organisations but it is down to the strategists, managers and leaders within these organisations as to how the tools and insights might be applied. Different groups make different choices. There is no recipe. That, in my opinion, is why it remains one of the most interesting and important methods around.