Friday, April 29, 2016

WSJF - Should you divide by Lead Time or Size?

Understanding Cost of Delay (Part 4): WSJF - the "divisor"

This article is the fourth in a series of blogs on Cost of Delay (CoD) and Weighted Shortest Job First (WSJF).

Note: terms in boldface are defined in the Glossary of  Essential Kanban Condensed which is available hereTo get the background to this piece check out these previous posts:
Part 1: Understanding Cost of Delay and its Use in Kanban
Part 2: 
Cost of Delay Profiles
Part 3: 
How to Calculate WSJF
Part 4: 
WSJF - Should you divide by Lead Time or Size? (this article)
Part 5: Others may follow
In Part 3 we established why the factor used for prioritising work items is urgency divided by the development delay (U/D). Whichever item has the highest value for this term (sometimes referred to as the "wisjif") should be done first. Urgency is the rate of decay of the business value (the Cost of Delay per week) and we must estimate both the business value and CoD profile to derive this. In this post however we focus on the other variable. What is the appropriate value to use for D?

OK I'm going to tell you my conclusion before looking at why. It's a surprising conclusion (at least for me!). My conclusion is that you should use "size", or a proxy for size like the estimated number of "user stories" in the work item, rather that the period of time before the item is released (Customer Lead Time). Mmm... if that's surprising to you (or if you've no idea why it might be surprising) read on!

Why use "size" rather than Customer Lead Time in WSJF?

To me the "first-glance" obvious answer to the question "What is D?" is Customer Lead Time. The business value is not realised until the item is delivered and "live". So the delay we are talking about is the time from the decision to implement (known as the commitment point in Kanban) to the release date; in other words, the Customer Lead Time. Some people have suggested that an estimate of the "size" of the item in some units (such as number of stories or story points) is an effective proxy for Lead Time. In fact it is a very poor proxy for this. (See for example Ian Caroll's blog [6] looking at correlation between size and Lead Time. The correlation is very weak, possibly non-existent!) The reason for this is low Flow Efficiency - the ratio of time working on an item to elapsed time. If Flow Efficiency is in single figures (typical for most teams) it is not surprising that size does not correlate well with Lead Time. Therefore we can't use size as a proxy for Lead Time. So why did I conclude that size is the correct divisor for wisjif?

Let's go back to the derivation of WSJF in the previous article (How to Calculate WSJF). The assumptions we used were that: the urgency was constant over the period of interest; and importantly, that the team's WiP limit was 1. Basically we assumed the second feature had to wait until the first feature had been delivered before we started on the next feature. In these circumstances the delay, is equal to Customer Lead Time - both for the wait until benefit occurs and for how long the previous item holds up the product team before it can start the next item. In reality these are two different wait times - provided that the WiP limit is allowed to be greater than one. The delay before benefit occurs is still the Customer Lead Time (let's call this T), but the team is held up by less that the Customer Lead Time - they can work on another work item while the first item is held up by a blocker or waiting for release. This is a much more realistic assumption than WiP=1.

This change in assumption changes the equation for the value realised by implementing item 1 followed by item 2. In the previous article we found this to be:

Now we are considering that the time during which the team is held up, is a different and shorter time than the time before the value is realised. Let's say the teams working on this product have capacity to deliver "stories" at an average rate of C stories per week. and that the estimated number of stories in the two work items are sand s2

So the amount of time that the second item is held up by the first item is s1/C. The rest of the Customer Lead Time, T, is waiting time - let's call that w. So...
The value realised from item 1 followed by item 2 is now seen to be:
Again subtracting this same formula with the order of the items reversed (and seeing most of the terms cancel out), gives us the difference in value between the alternative orderings, as:
We can see from this formula that it is the term urgency divided by size for the 2 items (U/s) that determines which order is best. We do not need estimates of lead times for the items to find the optimum order for the work items.

See note on assumptions below.

What if the "urgency" is not a constant?

What about the other important assumption in the simple WSJF formula - that the urgency (CoD per week) is a constant? In general, urgency is not constant for work items over the whole period that there is still value in implementing them. However this does not matter if the urgency is constant during the period that the competing items to be ordered will be implemented. In this case we can just go ahead and use the formula. 

For "Fixed Date" items the formula is not appropriate. The determinant for when Fixed Date items should be started is the "last responsible moment", taking into account uncertainty in Customer Lead Time, and the degree of risk that is acceptable to the customer. The determinant for whether Fixed Date items should be started is the total value of the item, compared with the loss of value that occurs by delaying the next highest item to be prioritised. Usually we can just start Expedite items immediately and Fixed Date items before the last responsible moment without the need for estimation or calculation, making WSJF useful only for the ordering of Standard items. 

Intangible items would not be selected at all if we only applied the WSJF formula, since their immediate urgency is low. Nevertheless it is helpful to always include some Intangible items in the schedule for flexibility (if customer SLAa are threatened), and for preparation for future events. Policies around the use of Intangible items can be tuned to the business context and strategy.

In the next blog in this series we will consider some conclusions from this analysis of Cost of Delay, Urgency and WSJF. Why is it important? When is it applicable? What to do when it is not applicable or not calculable?

Important Note on Assumptions (continuous delivery or batch)

The algebra above assumes that the waiting time part of the delay (w) does not vary for a given work item, regardless of whether it implemented before or after another item. This is probably a reasonable assumption if these are "features" which are released as soon as they are completed, in some kind of continuous delivery process. If a batch delivery process is used (e.g. a release every 2 months), the delay is identical for features in the same batch. It would be wasteful to use WSJF for features within the same batch. The issue is which batch the feature is put in and this analysis probably needs to be more sophisticated (or simply quantitative - see [7]) to ensure items are in the right batch.

Read part 5: Not yet available!

Back to part 1: Understanding Cost of Delay and its Use in Kanban

Wednesday, April 27, 2016

How to calculate WSJF

Understanding Cost of Delay (Part 3): Calculating WSJF

In part one of this series of blogs on Understanding Cost of Delay and its Use in Kanban, we considered the meaning and definitions of Cost of Delay (COD) and Urgency (U). In part two we looked at Cost of Delay Profiles and the archetypes defined in Kanban for classifying work items. Now we look at the prioritisation/ordering technique know as Weighted Shortest Job First: the formula, the assumptions behind it and how the formula arises. WSJF brings the primacy of time into decision making about which item to implement and when.

Consider a product development team. They have many ideas for what to add or change in the product, and for improving the way they work. The question is, which of these many useful things should be done first. It turns out the that the total business value of an item is not the the deciding factor in maximising the business value a team can deliver in a given period; nor is it urgency (the Cost of Delay per unit of time). The deciding factor is the urgency divided by duration of implementation, a term sometimes referred to as the WSJF (or "wisjif") of the item.

To see why let's consider 2 work items with a value of V, a duration of D, and an urgency of U. Suffices will indicate which of the 2 work items is being referred to. Assuming the WiP limit in our team is 1 (so the team does only 1 feature at a time), and assuming the urgency, U, is a constant over the period of interest, the estimated value realized by the 2 features will be:
Total value arising from implementing Item 1 followed by Item 2
For more information see Essential Kanban Condensed
This is the present value of the two items less the cost of delay. In the case of the first item, the delay is just its own duration, but in the case of the second item, it must wait for the first item as well. If we want to know whether it is better to do item 1 first or item 2, we need to know which has the higher cost of delay. We can visualise the cost of delay like this... it is the total area in these graphs.


Switching the terms over in the formula above, and subtracting, gives us the cost of delay. Most of the terms cancel out but we are left with the following, for the addition benefit (cost if negative) of doing item 1 before item 2.

This gives us the basis of WSJF. To maximise business value delivered by the team, we should prioritise the items which have a higher value for urgency divided by duration.

In the next article in this series we will look at whether the duration used in this formula should be Customer Lead Time, System Lead Time or something else. This will lead us to conclude how the formulae can be used in practice, in conjunction with the cost of delay profiles for the items.

Read part 4 now: WSJF - Should you divide by Lead Time or by Size?

Back to part 1: Understanding Cost of Delay and its Use in Kanban

Monday, April 25, 2016

Cost of Delay Profiles

Understanding Cost of Delay (Part 2): CoD and Urgency profiles

In part one of this series of blogs on Understanding Cost of Delay and its Use in Kanban we explored how - from understanding the business benefit that is likely to occur following the decision to implement a work item now or later - we can derive 
  1. the value flow (cash flow, positive and negative) through the useful life of the work item
  2. the change in cumulative value (Net Present Value, NPV) as a  function of time, 
  3. the Cost of Delay profile (how much business value is lost as a function of the delay), and 
  4. the Urgency profile (the rate at which value is lost as a function of the delay)
Note: The terms Cost of Delay (CoD), Class of Service, Lead Time, work item, NPV and Urgency (U) as well as over 60 commonly used terms in Kanban and Lean are defined in Essential Kanban Condensed (currently available as a free download). The Glossary contains over 60 commonly used terms in Kanban and Lean.

For the type of work item that was considered in part 1 (a product feature in a time-limited competitive market), here are the four curves: cash flow, cumulative value, cost of delay (as a function of the delay), and urgency (as a function of the delay) ...

Cash Flow, Cumulative NPV, Cost of Delay and Urgency
for a time-sensitive feature in competitive market
(click on image for more detail)
This feature shows a diminishing rate of cost of delay (urgency), due to the twin effects of a reduced peak in earnings and reduced period of earning, the longer the feature is delayed.

What if we were examining a different type of work item which was estimated to save a certain amount of work each week, work which is currently being contracted out to external staff? In other words the same savings would occur every week for the foreseeable life of the product. Here is an estimated projection for the 4 curves in this case ...
Cash Flow, Cumulative NPV, Cost of Delay and Urgency
for a feature providing constant benefit for a period of time

In this case the cumulative NPV is more or less a straight line (bending downwards slightly due to the present value discount), and it results in a CoD profile which is also more or less a straight line with the same gradient (bending upwards slightly). Straight line CoD profiles result in constant urgency which we can see (approximately) in the final graph in the series.

Different again - what about an item that would save a penalty fine from a regulator if a certain issue is not addressed by a fixed date? Here are the curves ...
Cash Flow, Cumulative NPV, Cost of Delay and Urgency
for a feature providing step-function in benefit at a fixed date

This work item displays a sudden step-function in cumulative NPV at the point the fine would be applied, and a similar step-function in the CoD about 10 weeks before the date of the fine, since development Lead Time is estimated to be 10 weeks. The urgency profile is a spike - no urgency up to the "last responsible moment" when work must start, and no urgency after this point since you would then have passed the "first irresponsible moment"; there is no avoiding the fine after that point! In reality the CoD and Urgency profiles should be smoother since there is uncertainty in the estimate, and leaving it to the last moment increases the risk of incurring higher costs in order to hit the date, or indeed of missing the date due to unforeseen circumstances.

Finally consider the case where the savings of staff (similar to the second scenario above) would not start until a fixed date. Here they are ...
Cash Flow, Cumulative NPV, Cost of Delay and Urgency
for a feature providing constant benefit for a period beginning at a fixed date

We can see this case effectively combines the previous two, with a period of low or negative CoD, followed by approximately linear CoD up to the end of the opportunity.

We have taken some time here to look at the 4 curves (Cash Flow, Cumulative NPV, Cost of Delay and Urgency) for these 4 different types of feature because it is easy to confuse between them. In the case of the "constant benefit" item, the Cumulative NPV and CoD look almost identical. This has caused some confusion and some inaccurate statements about the use of CoD. Take care!

One of the observations to make about the graphs shown so far is that to estimate and derive them for real features would be difficult and error-prone. While this is true, one should not conclude from it that we should therefore estimate a completely different entitiy, which is easier but not well correlated with the scheduling decisions we wish to make! (Sadly, you may also come across some advice like that.)

However it does suggest that looking at archetype profiles for different types of work item may be helpful. Kanban [4] for example, defines 4 archetypes for CoD which are typically used to define different Classes of Service. Black Swan Farming [5] also suggests some archetypes. The Kanban archetypes do not correspond exactly to the types of feature discussed above, though there is some overlap.
Kanban's Cost of Delay Archetypes, from Essential Kanban Condensed
The archetypes show 4 CoD profiles:
  1. Expedite items are very urgent (high CoD per week) and there is no end in sight to the cost - if you wait the losses don't come to an end. It's a straightforward decision - do it now! 
  2. The Fixed Date items also have high impact but only if you miss the deadline. The scheduling imperative here is to make sure you start before the last responsible moment and deliver before the deadline. 
  3. The Standard profile is approximately linear to start with and tails off as the opportunity loses value. Standard items should therefore be done as soon as possible and scheduled relative to each other according to the degree of urgency and the item's size (see later discussion of WSJF). 
  4. Finally, Intangible items have an apparently low urgency. One might ask why do them? Two reasons. The intangible profile does indicate a rise in urgency - possibly a steep rise - will happen in the future. It is useful to make some progress on these items even though the impact in the short term is likely to be low. In addition having some items in the schedule which are "interruptible" makes the system more resilient in the event of expedite items having to be handled, or events which threaten the service level agreement for standard items.
So how might a workshop gather the quantified information that we need for scheduling the work item options based on cost of delay. Here's a generalised profile of cost of delay and urgency that (roughly) covers all the profiles we have discussed within the precision we could reasonably expect from such a workshop.
Using this profile we can ask for 3 parameters that give enough detail for us to schedule the items. There are 2 dates (t1 and t2) and the slope of the CoD line (or urgency). Before t1 there is low or zero CoD - it's the "CoD low until date" (CLUD. After t2 there is also low or zero CoD - it's the "CoD low after date" (CLAD).

Armed with this information about CoD and urgency profiles, we can now move forward to consider the WSJF method itself. To use it we need information about the urgency, the urgency profile and the duration will be taken by implementation of the work item.

This is considered in the next blog in this series.

Read part 3 now: How to calculate WSJF

Back to part 1: Understanding Cost of Delay and its Use in Kanban

Friday, April 15, 2016

Understanding Cost of Delay and its Use in Kanban

Cost of Delay (CoD) is a vital concept to understand in product development. It should be the guide to the ordering of work items, even if - as is often the case - estimating what it will be is difficult. Cost of Delay is important because it focuses on the business value of work items and how that value changes over time. An understanding of Cost of Delay is essential if you want to maximise the flow of value to your customers.

Don Reinertsen in his book Flow [1] has shown that, if you want to deliver the maximum business value with a given size team, you give the highest priority, not to the most valuable work items in your "pool of ideas," not even to the most urgent items (those whose business value decays at the fastest rate), nor to your smallest items. Rather you should prioritise those items with the highest value of urgency (or CoD per week) divided by the time taken to implement them. Reinertsen called this approach Weighted Shortest Job First or WSJF (sometimes pronounced wizjiff!).

In this series of articles, of which this is the first, we return to the topic of Cost of Delay (previously addressed 3 years ago in Selecting Backlog Items By Cost of Delay), and how CoD can be applied in Kanban. I'll explain the terminology used in the recently published book Essential Kanban Condensed [2] - including why this differs slightly from that used by some other authors - and how you can apply this knowledge in Kanban, potentially combining it with the use of Classes of Service.

Here are the links to the articles in this series:

Part 1: Understanding Cost of Delay and its Use in Kanban (this article)
Part 2: Cost of Delay Profiles
Part 3: How to Calculate WSJF
Part 4: WSJF - Should you divide by Lead Time or by "Size"?
Part 5: Others may follow...
Let's start with some definitions, by looking at a particular work item, a proposal for a new feature in a software product. Let's assume that we've already carried out some analysis of this feature and the competitive market in which the product operates. As a result we can forecast the cash flow - in and out - that will result from the implementation and exploitation of the feature.

Here's what the cash flow looks like...

To know what the Cost of Delay is for this feature we need to estimate what the cash flow would be if we delayed starting this work and instead started in say 10 weeks or 20 weeks time. Here's a comparison of these 3 different cash flows, with no delay, 10 weeks delay and 20 weeks delay.
The analysis seems to be forecasting that not only will the peak revenue be lower by entering the market later, the time period for exploiting the feature profitably is also shorter. To see the effect of this on the overall value of the feature, it is useful to plot a cumulative value, see below...

Now we can see what the value of this feature is if it is implemented without delay - about $420K. We can also see the loss of value - the Cost of Delay - for a 10-week and a 20-week delay.

The next step is to plot the Cost of Delay against the length of the delay. This graph is often referred to as the CoD profile. There are a number of archetypes that different authors have identified [4, 5] that can help us identify the likely profile in given scenarios. We'll look at these in more detail in Part 2 of this series. Here's the CoD profile for our feature:

This shows our feature is losing value most rapidly right now! As value is lost so the rate at which value is lost is also diminishing. At a certain point the projected revenue from the feature becomes less than the development cost so there is no value in implementing the feature and no further Cost of Delay.

We refer the rate at which value is lost as Urgency (the first derivative of Cost of Delay), but other authors use Cost of Delay Per Week or (unfortunately in my view) sometimes just Cost of Delay. It is important therefore when reviewing materials on CoD to clarify whether the term is measured in currency (e.g. $) or currency per length of delay (e.g. $ per week). Here is the plot for Urgency (CoD per week) for our example:

We can see from this graph that Urgency is diminishing in this case as the market opportunity is also disappearing. Reinertsen and Preston Smith [3] noted that the sense of urgency in organisations often runs in the opposite direction to the market opportunity - they named it the Urgency Paradox, the "cruel tendency" for this sense of urgency in product development to be highest when the real urgency, as reflected by market opportunity, is lowest and vice versa.

We will see in future articles in this series how different kinds of work item have different CoD and Urgency profiles, and how we can use this and WSJF to help the scheduling of work to maximise the delivery of business value. We'll also examine the degree to which a quantified approach (using numerical estimates of business value and its rate of decay) can be used in practice, and whether alternative approaches such as the use of CoD profile archetypes can be as or more effective.

Now read part 2: Cost of Delay Profiles


[1] Donald G. Reinertsen. The Principles of Product Development Flow, (United States: Celeritas Publishing. 2009)

[2] David J. Anderson and Andy Carmichael, Essential Kanban Condensed. (United States: Lean Kanban University Press. 2016)

[3] Preston G. Smith and Donald G. Reinertsen. Developing Products in Half the Time. (United States: John Wiley and Sons. 1998)

[4] David J. Anderson. Kanban: Successful Evolutionary Change for Your Technology Business (United States: Blue Hole Press, 2010)

[5] Joshua Arnold and Özlem Yüce. “Using Cost of Delay: Experience Report – Maersk Line.” Black Swan Farming. (2013)

[6] Ian Carroll. “No Correlation Between Estimated Size and Actual Time Taken.” (2016)

[7] Joshua Arnold. "Qualitative Cost of Delay." Black Swan Farming. (2016)

Thursday, March 17, 2016

Kanban's Survivability Agenda and Antifragility

A conversation on the kanbandev online forum has triggered this post. The discussion concerns how evolutionary change is applied, particularly when the fitness landscape is changing to such a degree that large rather than small steps are needed to survive in the new competitive environment. It got me thinking that we must consider evolutionary change on more than one level if we want to address what the Kanban Method calls its Survivability Agenda.

The first mystery to consider is how evolution jumps across valleys in the fitness landscape. Seems to me there are 3 possibilities. You could make large leaps in what you think are promising directions. Doesn't sound a great idea because you're doing reasonably well as you are (different if you know you face an imminent existential threat, but it has the same likely outcome). You could wait for the peak you're climbing to decline in the fitness landscape, to the stage when small steps will move you off it. That's probably going to be too late. Or you rely on diversity. Your peak may be declining and you may - if trends continue - be doomed, but others are in better spaces and they will grow.

The final option sounds like disaster. But I think it is the way evolution works. Processes and technologies evolve much faster than biological organisms (see Eric Beinhocker's Origin of Wealth for more discussion of this) because the cycles of copying with differences, selection, amplification/damping are much shorter. Not only that, they are accelerating, which is what is now so threatening to large organisations. Does this mean large organizations must sit back and let the inevitable happen? Of course not. The key is to have multiple fragile parts, so the organization itself is more antifragile.

In Antifragility N N Taleb discusses how hierarchies can gain antifragility by allowing fragility within them. And also how natural antifragility can be irresponsibly eroded, if higher structures in the hierarchies (like governments) absorb the fragility of structures within them (like banks). Back to the Dinosaurs - they were antifragile as a species (genus? - I don't know; biologists please excuse) to most changes in the fitness landscape less than massive climate change. But since that was the limit of their antifragility they died out. But the higher level in the hierarchy (life on earth) survived (just) because there were some funny rat-like creatures running around scratching a living beneath the dinosaurs feet. They found themselves in the foothills of some pretty small peaks of the fitness landscape, and made the most of it.

So the levels in the hierarchy of Kanban (e.g. Personal, Team, Product, Portfolio) and its stress on the exploitation of real options, are keys to its "Survivability Agenda".* Portfolio management is key. It is where antifragility of the organisation can be built or lost. Portfolio Management must decide what level of investment different products and product ideas receive, and for how long before the return must be tangible. In a stable fitness landscape they might consider that the one successful product they have, should receive all the investment. This builds a monoculture which is vulnerable to shifts in the landscape. Keeping options has a cost but preserves the antifragility at the higher scale. Diversity within the organization and a culture which encourages innovation, learning and experimenting will build greater survivability. Note that in part this is because it tolerates and encourages more fragile technologies and processes within it. They are limited in their ability to survive - indeed they need to maintain the differences from more successful instances, precisely so that diversity is preserved. Eric Bienhocker has an excellent account of Microsoft's use of options when developing Windows. They also had teams investing in OS/2, Apple and Unix. Clearly it would not have been helpful if the Unix team say, thought the OS/2 option was better and started working on that instead of Unix.

In summary, I don't think Kanban provides any magic bullets here. Hopefully it exposes the issues in building resilient or antifragile organisations but it is down to the strategists, managers and leaders within these organisations as to how the tools and insights might be applied. Different groups make different choices. There is no recipe. That in my opinion why it remains one of the most interesting and important methods around.


Wednesday, August 05, 2015

What is Flow Debt?

+Daniel Vacanti's excellent treatise on Actionable Agile Metrics [vaca] introduces a term that may be unfamiliar, even to those with an interest and experience in managing flow systems. The term is Flow Debt - for a definition and explanation read on.

My own particular interest in flow systems is the management of agile software development teams, usually using some variant of Scrum and/or Kanban, and other agile practices such as test-driven (or automated-test intensive) build-test-deploy processes. However the discussion is relevant in many other domains, such as one I've recently been involved in discussing, the flow of patients through diagnosis, treatment and convalescence in healthcare systems.

In managing these systems we need ways to look at the mass of data that emerges from them to focus on the useful information rather than the noise; information in particular that indicates when intervention is appropriate to improve flow, and when the attempt would be as futile as trying to smooth the waves on an ocean. Flow systems in knowledge work contain variability. That variability, within certain bounds (much wider bounds than in manufacturing for example), is desirable to allow innovation, responsiveness and minimising wasteful planning activities.

In this context Flow Debt is a measure that provides a view of what is happening inside our system. This is in contrast with other important measures such as Throughput (Th) and the time an item stays in the process (I call this "Time in Process", TiP [macc], though other terms may be used). These measures provide information only after items have left the system, which may be too late to avoid problems accumulating.

Having Flow Debt roughly translates as: delivering more quickly now at the cost of slower times later. It is calculated by comparing the time since the number of arrivals into the systems was equal to the current number of deliveries with the average time in the process for the most recent deliveries. It is easiest to visualise this on a Cumulative Flow Diagram.
At the point highlighted in the the diagram it is a little over 2 weeks since the cumulative number of items entering the system equalled the cumulative number of deliveries on that date. If the items were delivered in the precise order they arrived, and if all the items were delivered (neither assumption is true!), then we would be able to say that the time the last item spent in the process was also a little over 2 weeks. Furthermore if arrivals and deliveries were smooth over the period, the Average Time in Process for the items would also be this same time.

What was the actual Average Time in Process though? Well you can't read this off the diagram. You have to look at the average TiP for the items delivered in the recent period. Each one has a known TiP, so take the average of them. Exactly how long the period you select for this average is up to you - a day or a week seems reasonable. The shorter the period you take the more noise there will be in the signal. Take too long a period though and there is insufficient time to act on the information. 

With this information we can calculate Flow Debt using Dan's method[vaca]:
Flow Debt = (Time since number of arrivals equalled deliveries) - Average TiP
If you plot this quantity for the data above you get a graph like this. Note I've reversed the sign on this graph to show Flow Debt as negative.
The plot of Flow Debt in this case is quite normal showing a fluctuation around zero and maxima and minima of around the value of the average TiP for the whole period. If you plotted the same data with a monthly average, most of this fluctuation would disappear. I certainly wouldn't want managers rushing down to this team to radically change their process!

There is one point highlighted which is interesting, where the Flow Debt goes from highest debt to highest credit in a few days. What do you think is going on here? Well, if you go back to the informal definition of Flow Debt (delivering more quickly now at the cost of slower times later), we should surmise that before this point the delivered items had been in the process for only a short time. Those delivered at or after this point had a longer time in the process. That's exactly what happened, as the Control Chart below shows.
Another useful indicator here is the "average age" of the work in progress. Here is the plot of that and you can see the significant drop in this metric at the same point.
Just by way of balance let's look at another data set of a team delivering software much less frequently, where their work in progress is increasing over the process, and where the items are not being delivered in age order. All these factors are likely to effect the efficiency and predictability of the flow system... and this is borne out by their plot of Flow Debt.
Seeing a plot like this is a indication to management (and flow management specialists in particular) to take a much closer look at the process being used here.


Wednesday, July 29, 2015

Beyond Control Charts and Cumulative Flow Diagrams

Control Charts (CCs) and Cumulative Flow Diagrams (CFDs) are powerful ways to display information about a flow system, such as a Scrum or Kanban development process. Unfortunately the very fact that the charts display so much information means that it is often difficult to extract specific information from them. That is why it's useful to also plot some of the key attributes of the systems on their own - this allows us to look at these aspects specifically, alongside the rawer view of the data that you get from CCs and CFDs.

The graphic on the right shows a number of diagrams all of which were derived from very simple data about each item that flowed through this system:
  • when it arrived into the system; 
  • when it departed the system; and
  • whether the item was "delivered" or "discarded".
Note: I use the term "discard" here as a general term to include an exit from the system at any point in the system and for any reason. It includes aborting/abandoning the item after commitment, as well as postponing the item by moving it back to a part of the process upstream from the system under study. For the definition of this and other terms used here please see this Glossary.
The first diagrams in the graphic is the Control Chart - actually this is simply a scatter plot of the time each item stays in the system under study. I refer to this as "Time in Process - TiP - or alternatively "Time in _______" where the blank stands for whatever the process or part of the process is under study. For example it could be the Time in Preparation, Time in Development, Time in Acceptance, etc. The scatter plot highlights (in orange) the items which were not "delivered".

Below it is the CFD. Unlike some very stripy versions, this one has only 3 bands (as limited by the input data), corresponding to arrivals, all departures (including discards), and deliveries.

The remaining diagrams all highlight one or more aspects of the same data. Firstly the terms from Little's Law:
  1. Average Delivery Rate. This is measured in items per week, and the average is taken over 1 week. Note this only shows actually delivered items. Alternatively a plot of "Throughput" could have been used which includes all items that have passed through the system.
  2. Average Time in Process (TiP). This is measured in weeks and again the average is taken over 1 week.
  3. Average Work in Progress (WiP). This is measured in number of items, again averaged over one week. Care must be taken when calculating average WiP for a day, particularly on days when an item arrives in or departs from the system, to ensure that it is consistent with the calculations of average TiP.
In addition to these standard quantities from Little's Law a number of flow balance metrics are shown. These are:
  1. Net Flow. Simply the difference between the number arriving and departing over the previous week.
  2. Delivery Bias. This is a measure of the degree to which Delivery Rate is higher or lower than would be predicted by Little's Law for the given period (1 week in this case). If it is non-zero it indicates away from stability. Further discussion of this quantity is found here.
  3. Flow Debt/Credit. This is a measure of the degree to which the average TiP varies from that predicted by the CFD. This also indicates a degree of instability if it varies significantly from zero. See Dan Vacanti's book [vaca] for further discussion.
  4. Age of WiP Indicator. This compares the average age of the WiP with half the average Tip. It is another indicator of imbalance.
Recently I have been discussing these four quantities with colleagues and with Troy Magennis and Dan Vacanti as they show promise for predicting significant changes in the TiP, a very important aspect of the effectiveness of the system.

A spreadsheet containing the means to generate these diagrams from your data will shortly be made available from gitHub. Watch this space!

  • [vaca] Vacanti, Daniel S. "Actionable Agile Metrics for Predictability: An Introduction". LeanPub. (2015)