Never leave a failing test

September 2012

Imagine this: you’re taking a guided tour of a nuclear power station. Just above the door as you come in there are five lights marked Key Safety Indicators. One of the lights is flashing red.

“What’s that flashing red light?” you nervously ask your host.

“Oh, that light does that from time to time. We’re not sure why; we just ignore it.”

There’s an awkward silence. How confident are you feeling right now?

Failing tests fester.

Red tests are like code rot. Catch them early and sort them out, and you’ll be fine. If you don’t, they’ll spread through your code like a disease, causing all sorts of damage:

  • Failures cause fear of change. If we don’t understand why a test is failing, we don’t understand the code base. If we don’t understand our code, we can’t change it safely. All bets are off: any change we make will cause us to be that little bit more anxious.

  • Failures breed failures. If one test continually fails, then other coders are more likely to tolerate failing tests, and the number of failing tests will grow quickly.

  • Failures kill urgency. There’s a scene in a well-known heist movie where a team of thieves has to break into a bank. Their strategy revolves around putting a remote-controlled car under a waste bin: they use this to move the bin at night, setting off all the alarm sensors. The first time the alarm goes off, the place is filled with police in a matter of seconds. The fifth time the alarm goes off, only one squad car with two bored officers turns up, totally unprepared for the waiting thieves, who quickly overpower them. The same is true of tests: if they fail all the time, developers will take a cavalier attitude to checking out the cause, and a really serious failure could be missed.

The only point at which failing tests are valid is when you’ve written them just before the code you plan to add. If the test should be failing, write code to make it work. If the test shouldn’t be failing, change it or delete it. Never leave it to fester.




Why video game coders don't use TDD, and why it matters

[Image: NES test station]

Whilst working on Sol Trader, I’ve written many unit tests for my code. Many of these tests have been written before the code itself, using a practice called Test-driven Development (TDD).

Test-driven development is the practice of writing a failing test to specify the behaviour of a piece of code, then writing the code to satisfy that test. We then refactor and improve our code from there.
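To make the cycle concrete, here’s a minimal sketch in C++ (the function and its behaviour are invented for illustration): the test is written first and fails, then just enough code is written to make it pass.

    #include <cassert>

    // Declared first so the test can be written before the implementation.
    int clamp_health(int value, int max);

    // Red: this test is written before the code, and fails until
    // clamp_health behaves correctly.
    void test_health_is_clamped_to_maximum() {
        assert(clamp_health(150, 100) == 100);
        assert(clamp_health(50, 100) == 50);
    }

    // Green: just enough code to make the test pass; refactor afterwards.
    int clamp_health(int value, int max) {
        return value > max ? max : value;
    }

    int main() {
        test_health_is_clamped_to_maximum();
        return 0;
    }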

In most programming communities, people are talking about TDD and trying to practice it. It’s even become an essential bullet point on job adverts, as if not practicing TDD makes us fundamentally worse programmers (which isn’t true, by the way). TDD seems to be everywhere.

Everywhere, that is, except the games industry.

Why is this? Is it because TDD is flawed in some way, or simply not applicable here, or because practices have grown up that remove the need for TDD?

The benefits of TDD

Let’s begin to answer this by looking at the specific advantages TDD gives us:

  • It forces usage-first coding. TDD represents another client for our code, independent from our production code. It asks specific questions of the codebase to ensure that it’s correct. It forces us to think about our code from the point of view of ‘what it does’ first, rather than ‘how it works’. This can often lead to surprising realisations about the code we actually need, and prevents us from writing spurious code that we might think we need but that actually represents wasted effort.

  • It helps us minimise code size and complexity. If we adhere strictly to the principle of only writing enough code to satisfy the test, then our tests should capture every possible path through our code. Additionally, we only have enough code to satisfy the exact problem we’ve used tests to define - this is important because code is a liability, not an asset. The same is true of code complexity. If we find ourselves writing reams of tests to satisfy a particular piece of code, that code is too complicated or too risky, and a prime source of bugs.

  • It provides design feedback. Well-designed code is easy to test. Therefore, when initial tests become harder to write, that might be because our code isn’t well designed, or well understood. Typically, pure functions are easier to test and reason about as they have no side effects: testing these functions is very easy, and our code therefore tends to gravitate towards them (see the sketch after this list).

  • It allows for verification over time. I’ve listed this last, as I don’t see it as the most important benefit of TDD. As I refactor my code, tests become simple enough to be self-evident and can safely be deleted. At the system level, ensuring that a complicated system continues to work after significant change is useful, but considerable effort is required to write tests that check an entire system safely. Poorly written tests give a false sense of confidence to those new to the practice, which can be highly dangerous - our tests can start lying to us. In practice, a few end-to-end tests to verify the basic functionality of a module are worth the effort, but many more will slow development and provide false confidence.
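As a small illustration of the design feedback point above (the damage formula here is invented): a pure function needs no setup, no mocking and no world state, so its tests stay trivial.

    #include <cassert>

    // Hard to test: void apply_damage(Entity& target) would depend on
    // hidden world state. Easy to test: a pure function whose output
    // depends only on its inputs.
    int damage_dealt(int attack, int armour) {
        int damage = attack - armour;
        return damage > 0 ? damage : 0;
    }

    int main() {
        // No setup or mocks: feed in inputs, check the outputs.
        assert(damage_dealt(10, 3) == 7);
        assert(damage_dealt(3, 10) == 0); // armour fully absorbs the blow
        return 0;
    }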

How games industry experts verify their code

Let’s look at expert coders in the games industry and discover what they do to gain these advantages. Here are some observations:

  • They write the production usage code first. Casey Muratori on Handmade Hero writes his code from the point of view of its usage first, writing the exact calling code before defining structures and basic methods. This gains many of the advantages of TDD, using production code rather like tests to discover what the code should do. By implementing a client for the code first, we discover what the code should look like before we write it.

  • They fail fast using assertions. Assertions are conditions in the code that are often only present in development builds, causing an artificial exception or crash when the condition is not true. They ensure that the running program stays within a known set of good states: code running in the wrong state is a hugely common cause of bugs. As unit tests only check a certain set of known states from the outside, assertions are useful for catching unexpected behaviour that wasn’t initially thought of. As the code tends toward purer functions with fewer possible states, the usefulness of assertions within that section of code diminishes. They also do not provide design feedback on the code in the way that TDD does. (The sketch after this list illustrates both this point and the next.)

  • They rely on static compilation to catch type errors. Static compilation is a form of testing. If we are thinking carefully about the distinct types we are using, avoiding Primitive Obsession, then this distinction between types will help ensure that we aren’t passing the wrong things to the wrong functions, or confusing distinct concepts in our code.

  • They use automated testing where the code is risky. John Carmack recently wrote about the value of testing in his essay on functional code:

"Whenever I come across a finicky looking bit of code now, I split it out into a separate pure function and write tests for it. Frighteningly, I often find something wrong in these cases, which means I'm probably not casting a wide enough net."-- John Carmack

What are games developers missing?

Games developers have a number of techniques that give them similar benefits to TDD. We see that by writing usage code first, developers get good feedback on their design as they go. Code verification over time is taken care of through judicious use of assertions and automated tests for risky code.

The area where games developers miss out by not using TDD is the reduction of code size and complexity. However, in high-performance computing, the size of the compiled system and its branching complexity are constant concerns anyway. There’s a real performance penalty for having too much code: it breaks branch prediction and forces too many memory accesses as the execution path jumps all over the place. The fastest and most efficient code boils down to data transformation, done as functionally as possible within the obvious constraints of the gaming environment.

If all of this is taken into account, games developers have side-stepped the need for TDD.

However, there are bad reasons to dismiss TDD in games. There’s a perception that games are too ‘emergent’ and complex to apply TDD to. This is false. Games are more deterministic than people think, especially in the inner workings of the code. Moving to a more functional programming style makes this explicit, although it often removes so much risk from the code that TDD’s design feedback becomes less useful.

There are clearly areas in games development where TDD is the wrong approach - games are about ‘feel’ and the ‘experience’, and we can’t test-drive ‘fun’, or test the output of complex interactions of hundreds of entities. Sometimes however TDD is dismissed because we cannot imagine how we might begin to test our code: this says more about the quality of our code than the merits of TDD as a practice.

Summary

Video game devs don’t do TDD for two reasons:

  • The good reason: the best practices in the industry deliver many of the same benefits as TDD.

  • The bad reason: an insufficient knowledge of TDD and good code design can lead people to believe it’s just not relevant to games. The smokescreen of “we cannot TDD fun” can mask a poor understanding of good coding architecture.

In practice, I attempt to TDD much of my low-level code, especially my functional core code which simply transforms data from one type to another. I use TDD where I’m weakest as a programmer: pointer and bitfield arithmetic aren’t my strong points, and therefore I like to test-drive them!
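For instance, here’s the kind of bit-twiddling I’d rather drive with tests (a hypothetical example, not Sol Trader code): packing two 12-bit coordinates into a single 32-bit word.

    #include <cassert>
    #include <cstdint>

    // Pack a 12-bit x and a 12-bit y coordinate into one 32-bit word.
    uint32_t pack_coords(uint32_t x, uint32_t y) {
        assert(x < 4096 && y < 4096); // both must fit in 12 bits
        return (x << 12) | y;
    }

    uint32_t unpack_x(uint32_t packed) { return (packed >> 12) & 0xFFF; }
    uint32_t unpack_y(uint32_t packed) { return packed & 0xFFF; }

    int main() {
        // Tests written first: a round trip must preserve both values.
        uint32_t packed = pack_coords(3000, 42);
        assert(unpack_x(packed) == 3000);
        assert(unpack_y(packed) == 42);
        return 0;
    }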

I don’t use TDD at all for UI code, for anything where the ‘feel’ of something is important, or for self-evident code.

TDD has helped to teach me about good code design, side effects, the perils of state, architecture, programming in a functional style and the evils of prevalent inheritance-based object-oriented approaches. Perhaps the real value is not in the continued practice, but in the lessons that it teaches?


That's not BDD, that's just Cucumber

Continuing in the vein of “concept and values vs concrete tools” (see my previous post about dependency injection), I’d like to highlight a common fallacy about Behaviour-driven Development (BDD) and Cucumber, and BDD and story-writing; namely, that they’re all the same thing.

BDD is a set of concepts and values, and Cucumber is one of many tools which we can use to work with those values. Using a tool such as Cucumber, or following a practice such as feature-writing, does not mean that you’ve internalised the values of BDD or understood what it really means.

Before I get into that, let me clearly explain the distinctions in my mind between the different terms.

Concepts, practices and tools

A concept or value is the higher level idea or principle we are attempting to espouse or instil. For example:

“Code only behaviour that has value the customer can see.”

“Write software that matters; avoid software that doesn’t.”

A practice is a way of expressing a concept: for example, a practice may take the form of guidelines about how to write features in a certain way, or the exhortation to use acceptance tests alongside other automated tests.

The tools are the different software programs we use to execute these practices. They are many and varied: popular BDD tools include Cucumber, RSpec, SpecFlow and others.

These distinctions are essential to prevent useless arguments about the relative importance of practices, and even more useless arguments about tools.

The concept should outlive the practice and tools

Test-driven development (TDD) is a good example of a series of concepts that has outgrown the tool and the practices that were originally associated with it. Most people don’t think of JUnit when they think of TDD, but the first TDD implementations used it extensively. The concept (test-driven coding) has transcended the tool (JUnit + Java).

TDD is also universally introduced using a practice called the “TDD cycle”: we are encouraged to write tests, then write code, then refactor. However, as the coder becomes more familiar with this cycle and follows it instinctively, TDD becomes much more about design than about “Red, Green, Refactor.” The coder outgrows the practice (although they may never abandon it entirely) and becomes intimately associated with the concept. This concept in TDD’s case can be quite difficult to describe, but might be partially summarised as clean, reliable code.

Scrum is a series of practices and tools used to illustrate agile concepts. Unfortunately, unlike with TDD, many who practice Scrum have never got past these practices to the principles behind them. Some people view Scrum as the Standup, or the Sprint, or perhaps the Backlog. The important concepts of team synchronisation, regular cadence, and progressive iteration can be lost in the noise.

If the concepts cannot outgrow the tools and practices we use to express them, then the concepts are weak, or the tools and practices are weak (or weakly understood.) Further, if we cannot envisage discarding a practice or tool, perhaps we haven’t fully grasped the concepts behind it yet.

We still need the practices

That’s not to say we can internalise concepts without good practices. We can’t just bleat “write clean, reliable code!” at someone and expect them to know what clean, reliable code truly is, and how to continue to write it when it’s difficult to do so. Without the understanding that comes from diligently applying “Red, Green, Refactor” over a long period, we will never gain full insight into the values behind TDD. I’ve been applying TDD practices for several years now, and I am still learning about the relationships between objects and how they can be improved.

In the same way, good tools suited to the practices we are trying to use will help us internalise concepts and values more quickly. A good example is the way that RSpec changed the language we use when writing unit tests, helping us focus on behaviour rather than just correctness.

BDD is a series of values and concepts, not practices or tools

Given the above, let’s consider the fundamental difference between BDD (the set of concepts) and Cucumber (the tool) or feature-writing (the practice).

BDD is the formalisation of the best of the underlying values and concepts discovered and propagated by TDD. It further formalises and extends the basic TDD practices and fuses them with other practices to help team communication. This spawned a number of new tools to aid the approach. In particular, new tools were needed to help non-technical people read and understand acceptance tests, although the old tools could still be used and many continue to use them.

Tools and practices are the most visible side effect of a series of concepts, and bandwagon-jumping is always a danger. Due to the popularity of the tools, BDD can unfortunately be conflated with the principal tools used to drive it. We should work hard to explain that this is not the case.

For example, I’ve heard some people talk about “writing the BDD, then writing the code” - reducing “BDD” to the tool (Cucumber) and the practice (feature-writing) rather than the fundamental concepts which give rise to the practice. To do so is to make the mistake that many do when learning Scrum, to miss the values by blindly using the tools and following the practices by rote.

A similar problem is the idea that BDD is simply the “Given When Then” approach to writing stories. That approach is a practice we use to clearly express the concept of communicating requirements, and valuing that communication process highly. The approach is not the value in itself.

I think part of the reason we have a tendency to do this is that internalising concepts is desirable but hard, and we’re seeking a quick road to success. We think “if we’re using Cucumber, we’re doing BDD,” or “if we’re writing stories, we’re being agile.” Sadly, this isn’t true, and although I can understand the motivation, there are no shortcuts to internalising the concepts. We need to carry out the practice, using the best available tools, whilst considering the values carefully - that is the long road to mastery.

In summary

When training BDD we communicate the values and concepts to our trainees, demonstrate the practices and tools, and help them to try them out. We teach the usage of the tools, and the correct way to complete the practices, referring back to the concepts as appropriate. This way, when trainees are on their own they’re able to head in the right direction, and will internalise the values in such a way as to be able to shed the initial tools (and even some of the practices) as they improve.

When learning something new, try to separate the values and concepts you are trying to internalise from the practices and tools that you are using to do so. Carry out the practices whilst considering the concepts, and always ask yourself “why am I doing this now?” and “what am I learning?” For myself, I’m currently attempting to apply the practice of immutability to my code, in an attempt to internalise more functional programming concepts. It’s early days, but it’s leading to interesting results.

We’re teaching this stuff

In case you hadn’t noticed, a few of us are starting to teach BDD in person at the moment. Our next courses are in Brussels and Edinburgh; instead of flailing about with the tools or hesitantly attempting the odd practice, come and learn what BDD is really all about.


Scenarios are not Acceptance Criteria

"That's all very well, but how do I know that it works?"

"What will that actually look like on screen?"

It can be hard to nail down a feature file. Some people like to argue over the wording of the preamble and jump into the scenario writing (much) later. Some prefer to get on with writing concrete examples to help jumpstart their thinking, and frame the story with the acceptance criteria later.

"So what's the point of this feature again?"

Whichever way we approach writing our feature files, it’s important that we iterate over our wording. Let’s not neglect either our acceptance criteria, or scenarios detailing concrete example behaviour. Without both, we’re making it hard for our developers to implement a feature, and making it hard for us to understand its purpose a few months down the line.

"Can you give me an example of that?"

It’s very easy to conflate the concept of scenarios with acceptance criteria: they aren’t the same thing. Scenarios are concrete examples of acceptance criteria: they help flesh out and explore complex criteria, and ground them in reality. Without concrete examples it can be hard to get a handle on where to start when implementing a feature, and it’s difficult to wrap our minds around what needs to be done.

Lack of acceptance criteria: hesitation and confusion

Here’s a feature without acceptance criteria:

    Feature: Relating two people

    Scenario:
      Given a person called Joe
      And a person called Bob
      When I set Joe to be the father of Bob
      Then their relationship is recorded in the system

When we skip the acceptance criteria and jump straight into examples, we lose context. It’s hard to see how and why this feature exists, or who is using it.

Example scenarios aren’t good at describing design and user experience constraints on a feature. Developers will be tempted to rush straight through the implementation without paying attention to the detail. They’re also no good at communicating the need for other edge cases. Is there something else that we’ve missed here? What about distinguishing between biological and adoptive parents, for instance? Or checks to ensure the father is old enough to have children?

Lack of concrete example scenarios: haziness and obfuscation

We might be tempted to shoe-horn all that information into the scenario:

    Scenario: Relating two people
      Given a father and two children
      When I relate them either as adoptive or biological parents
      Then the relationship should be recorded
      And the sibling relationships should be worked out
      And we are warned if the father appears too young to have children
      But only if the relationship is biological

This isn’t a real scenario any more. We’re trying to describe several different things in one place. It could be implemented as several different scenarios joined together, but by itself the lack of concreteness means that we can’t easily reason about it, and it’s also nigh on impossible to automate without skipping some of the intent. Using ‘Given’, ‘When’ and ‘Then’ does not automatically make something a concrete example - all this information belongs in the preamble.

Combining acceptance criteria with real examples

Let’s try and combine both these techniques:

    Feature: Relating two people
      As Robert the royal historian, I want to show parent/child relationships
      in my family history system so that I can track royal lineage over many centuries.

      * I should be able to relate people as parent and child very simply and quickly
      * Sibling relationships can be automatically worked out
      * Each person can have biological parents, and adoptive parents
      * Ensure we warn our historian if the father is too young

    Scenario: Relate Joe and Bob as father and son

    Scenario: Bob and Elaine are siblings as Joe is father of both
      Given a person called Joe
      And a person called Bob
      And a person called Elaine
      When I set Joe to be the father of Bob
      And I set Joe to be the father of Elaine
      Then Bob and Elaine are shown as brother and sister

    Scenario: Bill is the adopted father of Elaine and Bob
    Scenario: Charlie is too young to be the father of Joe

Have a look at the acceptance criteria as listed in the preamble. They both state the reason for the story and flesh out some more of the thinking. You can often leave the feature like this, with criteria in bullet form, up until the point you want to work on it. If the feature is complex and there’s a danger information will be lost, I’d recommend writing down examples during the planning of the story to properly capture the behaviour (as I’ve done here with the second scenario), but you don’t need to do this for every scenario until you come to automate it.

Summary

Think back on what you have just read. This post would have been hard to understand without the two examples above. Without concrete examples, it’s very easy to gloss over content.

Alternatively, if this post had just consisted of the two features above, followed by “Don’t do this! Any comments?”, our natural reaction would have been one of confusion. Don’t do what exactly? And what exactly should we do instead?

Just like a blog post without an example, or a teaching workshop without a practical element, acceptance criteria without a concrete example can lead to wishy-washy thinking. Similarly, if we just sit down and start working on something concrete without any clear context, we’ll struggle to see the reasons for doing it and we’ll miss edge cases. When we have both, we can be confident we understand.

Personally, I tend towards the second error: because I can read code, I sometimes fall into the trap of not making my examples concrete enough. Which of the two do you tend towards?

Postscript

For more, see Liz Keogh’s post on this topic from last year. For a slightly different point of view, check out Antony Marcano’s thoughts on scenario oriented acceptance criteria. Antony argues for using scenario titles as our list of criteria. I find it helpful to keep Scenario titles and Acceptance Criteria separate, as I don’t think there is always a clear mapping between the two. One is an evolution of the other, and it’s useful even when the scenario titles are written to keep the Acceptance Criteria around for context. What do you think?


Cucumber: the integration testing trap

“Why don’t people read my Cucumber features?”

It’s an often-heard refrain, and it can feel frustrating for developers. We carefully craft features that make sense to us and are reasonably easy for us to understand. We hopefully post them over to our product manager, but a glazed look comes over their face as they read them; they only get through the first half before becoming distracted, mumbling that the features “look fine” and moving on to something else. As developers, what can we do about this?

There can be many reasons this might happen, but one of them is that we could be writing our features to ensure our code is correct, rather than ensuring that it’s suitable. Perhaps we’re using Cucumber to write integration tests, not acceptance tests.

Acceptance tests != Integration tests

What’s the difference between the two? Both types of test sometimes look similar in code, but they are written from completely different points of view.

Integration tests check that the objects in our system work correctly together. Where unit tests check that the messages objects are sending and receiving are correct, integration tests check that the messages match up and the objects are playing nicely together.

It’s useful to have a few integration tests at the points in our codebase where the object interactions are critical. Too much testing at this level will lead to a slow test suite, though, and we’ll never be able to cover every eventuality - see my post and particularly J.B. Rainsberger’s excellent posts on integration testing.

Some people use the terms ‘integration test’ and ‘acceptance test’ interchangeably. The two may be written similarly, but integration tests are not acceptance tests. They are still written for the developer’s benefit. They still ensure that we’re building the thing right, not that we’re building the right thing.

Acceptance tests are a whole different ball game. When writing them, our tests are focused on the customer and on what they want built, rather than ensuring our own code fits together well. They’re oriented entirely around what the customer sees, not what we see. As developers, they’re not actually for us at all.

We probably need both types of test in our system. Many developers, though, while diligently writing integration tests, have never written an acceptance test in their lives. By conflating the two ideas, we’re missing the point: in order to do BDD properly there has to be a level of testing that isn’t about us, but about our customer.

Cucumber works best when the step code is oriented around the acceptance of the feature, rather than whether a feature’s code is correct. The difference is subtle but important. If we’re thinking of integration tests during our feature writing, then we’ll write our features in that fashion. Our steps will constantly need to be modified to fit our notion of what needs to be tested, which is why our customers tend to glaze over when they read them. The features will tend toward greater detail, as we’re testing correctness not suitability, and there will probably be lots of them. It’s difficult for a customer to keep up with these types of tests, and it’s not surprising they lose interest.
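To illustrate the difference, here’s a hypothetical pair of scenarios describing the same behaviour. The first is an integration test in Gherkin clothing, written in terms of our implementation; the second is written in the customer’s language.

    Scenario: Add a product to a basket (integration style)
      Given a user record exists with email "jo@example.com"
      When I POST to "/baskets" with product_id 17
      Then the baskets table contains 1 row

    Scenario: Add a product to a basket (acceptance style)
      Given Jo is a registered shopper
      When she adds a woolly hat to her basket
      Then her basket contains one item

The first needs rewording every time our implementation shifts; the second only changes when the behaviour the customer cares about changes.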

Sit down with your customer

I’m generally happy with developers drafting features and then bringing them to a discussion with their team to refine them and nail down exactly what’s needed. I don’t recommend this when you’re starting out, though, and if you’re finding that your customer isn’t even reading your features properly, then something is seriously wrong.

If this is happening, I suggest you take the ideas back to first principles and sit down with your customer to write a few feature files out before you start on the next piece of work. You could try and get them to suggest the wording for the first feature. They might attempt to suggest wording that would be more in keeping with a developer mindset, and struggle in the process. That’s exactly what Cucumber was created to avoid.

Ask them to describe the feature in their own words, and work together to get something down on paper which makes sense to both of you. If we remember that this feature isn’t for us to test our system’s correctness, but a blueprint to guide our development direction, then there should be no conflict. Try to adopt their words for the different concepts in your system, rather than defaulting to your own pat terms (perhaps ‘shoppers’, not ‘users’, for example). If in doubt, defer to the customer’s wording: don’t try and impose your own structure beyond the bare minimum needed to get the feature to run.

I’ve often said before that if no one is reading our features, we’re better off using RSpec. My thinking has evolved: perhaps many people miss the point of the outer part of the BDD cycle entirely - the tests are about the customer, not us. If we’re only using Cucumber for integration testing, we are better off using RSpec. Whatever tool we use, we need to make sure it’s giving value to customers, not layering on integration tests for our own benefit.

If you like what you read, and you'd like to learn more, a quick reminder that Matt Wynne and I are running BDD Kickstart in London this coming December.

The power of feedback

"Everyone has a story that makes me stronger." -- Richard Simmons

There’s something about feedback. Whether it’s the validation of your latest idea, a hit on your webpage showing up in Google Analytics, or something as simple as a passing test, it’s a valuable and important motivational commodity, and it can shape the direction we’re going in very precisely.

Feedback is the engine at the root of software techniques as diverse as pairing, TDD, BDD and the Lean Startup movement. Why is feedback so powerful?

Feedback shortens the loop

Any sort of feedback represents the end of a creative loop that started when we began to work on whatever we’re receiving feedback about. The shorter that loop, the more quickly we can respond to change, and the more agile we can be. It also helps us know when we’re done working on something and it’s time to move on.

That’s partly why TDD is so powerful: we receive instant feedback on what we’re working on and we are never more than a few minutes away from a fully working system. It’s also why good quality customer feedback is powerful: we’re never more than a few iterations away from the feature the customer wants.

Feedback validates us and our work

The validation of our work is one of the things that lies at the root of pairing: the constant code review and the camaraderie keep us motivated and working on something longer than we could manage on our own. I’ve found programming on Sol Trader alone to be an enlightening experience - I’ve learnt how important it is to have others working alongside me. I now have a graphics expert reviewing my code, and more design and artistic help to keep me motivated to turn out releases.

It’s also incredibly motivating to receive a “thank you!” or a “looks great!” There’s a lot of power in simple encouragement. If we know our work is appreciated and valued, we’re likely to work longer and with more energy on that next killer feature.

However, there’s a danger in only seeking pure validation, or (worse) coming to rely on it for motivation. If we receive too much positive validation, we’ll grow complacent and stop pushing for excellence; if we receive too little, we’ll get terminally discouraged. We should be seeking the kind of feedback that motivates us to shape our work for the better. We have to learn to ask the right questions.

Feedback shapes our work

If we let it, feedback will change the work we do and how we do it. This applies no matter how we receive feedback about our work - the different types of feedback will change our work in different ways, and we must therefore strive to increase both the quality and the variety of the feedback we receive, without falling into the trap of simple validation.

Done right, TDD offers more than just validation of our code; it gives us information about the quality of our code design. It causes us to shape our code differently and more carefully than code written without feedback. We can’t operate in isolation, though: TDD without feedback from stakeholders (whether through a technique such as Behaviour-driven Development or some other method) is incomplete - we get feedback that our code works, but nothing on whether it’s the right code.

There’s more: conversations such as Lean Startup are taking the BDD ideas one stage further. Instead of relying on the guesses of the stakeholders to determine what the right features are, how about harnessing feedback from the actual customers using the product? This can be done in various ways, through automatic metrics gathering and tracking experiments rather than features.

It’s my opinion that the Lean Startup conversation is certainly as important as the BDD conversation, and potentially as important as the Agile conversation, as it improves the variety of the feedback we receive on our work.

How are you finding feedback shapes your work? Are you getting the right kinds of feedback from a variety of sources? Or are you settling for pure validation?
