Test Automation Pyramid – review
Test Automation Pyramid
This review has turned out to be a little longer than I expected. I have been using my own layering of the test automation pyramid and the goal was to come back and check it against the current models. I think that my usage is still appropriate but because it is quite specific it wouldn’t work for all teams. So I might pick and choose between the others when needed. If you write line of business web applications which tend to have integration aspects then I think this might be useful. The key difference from the other models is my middle layer in the pyramid is perhaps a little more specific as I try and recapture the idea of integration tests within the context of xUnit tests. Read on if you are interested – there are some bullet points at the end if you are familiar with the different models. Just a disclaimer that reference point is .Net libraries and line-of-business web applications because of my work life at the moment. I am suggesting something slightly different than say this approach but doesn’t seem that different to this recent one and you can also find it implicitly in Continuous Delivery by Farley and Humble.
A little history
The test automation pyramid has been credited to Mike Cohn of Mountain Goat Software. It is a rubric concerning three layers that is foremostly concerned with how we think about going about testing – what types of tests we run and how many of each. While various authors, have modified the labels, it has changed very little and I suspect that is because its simplicity captures people’s imagination. In many ways, it makes a elementary point: automating your manual testing approach is not good enough particularly if that testing was primarily through the GUI. Mike explains that this was the starting point for the pyramid, he says,
The first time I drew it was for a team I wanted to educate about UI testing. They were trying to do all testing through the UI and I wanted to show them how they could avoid that. Perhaps since that was my first use it, that’s the use I’ve stuck with most commonly.
Instead, change the nature of the tests – there is a lot of common sense to this. I’ll take for instance a current web application I am working on. I have a test team that spend a lot of time testing the application through the browser UI. It is a complex insurance application that connects to the company’s main system. Testers are constantly looking at policies and risks and checking data and calculations. In using the UI, however, they have tacit knowledge of what they are looking at and why – they will have knowledge about the policy from the original system and the external calculation engine. It is this tacit knowledge and the knowledge about the context – in this case, a specific record (policy) that meets a criteria – that is difficult to automate in its raw form. Yet, each time that do a manual test in this way the company immediately looses its investment in this knowledge but allowing it to be in the head of only one person. Automating this knowledge is difficult however. All this knowledge is wrapped into specific examples found manually. The problem when you automate this knowledge is that it is example-first testing and when you are testing against data like this today’s passing test may fail tomorrow. Or even worse the test may fail the day after when you can’t remember a lot about the example! Therefore, the test automation pyramid is mechanism to get away from a reliance on brittle end-to-end testing and the desire to simply automate your manual testing. It moves towards layering your testing – making multiple types of tests, maintaining boundaries, testing relationships across those boundaries and being attentive to the data which flows.
Here’s the three main test automation pyramids out there in the wild. Take a look at these and then I will compare them below in a table and put them alongside my nomenclature.
Test Automation Pyramid: a comparison
Cohn Mezaros Crispin Suggested (manual) (exploratory) UI** System GUI System Service Component Acceptance (api) Integration Unit Unit Unit/Component Unit
** in a recent blog post and subsequent comments Mike agrees with others that the UI layer may be better called the “End-to-End” tests. Mike points out that when he started teaching this 6-7 years ago the term UI was the best way to explain to people the problem. And that problem being that automating manual tests meaning automating GUI tests and that the result is (see Continuous Testing: Building Quality into Your Projects):
Brittle. A small change in the user interface can break many tests. When this is repeated many times over the course of a project, teams simply give up and stop correcting tests every time the user interface changes.
Expensive to write. A quick capture-and-playback approach to recording user interface tests can work, but tests recorded this way are usually the most brittle. Writing a good user interface test that will remain useful and valid takes time.
Time consuming. Tests run through the user interface often take a long time to run. I’ve seen numerous teams with impressive suites of automated user interface tests that take so long to run they cannot be run every night, much less multiple times per day.
Related models to test automation pyramid
Freeman & Price: while not a pyramid explicitly, they build a hierarchy of tests that correspond to some nested feedback loops:
Acceptance: Dos the whole system work?
Integration: Does our code work against the code we can’t change?
Unit: Do our objects do the right thing, are they convenient to work work?
Stephens & Rosenberg: Design Driven Testing: Test Smarter, Not Harder have four level approach in which the tests across the layers increase in granularity and size as you move down to unit tests – and business requirement tests are seen as manual. Although their process of development is closely aligned to the V model of development (see pp.6-8)
four principal test artifacts: unit tests, controller tests, scenario tests, and business requirement tests. As you can see, unit tests are fundamentally rooted in the design/solution/implementation space. They’re written and “owned” by coders. Above these, controller tests are sandwiched between the analysis and design spaces, and help to provide a bridge between the two. Scenario tests belong in the analysis space, and are manual test specs containing step-by-step instructions for the testers to follow, that expand out all sunny-day/rainy-day permutations of a use case. Once you’re comfortable with the process and the organization is more amenable to the idea, we also highly recommend basing “end-to-end” integration tests on the scenario test specs. Finally, business requirement tests are almost always manual test specs; they facilitate the “human sanity check” before a new version of the product is signed off for release into the wild.
The pyramid gets it shape because:
- Large numbers of very small unit tests – set a foundation on simple tests
- Smaller number of functional tests for major components
- Even fewer tests for the entire application & workflow
With a cloud above because:
- there are always some tests that need not or should not be automated
- you could just say that system testing requires manual and automated testing
- need multiple types of tests
- tests should be complimentary
- while there looks like overlap, different layers test at a different abstraction
- it is harder to introduce a layered strategy
So, as Mezaros points out, multiple types of tests aware of boundaries are important:
Compare and Contrast
Nobody disagrees that the foundation are unit tests. Unit tests tell us exactly which class/method is broken. xUnit testing is dominant here as an approach and this is a well documented approach regardless of classical TDD, mocking TDD or fluent-language style BDD (eg should syntax helpers, or given/when/then or given/it/should). Regardless of the specifics, each test tests one thing, is written by the developer, is likely to have a low or no setup/teardown overhead, should run and fail fast and is the closest and most fine grained view of the code under test.
There are some slight differences that may be more than nomenclature. Crispin includes component testing in the base layer whereas Mezaros puts it as the next layer (which he defines the next layer as functional tests of major components). I’m not that sure that in practice they would look different. I suspect Crispin’s component testing would require mocking strategies to inject dependencies and isolate boundaries. Therefore, the unit/component of Crispin suggests to me that the unit test can be on something larger than a “unit” as long as it still meets the boundary condition of being isolated. One question I would have here for Crispin is in the unit tests would you include for instance tests with a database connection? Does this count as a component or an api?
The second layer starts to see some nuances but tends toward telling us which components are at fault. These are tested by explicitly targeting the programatic boundaries of components. Cohn called his services while Crispin, API. Combine that with Mezaros’ component and all are suggesting that there is size of form within the system – a component, a service – is bigger than the unit but smaller than the entire system. For Mezaros, component tests would also test complex business logic directly. For Cohn, the service layer is a the api (or logical layer) in between the user interface and very detailed code. He says, “service-level testing is about testing the services of an application separately from its user interface. So instead of running a dozen or so multiplication test cases through the calculator’s user interface, we instead perform those tests at the service level”. So what I think we see in the second layer is expedient testing in comparison to the UI. Although what I find hard to see is whether or not this level expects dependencies must be running within an environment. For example, Mezaros and Crispin both suggest that we might be running FIT which often require a working environment in place (because of the way they are written). In practice, I find this a source of confusion that is worth considering. (This is what I return to in a clarification of integration testing.)
On to the top layer. This has the least amount of tests and works across the entire application and focuses often on workflow. Cohn and Crispin focus on the UI to try and keep people about keeping these tests small. Mezaros makes a more interesting move to subsume that UI testing into system testing. Inside system tests he also uses acceptance testing (eg FIT) and manual tests. Crispin also account for manual testing in the cloud above the pyramid. Either way manual testing is still an important part of a test automation strategy. For example the use of record-and-replay tools, such as, selenium or WATIR/N have made browser-based testing UI testing easier. You can use these tools to help script browser testing and then you have the choice whether to automate or not. However, UI testing still suffers from entropy and hence is the hardest to maintain over time – particularly if they are dependent data.
Here are some of the issues I find coming up out of these models.
- the test automation pyramid requires a known solution architecture with good boundary definition that many teams haven’t made explicit (particularly through sketching/diagrams)
- existing notions of the “unit” test are not as subtle as an “xUnit” unit test
- people are too willing to accept that some boundaries aren’t stub-able – often pushing unit tests into the service/acceptance layer
- developers are all too willing to see (system) FIT tests as the testers (someone else’s) responsibility
- it is very hard to get story-test driven development going and the expense of FIT quickly outweighs benefits
- we can use xUnit type tests for system layer tests (eg xUnit-based StoryQ or Cucumber)
- you can combine different test runners for the different layers (eg xUnit, Fitnesse and Selenium and getting a good report)
- developers new to TDD and unit testing tend to use test-last strategies that are of the style best suited for system layer tests – and get confused why they see little benefit
Modified definition of the layers
Having worked with these models, I want to propose a slight variation. This variation is in line with Freeman and Pryce.
- Unit – a test that has no dependencies (do our objects do the right thing, are they convenient to work with?)
- Integration – a test that has only one dependency and tests one interaction (usually, does our code work against code that we can’t change?)
- System – a test that has one-or-more dependencies and many interactions (does the whole system work?)
- Exploratory – test that is not part of the regression suite, nor should have traceability requirements
There should be little disagreement about what good xUnit tests: it is well documented although rarely practised in line-of-business applications. As Michael Feathers says,
A test is not a unit test if:
* it talks to the database
* it communicates across the network
* it touches the file system
* it can’t run at the same time as any of your other unit tests
* you have to do special things to your environment (such as editing config files) to run it
Tests that do these things aren’t bad. Often they are worth writing, and they can be written in a unit test harness. However, it is important to keep them separate from true unit tests so that we can run the unit tests quickly whenever we make changes.
So I have a check list unit tests I expect to find in a line-of-business unit test project:
- model validations
- repository validation checking
- action redirects
- any helpers
- services (in the DDD sense)
- service references (web services for non-Visual Studio people)
Integration tests may have a bad name if you agree with JB Rainsberger that integration tests are a scam because you would see integration tests as exhaustive testing. I agree with him on that count so I have attempted to reclaim integration testing and use the term quite specifically in the test automation pyramid. I use it to mean to test only one dependency and one interaction at a time. For example, I find this approach helpful with testing repositories (in the DDD sense). I do integration test repositories because I am interested in my ability to manage an object lifecycle through the database as the integration point. I therefore need tests that prove CRUD-type functions through an identity – Eric Evans (DDD, p.123) has a diagram of an object’s lifecycle that is very useful to show the links between an object lifecycle and the repository.
Using this diagram, we are interested in the identity upon saving, retrieval, archiving and deletion because these touch the database. For integration tests, we are likely to less interested in creation and updates because they wouldn’t tend to need the dependency of the database. These then should be pushed down to the unit tests – such as validation checking on update.
On the web services front, I am surprised how often I don’t have integration tests for them because the main job of my code is (a) test mappings between the web service object model and my domain model which is a unit concern or (b) the web services use of data which I would tend to make a system concern.
My checklist in the integration tests is somewhat less than unit tests:
- service references (very lightly done here just trying to think about how identity is managed)
- document creation (eg PDF)
System testing is the end-to-end tests and it covers any number of dependencies to satisfy the test across scenarios. We then are often asking various questions in the system testing, is it still working (smoke)? do we have problems when we combine things and they interact (acceptance)? I might even have a set of tests are scripts for manual tests.
Freeman and Pryce (p.10) write:
There’s been a lot of discussion in the TDD world over the terminology for what we’re calling acceptance tests: “functional tests”, “system tests.” Worse, our definitions are often not the same as those used by professional software testers. The important thing is to be clear about hour intentions. We use “acceptance tests” to help us, with the domain experts, understand and agree on what we are going to build next. We also use them to make sure that we haven’t broken any existing features as we continue developing. Our preferred implementation of the “role” of acceptance testing is to write end-to-end tests which, as we just noted, should be as end-to-end as possible; our bias often leads us to use these interchangeably although, in some cases, acceptance tests might not be end-to-end.
I find smoke tests invaluable. To keep them cost effective, I also keep them slim and often throw some away once I have found that the system has stabilised and the cost of maintaining is greater than the benefit. Because I tend to write web applications my smoke tests are through the UI and I use record-and-replay tools such as selenium. My goal with these tests is to target parts of the application that are known to be problematic so that I get as earlier a warning as possible. These types of system tests must be run often – and often here means starting in the development environment and then moving through the build and test environments. But running in all these environment comes with a significant cost as each tends to need its own configuration of tools. Let me explain because the options all start getting complicated but may serve as a lesson for tool choice. In one project, we used selenium. So we using selenium IDE in Firefox for record and replay. Developers use this for their development/testing (and devs have visibility of these tests in the smoke test folder). These same scripts were deployed through onto the website so that testing in testing environments could be done through the browser (uses selenium core and located in the web\tests folder – same tests but available for different runners). On the build server, the tests through seleniumHQ (because we ran Hudson build server) – although in earlier phases we had actually used selenium-RC with NUnit but I found that in practice hard to maintain [and we could have used Fitnesse alternatively and selenese!]. As an aside, we found that we also used the smoke tests as our demo back to client so we had them deploy on the test web server so that run through the browser (using the selenium core). To avoid maintenance costs, the trick was to write them specific enough to test something real and general enough not to break over time. Or if they do break it doesn’t take too long to diagnose and remedy. As an aside, selenium tests when broken into smaller units use a lot of block-copy inheritance (ie copy and paste). I have found that this is just fine as long as you use find-and-replace strategies for updating data. For example, we returned to this set of tests two years later and they broke because of date specific data that had expired. After 5 minutes I had worked out that the tests with a find-and-replace for a date would fix the problem. I was surprised to tell the truth!
Then there are the acceptance tests. These are the hardest to sustain. I see them as having two main functions: they have a validation function because they link customer abstractions of workflow with code (validation). The other is that they ensure good design of the code base. A couple of widely used approaches to acceptance type tests are FIT/Fitnesse/Slim/FitLibraryWeb (or StoryTeller) and BDD-style user stories (eg StoryQ, Cucumber). Common to both is to create an additional layer of abstraction for the customer view of the system and another layer than wires this up to the system under test. Both are well documented on the web and in books so I won’t go labour the details here (just a wee note that I you can see a .net bias simply because of my work environment rather than personal preference). I wrote about acceptance tests and the need for a fluent inteface level in a earlier entry:
I find that to run maintainable acceptance tests you need to create yet another abstraction. Rarely can you just hook up the SUT api and it works. You need setup/teardown data and various helper methods. To do this, I explicitly create “profiles” in code for the setup of data and exercising of the system. For example, when I wrote a Banner delivery tool for a client (think OpenX or GoogleAds) I needed to create a “Configurator” and an “Actionator” profile. The Configurator was able to create a number banner ads into the system (eg html banner on this site, a text banner on that site) and the Actionator then invoked 10,000 users on this page on that site. In both cases, I wrote C# code to do the job (think an internal DSL as a fluent interface) rather than say in fitnesse.
I have written a series of blogs on building configurators through fluent interfaces here.
System tests may or may not have control over setup and teardown. If you do, all the better. If you don’t you’ll have to work a lot harder in the setup phase. I am currently working on a project which a system test works across three systems. The first is the main (legacy – no tests) system, the second a synching/transforming system and the third which is the target system. (Interestingly, the third only exists because no one will tackle extending the first and the second only because the third needs data from the first – it’s not that simple but you get the gist.) The system tests in this case become one less of testing that the code is correct. Rather it is more one of does the code interact with the data from the first system in ways that we would expect/accept? As a friend pointed out this becomes an issue akin to process control. Because we can’t setup data in the xUnit fashion, we need to query the first system directly and cross reference against third system. In practice, the acceptance test helps us refine our understanding of the data coming across – we find the implicit business rules and in fact data inconsistencies, or straight out errors.
Finally manual tests. When using Selenium my smoke tests are my manual tests. But in the case of web services or content gateways, there are times that I want to just be able to make one-off (but repeatable) tests that require tweaking each time. I may then be using these to inspect results too. These make little sense to invest in automation but they do make sense to have checked into source leaving it there for future reference. Some SoapUI tests would fit this category.
A final point I want to make about the nature of system tests. They are both written first and last. Like story-test driven development, I want acceptance (story tests) tests written into the code base to begin with and then set to pending (rather than failing). I then want them to be hooked up to the code as the final phase and set to passing (eg Fitnesse fixtures). By doing them last, we already have our unit and integration tests in place. In all likelihood by the time we get to the acceptance tests we are now dealing with issues of design validation in two senses. One, that we don’t understand the essential complexity of the problem space – there are things in there that we didn’t understand or know about that are now coming into light and this suggests that we will be rewriting parts of the code. Two, that the design of code isn’t quite as SOLID as we thought. For example, I was just working on an acceptance test where I had do a search for some data – the search mechanism I found had constraints living the SQL code that I couldn’t pass in and I had to have other data in the database.
For me, tests are the ability for us as coders to have conversations with our code. By the time we have had the three conversations of unit, integration and acceptance, I find that three is generally enough to have a sense that our design is good enough. The acceptance is important because helps good layering and enforce that crucial logic doesn’t live in the UI layer but rather down in the application layer (in the DDD sense). The acceptance tests, like the UI layer, are both clients of the application layer. I often find that the interface created in the application layer not surprisingly services the UI layer that will require refactoring for a second client. Very rarely is this unhelpful and often it finds that next depth of bugs.
My system tests checklist:
The top level system and exploratory testing is all about putting it together so that people across different roles can play with the application and use complex heuristics to check its coding. But I don’t think the top level is really about the user interface per se. It only looks that way because the GUI is most generalised abstraction that we believe that customers and testers believe that they understand the workings of the software. Working software and the GUI should not be conflated. Complexity at the top-most level is that of many dependencies interacting with each other – context is everything. Complexity here is not linear. We need automated system testing to follow critical paths that create combinations or interactions that we can prove do not have negative side effects. We also need exploratory testing which is deep, calculative yet ad hoc that attempts to create negative side effects that we can then automate. Neither strategy aspires for illusive, exhaustive testing – as JB Rainsberger argues – which is the scam of integration testing.
Under this layering, I find a (dotnet) solution structure of three projects (
.csproj) is easier to maintain and importantly the separation helps with the build server. Unit tests are built easily, run in place and fail fast (seconds not minutes). The integration tests require more work because you need the scripting to migrate up the database and seed data. These obviously take longer. I find the length of these tests and their stability are a good health indicator. If they don’t exist or are mixed up with unit tests – I worry a lot. For example, I had a project that these were mixed up and the tests took 45 minutes to run so tests were rarely run as a whole (so I found out). As it turned out many of the tests were unstable. After splitting them, we got unit tests to under 30 seconds and then started the long process of reducing integration test time down which we got down to a couple of minutes on our development machines. Doing this required splitting out another group of tests that were mixed up into the integration tests. These were in fact system tests that tested the whole of the system – that is, there are workflow tests that require all of the system to be in place – databases, data, web services, etc. These are the hardest to setup and maintain in a build environment. So we tend to do the least possible. For example, we might have selenium tests to smoke test the application through the GUI but it is usually the happy path as fast as possible that exercises that all are ducks are lined up.
Still some problems
Calling the middle layer integration testing is still problematic and can cause confusion. I introduced integration because it is about the longer running tests which required the dependency of an integration point. When explaining to people, they quickly picked up on the idea. Of course, they also bring their own understanding of integration testing which is closer to the point that Rainsberger is making.
Some general points
- a unit test should be test first
- an integration test with a database tests is usually the test of the repository to manage identity through the lifecycle of a domain object
- a system test must cleanly deal with duplication (use other libraries)
- a system test assertion is likely to be test-last
- unit, integration and system tests should each have their own project
- test project referencing should cascade down from system to integration through to the standalone unit
- the number and complexity of tests across the projects should reflect the shape of the pyramid
- the complexity of the test should be inversely proportional to the shape of the pyramid
- unit and integration test object mothers are subtly different because of identity
- a system test is best for testing legacy data because it requires all components in place
- some classes/concerns might have tests split across test projects, e a repository class should have both unit and integration tests
- framework wiring code should be pushed down to unit tests, such as, in mvc routes, actions and redirects,
- GUI can sometime be unit tested, eg jQuery makes unit testing of browser widgets realistic
- success of pyramid is an increase in exploratory testing (not an increase in manual regressions)
- the development team skill level and commitment is the greatest barrier closely followed by continuous integration
- time taken to run tests is not inversely proportional to the shape of the pyramid
- folder structure in the unit tests that in no way represents the layering of the application
- only one test project in a solution
- acceptance tests written without an abstraction layer/library
- The Forgotten Layer of the Test Automation Pyramid
- From backlog to concept
- Agile testing overview
- Standard implementation of test automation pyramid
- Freeman and Price, 2010, Growing Object-Oriented Software, Guided by Tests
- Stephens & Rosenberg, 2010, Design Driven Testing: Test Smarter, Not Harder