Production tests: a guidebook for better systems and more sleep

Your customers expect your site to be fully working whenever they need it. This means you need to aim for near-perfect uptime not just for the site itself, but for all features customers may use.

Modern software engineering uses quality control measures such as automated test suites and observability tools (tracing, metrics, and logs) to ensure availability. Often overlooked in this landscape are production tests (also known as synthetics), which can give you immediate notification of failures in production.

Production tests can be set up with minimal fuss—usually within one sprint—and can provide a high return on investment. In this post, I will cover how to best set up production tests and how they can help with reliability, deployments, and observability.

While I have always liked production tests, I gained a real appreciation for them at Atlassian, where they are used extensively and are called “pollinators”. I have seen first-hand how they can give early warnings of problems, which can then be fixed before they become incidents.

What are production tests?

A production test is any automated test that runs on the production environment. The test runs on a frequent schedule so that an on-call engineer can respond quickly. Typically, it runs every minute. The test might use a headless browser to emulate user actions, or it may call an API directly to emulate the actions of browser code or a backend service.

The production test should run in a reasonable time. I suggest 30 seconds or less, so that you can run the test easily once per minute. A test that takes longer than 30 seconds is probably too complex for a production test anyway. How the test deals with failure is up to your team. It could integrate with your on-call paging system, send a Slack notification, or just log an error to your logging system.
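
To make this concrete, here is a minimal sketch of an API-driven production test. It assumes Node 18+ (for the built-in fetch); the endpoint, test account, and Slack webhook are placeholders, and the failure handling should be wired into whatever alerting your team actually uses:

    // A minimal API-driven production test. It runs once; a scheduler (e.g. cron)
    // invokes it every minute.
    const API_BASE = "https://api.example.com";                // placeholder
    const SLACK_WEBHOOK = process.env.SLACK_WEBHOOK_URL ?? ""; // placeholder

    async function checkLogin(): Promise<void> {
      const res = await fetch(`${API_BASE}/v1/login`, {
        method: "POST",
        headers: { "content-type": "application/json" },
        body: JSON.stringify({
          user: "synthetic-test-user", // hypothetical dedicated test account
          password: process.env.TEST_PASSWORD,
        }),
      });
      if (!res.ok) throw new Error(`login returned HTTP ${res.status}`);
    }

    async function main(): Promise<void> {
      try {
        await checkLogin();
        console.log("production test passed");
      } catch (err) {
        // Report the failure; this could instead page on-call or just log an error.
        await fetch(SLACK_WEBHOOK, {
          method: "POST",
          body: JSON.stringify({ text: `Production test failed: ${String(err)}` }),
        });
        process.exitCode = 1;
      }
    }

    void main();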

How do production tests help?

Production tests help make your production environment more reliable by giving immediate warning of a regression. This means you can potentially fix issues before a customer discovers them.

In addition, the production test can be used as a canary before deployments, and as such acts as an integration test. The test can detect regressions that are caused by mismatches with other services, such as API shape or error-handling issues. You can run the test when deploying to development and staging environments too, to get a warning of an issue during development.

Production tests also make it easier to debug production issues. If you have an issue you are investigating, knowing which production tests are passing and which are failing gives you insight into what the issue might be. If you rely on other teams’ services, and they also have production tests, you can look at their tests too to help with diagnostics.

Having production tests will reduce the time to recovery for any incident that requires human intervention, as the engineer learns about the issue sooner and has more information at hand to resolve it.

Important design considerations for production tests

For production tests to be worth having, they need to be well thought out. If a test keeps failing, it will probably get silenced and ignored. Even if the test is reliable, it can cause other problems, such as resource usage impacting downstream systems. You may even have to change your systems to be a bit more testable. Here are some tips to consider when setting up your production tests:

Keep your production tests basic

Your production tests should cover less ground than your automated test suites. You want the tests to be reliable enough that they don’t waste your time with false alerts, which lead to frustration, lost time, and possibly the disabling of alerts. The goal of a production test is to be the canary that warns you something has gone badly wrong. It gives you a head start on fixing it, hopefully before a customer is affected.

To illustrate the required simplicity, here are some examples of what I think are good candidates for a production test:

  • Log in, and confirm you are on the home page, showing that user’s name.
  • Load the main editor of your app and type in Hello. Confirm Hello is there. Reload the browser and confirm Hello is still there.
  • Call 4 API endpoints to do CRUD operations for your microservice, possibly using some fake data in the API.
  • Ping /health and check for a 200 response code within 250ms (sketched below).
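
As a concrete illustration, here is roughly what that last example might look like. This is a minimal sketch assuming Node 18+ (for the built-in fetch) and a placeholder URL:

    // Ping /health and expect an HTTP 200 within 250 ms.
    async function pingHealth(): Promise<void> {
      const controller = new AbortController();
      const timer = setTimeout(() => controller.abort(), 250); // abort if slower than 250 ms
      try {
        const res = await fetch("https://example.com/health", { signal: controller.signal });
        if (res.status !== 200) throw new Error(`expected 200, got ${res.status}`);
      } finally {
        clearTimeout(timer);
      }
    }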

Contrast this with these more complex and problematic production tests:

  • Run through a 25-step test that checks all the main functionality of the editor, asserting that elements are the correct size and are in the correct position.
  • Test that if you make calls in quick succession to the API, they will be added to the database in the same order.
  • Check that the credit card page can differentiate between Visa and Mastercard based on the card number.
  • Ping /health and check for a 200 response code within 1.5ms

The first examples are good because they are quick, reliable, and simple. They are not too functionality-specific, and so are unlikely to be broken by feature changes.

The bad examples, on the other hand, have problems:

The 25-step test will likely be flaky due to browser automation quirks and feature changes. You will spend a lot of time investigating test failures, only to conclude “It was just a one-off” or “Oh! It’s because we moved a button”.

The API calls in quick succession will sometimes fail due to network conditions changing the order in which the requests are received. It is not necessary to do something sophisticated in a production test—your goal is to know if something is horribly broken, not detect subtle timing issues.

The credit card test is probably OK in terms of not being flaky, but is a little too specific. You can almost entirely de-risk a bug in credit card UI behavior with tests that run before deployment, so do that instead.

The health check test expecting a fast response is likely to fail often enough to cause alarms that have no meaningful response. There are better ways to monitor latency in production that I will cover in a future post.

But… try to get some decent coverage

You are not aiming for anything near 100% code coverage; in fact, code coverage doesn’t really matter here. That said, production tests should cover more than just “load the home page”. This is an art, but a guiding idea is “if there is a serious problem, how likely are my production tests to detect it?”. You are balancing that against “how frequently will I get false reports of an outage?”.

You also may want to think about the value of the things you are testing. Is it a minor feature used by 1% of users, or is it the page where new customers sign up? The latter is particularly important for growth companies that rely heavily on customers self-serving to try out their software.

You don’t have to get this completely right from the beginning. You can add more tests next week. Furthermore, you can edit your existing tests, or even remove them too. Err on the side of too little coverage to begin with and consider adding more later. This way your team will get used to owning, tweaking and responding to production tests, without a cacophony of alarm bells!

As a rough guide, I would say test 3-5 simple things to begin with, with the goal of eventually moving towards a reasonable amount of coverage. What is reasonable? You’ll find out after trying things out and discussing outcomes with your team. There is no correct answer here.

Production tests are not health checks, but may overlap with them

A health check is usually a simple check that the server is running. For example, if using Node.js and Koa or Express, you might add a health route that just returns a success response on any invocation.
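
A minimal sketch of such a route, using Express with TypeScript (the Koa version is similarly small):

    import express from "express";

    const app = express();

    // Returns 200 on any invocation: the process is up and able to serve requests.
    app.get("/health", (_req, res) => {
      res.status(200).json({ status: "ok" });
    });

    app.listen(3000);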

The purpose of this check is to assess basic server health. It allows load balancers to know which nodes to send traffic to, and deployment pipelines to know when a deployment has completed (or failed).

We expect these health checks to fail in production, possibly without affecting the customer. For example, if a machine goes offline, the load balancer will detect this and stop sending traffic to that node. However, if this has no adverse effect on the consuming service or the customer, then there is no urgent problem. The system has self-healed.

Calling this health check on a node as a production test is not advisable, as it will cause false alarms. Even if it didn’t, a health check is not a good indicator of user experience.

The term “health check” is also sometimes used to refer to checks that do more than just confirm the server is up. E.g., they may check dependencies such as storage systems, caches, queues and so on. A production test that calls such a health check could give early warning of an issue before it becomes noticeable by customers.
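
Continuing the Express sketch from earlier, a deeper health route might look something like this; checkDatabase and checkCache are hypothetical helpers (e.g. a SELECT 1, a Redis PING) that resolve to true when the dependency responds:

    declare function checkDatabase(): Promise<boolean>;
    declare function checkCache(): Promise<boolean>;

    // A "deep" health check that also verifies dependencies. A production test
    // calling this route can flag a failing dependency before customers notice.
    app.get("/health/deep", async (_req, res) => {
      const checks = {
        database: await checkDatabase().catch(() => false),
        cache: await checkCache().catch(() => false),
      };
      const healthy = Object.values(checks).every(Boolean);
      res.status(healthy ? 200 : 503).json({ status: healthy ? "ok" : "degraded", checks });
    });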

Be mindful of how your production tests affect observability

Running a test every minute, or 1440 times a day, will show up quite a lot in logs, metrics, and traces. This is often a good thing, because regions or services with very low traffic are now a bit more “observable” than they would be without such tests.

The downside is that it may add costs by keeping resources spun up in those regions. Another downside is that, being fake traffic and sometimes requiring fake data (such as a fake user ID), the tests can add noise to your logs that you will sometimes need to filter out.
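
One common mitigation, sketched below, is to mark synthetic requests so downstream tooling can filter or label them. The header name here is an arbitrary choice, not a standard:

    // Test side: send an identifying header with every synthetic request.
    await fetch("https://api.example.com/v1/items", {
      headers: { "x-synthetic-test": "pollinator" },
    });

    // Service side (Express middleware, assuming an app as in the earlier sketch):
    // record the flag so log lines and metrics can be tagged as synthetic.
    app.use((req, res, next) => {
      res.locals.isSynthetic = req.header("x-synthetic-test") !== undefined;
      next();
    });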

Fake data considerations

Consider a monolithic application — a single server that serves a website and all of its functionality. You need to write a test that logs in, enters some data in a field, saves it and checks it got saved. Such a test has a few challenges.

First, you need to decide how it logs in. Does it have a real account? If it does, what stops that account from expiring (e.g., at the end of a free trial)? You may need a discount code or a special “fake credit card” that the system knows is for testing. If it is a “fake” account, then how does that work? Is it hard-coded somewhere?

The test is now generating data on each run. Will the success or failure of a previous run affect the current run of the test? Will running the test thousands of times a week use up storage space?

Think about these issues when planning the tests.
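
For the data-growth question in particular, one pattern worth sketching is to have the test clean up after itself, whatever the outcome. createNote, fetchNote, and deleteNote are hypothetical helpers around your own API:

    declare function createNote(text: string): Promise<string>;
    declare function fetchNote(id: string): Promise<{ text: string }>;
    declare function deleteNote(id: string): Promise<void>;

    async function testSaveAndReload(): Promise<void> {
      const noteId = await createNote("Hello");
      try {
        const saved = await fetchNote(noteId);
        if (saved.text !== "Hello") throw new Error("saved note did not round-trip");
      } finally {
        await deleteNote(noteId); // teardown runs even when the assertion fails
      }
    }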

The temptation is to do the easiest thing that gets it done: create an account using your staff email, worry about storage later, just get it working. This can come back to haunt you when the test breaks later and you need to deal with upgrading the account, or the member of staff whose email was used is on leave.

Making a production system testable may take some work and may require special switches, such as user fields or feature flags, so that behavior can differ slightly to facilitate the test.

If you use microservices, or a monolith with a separate auth service, there are ways to avoid needing a test user account. For example, if your service uses a JSON Web Token (JWT) to authenticate, the authentication server could support logins for test systems, and your service can declare which tests it authorizes to call it.
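
A sketch of what that might look like on the service side. It assumes the auth server issues JWTs carrying a hypothetical synthetic claim, and it uses the jsonwebtoken package to verify the signature:

    import jwt from "jsonwebtoken";

    function isAuthorizedSyntheticCall(token: string, publicKey: string): boolean {
      try {
        const claims = jwt.verify(token, publicKey) as { synthetic?: boolean; sub?: string };
        // The service declares which synthetic callers it accepts.
        return claims.synthetic === true && claims.sub === "pollinator-editor-test";
      } catch {
        return false; // invalid, expired, or wrongly signed token
      }
    }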

Three strikes before an alarm

If you have enough production tests, and run them on enough systems for long enough, you will get false alerts. These can be caused by all sorts of things: network problems that only affect the test side, quirks in browser automation, or a genuine problem that heals itself two seconds later.

A simple way to avoid these kinds of false alerts is to wait for three consecutive failures before raising any alarm, as sketched below.
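
A minimal sketch of that rule, assuming a long-lived runner that invokes the test on a schedule; logWarning and pageOnCall are hypothetical hooks into your logging and paging systems:

    declare function logWarning(message: string): void;
    declare function pageOnCall(message: string): void;

    let consecutiveFailures = 0;

    async function runAndMaybeAlert(test: () => Promise<void>): Promise<void> {
      try {
        await test();
        consecutiveFailures = 0; // any success resets the counter
      } catch (err) {
        consecutiveFailures += 1;
        logWarning(`production test failed (${consecutiveFailures} in a row): ${String(err)}`);
        if (consecutiveFailures >= 3) {
          pageOnCall("production test has failed 3 consecutive runs");
        }
      }
    }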

You might be giving me a side eye right now, though: should we really be ignoring these false alerts? Probably not. If you have a team ritual of looking through operations data, you can incorporate non-alerting production test failures into that ritual. The idea is that you prioritize it like any regular work, rather than making it an urgent thing that requires people to work out of hours, or stop their regular work.

Pros and cons of production tests

What is good about testing in production? I have hinted at a few of these already, but let’s get into the details:

  • Real world testing: Test suites—including unit tests and locally-run integration tests—that run on every build are definitely needed. However, even with all of these, there is nothing quite like test-driving the finished car, and that is what production tests are for. The proof of the pudding is in the eating.
  • Quality Control: The main reason for running tests is to know when something is broken. Despite all the automated and manual testing you do on your machine and in staging environments, production is always unique. There are some things that only happen the way they do in prod. And even if your staging environment is precisely the same, production will generally have more traffic. Running tests in production lets you know whether certain things are working in production.
  • Troubleshooting: If there is an incident where customers are impacted by a bug or outage, the tests are very useful. They may be failing or passing, but either way, their current state tells you something about the system, and can help narrow down the problem.
  • Observability for low traffic regions: If you deploy your service to multiple regions and some of those regions don’t get much traffic, the production tests will create synthetic traffic. Your observability metrics such as latency and reliability percentiles will be more meaningful with even just 1000 calls per day, than with close to zero traffic.
  • Safer Deployments: The same tests you use to monitor production can double up as tests you run to accept a blue/green deployment. These tests are then acting as continuous integration tests of last resort, adding to the assurance that your deployment won’t cause big issues.
  • Reuse in other environments: Just as you can use these tests for safer deployments, you can also use them in dev and staging environments, to get early detection of issues that your build pipeline cannot detect, such as integration issues with other services.

There are, of course some disadvantages to using production tests:

  • Setup and teardown challenges: Tests run on the real system, so you cannot wipe the database before each run. You need to figure out how to best set up the specific scenario you want to test for, and how to clean up after the test.
  • They sometimes need setup of scenarios: For example, to test that you can upgrade your account, you need an account set up that is ready to be upgraded, and a fake payment method that will be accepted. These scenarios may mean changing production code to support them too.
  • They can be flaky: Any test that runs on a real system can fail from time to time for various reasons. For example network issues between the test and the service or occasional timing issues in the browser the test uses. Flakiness of tests needs to be taken into account if you intend to wake someone up when the test fails.
  • They cause resource usage and costs: Running tests costs money. Tests are often run hundreds of times a day and across different geographic regions to get good coverage, so the compute cost adds up.
  • Human cost in maintaining tests: Not every test failure is a real issue, but each one needs to be investigated. Too many tests could leave you with a full-time job monitoring them and making them more robust.

Production tests vs. observability

You can also “test production” through monitoring of real traffic and looking for problems there. This deserves a post or series of its own, but in short you can check for things like:

  • Latency, e.g., alert me when the 99th percentile of latency for a particular endpoint is > 200ms for 3 periods in a row.
  • Reliability, e.g., alert me when more than 0.1% of requests fail with a 5xx code.
  • Assertions, e.g., alert me when a code path that is considered unexpected or impossible has been triggered.
  • Failures, e.g., alert me when a customer couldn’t perform a task because they got an error.

The good thing about observability-based alerts is that they are very simple to set up. They may need some minor code changes, plus detectors in your observability tools that look for the condition and take an action when it is met.

These alerts pair well with production tests. They have a different purpose, though, since they detect problems after they have happened and they piggyback on existing system use. Production tests help them by creating additional synthetic traffic to monitor, which lets you keep checking things when natural traffic falls off.

There is no “versus” though! You would do well to have both. For a particular need, either a production test or an observability-based alert might be the better fit. A rule of thumb: if observability can meet the requirement of alerting you to a problem, it will be easier to set up and tune to your needs than a production test.

Summary

Phew! We covered a lot of ground there. In a nutshell, adding well-designed production tests to your systems will help in a number of ways. You’ll get earlier warnings of issues, you will be able to fix them faster, you will get observability benefits, and you can also use production tests as a deployment rollout test.

It is worth adding production tests if you are not doing so already—it is a relatively small task to fit into your planning, and you will reap rewards. If you are already using them, keep reviewing them, and see what you need to add, remove, and tweak to get the most value from them as your systems evolve.

Happy testing!
