How We Built Automated Testing with Playwright

Eric Skram

Founder & Senior Staff Engineer

Jan 29, 2025

At Tempest, customer trust is our first priority, and ensuring that we ship reliable, high-quality software is a core tenet of that.

When we first started building Tempest, our testing approach relied heavily on manual and unit testing to catch bugs early. Fast-forward to today, and we’ve fully leaned into automated testing using Playwright.

This post explores how we evolved our testing process, from experimenting with different approaches to adopting Playwright for automated testing, which enabled us to ship faster, catch more bugs, and improve our developer experience.

Why automated testing?

We’re a small team building an ambitious product, so every hour invested in a project like this meant less time spent on major product features.

But as every release saw more bugs slip through our manual QA process, we were burning precious time triaging and fixing issues right before (and sometimes after) release. With our team size and an increasingly large product surface area, continuing to manually test every feature for every release just wasn’t sustainable and, frankly, wasn’t very good DevEx.

We knew prioritizing a better testing solution would be critical to delivering high-quality software to our customers. A couple of us blocked off a week to get up to speed with best practices and build a proof of concept.

Phase 1: Cucumber-js + Playwright the library

Our initial stack used cucumber-js backed by Playwright as the test executor. Compared to the boilerplate-heavy tests of yore, the Gherkin pseudo-natural-language syntax was a breath of fresh air. Here's a sketch of what one test for the Teams feature looked like (the step wording is representative, not verbatim):
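
# Representative sketch; not our exact feature file.
Feature: Teams
  Scenario: Create a new team
    Given I am logged in
    When I click "New Team"
    And I type "Platform" into "Team name"
    And I click "Create"
    Then I see "Platform" in the teams list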


But we quickly ran into some challenges:

  • Running Playwright in library mode limited the APIs and devtools available to us. Piping configuration options through the cucumber-js CLI to Playwright resulted in a bunch of boilerplate code.

  • Getting TypeScript & ES Modules to play nicely with cucumber-js was non-trivial. The rest of our codebase runs TypeScript natively (Deno) or zero-config with Vite; that wasn't our experience here.

  • Some tests would fail intermittently when parallelized at full speed, so we had to dial things back using Playwright's slowMo option and run the tests serially (sketched below).
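
Here's a rough sketch of what that dialing back looked like in library mode (the hook placement and the 100ms value are illustrative):

// Sketch: launching Playwright as a library from cucumber-js hooks.
import { AfterAll, BeforeAll } from "@cucumber/cucumber";
import { chromium, type Browser } from "playwright";

let browser: Browser;

BeforeAll(async () => {
  // slowMo delays each browser operation to reduce flaky parallel runs.
  browser = await chromium.launch({ slowMo: 100 });
});

AfterAll(async () => {
  await browser.close();
});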

Even with these challenges, it all worked! The entirety of Tempest could be run on a single machine, so the barrier to getting a test suite up and running locally was relatively low. From there, it was also straightforward to run the suite in our continuous integration (CI) environment.

This got us off the ground to start writing tests, but as time went on, the rough edges started to feel rougher. Test runs in CI were taking 20+ minutes for a comparatively small test suite and the developer experience for writing tests wasn't great. For a company focused on DevEx, this certainly didn't meet our quality bar. Time to keep iterating.

Phase 1.5: Playwright & code generation

During some late-night test debugging, we came across Playwright-BDD, which seemed like it could offer a meaningful step forward. We could run Playwright in "Test" (aka normal) mode and take advantage of all the great work the Playwright team put in to make it easier to write tests—while still being able to use our existing catalog of Gherkin tests.

Given this Gherkin test:
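
Feature: Playwright site

  Scenario: Check get started link
    Given I am on home page
    When I click link "Get started"
    Then I see in title "Installation"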


Code generation would spit out this test file:

// Generated from: sample.feature
import { test } from "playwright-bdd";

test.describe("Playwright site", () => {
  test("Check get started link", async ({ Given, When, Then }) => {
    await Given("I am on home page");
    await When('I click link "Get started"');
    await Then('I see in title "Installation"');
  });
});

Playwright-BDD allowed us to invert the relationship between Cucumber and Playwright. The Playwright-BDD CLI would parse our existing Gherkin tests and generate Playwright-compatible tests and fixture code, which would then be run by Playwright as if it were a hand-written test. Playwright-BDD recently dropped the cucumber-js dependency in favor of their own parser, which means that TS and ES Modules went back to Just Working™.
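
Wiring this up is mostly a matter of pointing playwright-bdd at your feature and step files from the Playwright config. A minimal sketch, with illustrative paths and option names taken from the playwright-bdd docs:

// playwright.config.ts: minimal playwright-bdd setup (paths are illustrative).
import { defineConfig } from "@playwright/test";
import { defineBddConfig } from "playwright-bdd";

const testDir = defineBddConfig({
  features: "features/**/*.feature",
  steps: "features/steps/**/*.ts",
});

export default defineConfig({ testDir });

Running npx bddgen && npx playwright test then regenerates the spec files and hands them to Playwright.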

Porting over the first couple tests and seeing them run visually in Playwright's UI mode was a huge win. Being able to see what elements were being clicked and visually debug tests was a dramatically better experience. With a comparatively low lift, we managed to fix a few of the largest pain points with our existing test suite.

Porting over a large number of tests also gave us a unique insight into the state of the entire codebase that we didn't have when writing tests during active development. With Gherkin syntax, you're trading flexibility for readability and reuse. The idea that you can reuse selectors (the automated interactions backing a Gherkin statement) across tests is alluring, and having a finite API for interacting with the Tempest app is great in theory. In practice, however, there were painful, repetitive patterns we kept running into, the foremost of which was scoping selectors and assertions:

  • For all interactions (clicking, typing, ...) and many assertions, Playwright needs a handle to a specific DOM element in the page.

  • It’s common for the same text to appear more than once on a page. As a human interacting with the website, I can easily tell the difference between a "Delete Team" button inside a confirmation dialog and a "Delete Team" button on the main page. Playwright needs more specificity to target the right one.

  • Playwright answers this by chaining selectors to define a scope for the interaction:

// Find the dialog.
const dialog = page.getByRole("dialog");
// Scope the selector/click to within the dialog.
await dialog.getByRole("button", { name: "Delete Team" }).click();

This didn’t play nicely with the Gherkin language. We weren't able to find a convenient way to define a scope for subsequent steps. What we really wanted to write was something like this:
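
# Hypothetical syntax: this kind of nested scoping isn't valid Gherkin.
When I click "Delete Team"
Then I see the confirmation dialog
And within that dialog:
  When I click "Delete Team"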


Playwright-BDD documents one way to do this using test contexts, but it was boilerplate-heavy and didn't achieve exactly what we wanted. Instead, we ended up writing more selectors specific to certain types of interactions, e.g. "click a button inside a modal" was a separate step from "click a button".

The only answer we had to all of these was... more selectors. As the depth of interactions we were testing increased (keyboard selection, right-clicking contextual menus, etc.), we found ourselves writing more and more pseudo-bespoke selectors and going through a guess-and-check process to find the right one for a given interaction. Here's one of our (admittedly more egregious) selector files:

When("I click {string}", async function (this: World, name: string) {
  await click(this.page, name);
});

When(
  "I click {string} at position {int}",
  async function (this: World, name: string, position: number) {
    await click(this.page, name, false, position);
  }
);

When("I click on {string}", async function (this: World, name: string) {
  await click(this.page, name, false);
});

When(
  "I click the {string} menu item",
  async function (this: World, name: string) {
    // abstraction around `click()`.
    await clickMenuItem(this.page, name);
  }
);

// about 100 more lines of these...

As a developer writing a test, would you know off the bat which selector you'd need to click on an element? I certainly didn't, and I wrote most of them. This problem was always there; it was just wallpapered over by the bigger problems mentioned above. We took a step back and thought about our desired outcome: why were we writing tests, and what was important?

The outcome we were after was shipping a reliable product and writing as many impactful tests as possible, as quickly as possible. We realized that our existing suite of Gherkin tests was, while effective, not the end goal of our testing adventure. Iterating on our testing technology stack while treating the Gherkin tests as a "fixed point" was a sunk cost fallacy. If using Gherkin would yield the best outcome, great. If not? We'd be better off doing something else.

I found that "something else" in this talk by Microsoft Senior Program Manager Debbie O'Brien.

Phase 2: Playwright & the IDE

or: How I Learned to Stop Worrying and Love the VS Code Plugin

Every approach we'd taken thus far had resulted in "effective" tests. They ensured that Tempest was working as expected; the suite would run and verify that we weren't introducing regressions. However, they all failed to measure up on developer experience. Migrating from cucumber-js to Playwright-BDD gave us a much more solid technical foundation and made running tests much better, but it did little to improve the process of writing tests. Our bet was that pushing for the best developer experience for writing tests would yield the best outcome.

Our initial experience with Playwright's UI mode was great and our entire team uses Cursor or VS Code—so what if we fully leaned into the Playwright-suggested method of generating tests using the VS Code extension? We tried this with a couple tests of varying complexity and it was an immediate hit on a few levels:

  • Recording tests by clicking around the app is a genuine game-changer

  • Dramatically less alt-tabbing between editor, testing UI, and terminal

  • Easily run individual tests from the editor, rather than by passing CLI flags
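
For a sense of what recording produces, here's a minimal sketch (the flow and selectors are illustrative, not lifted from our suite):

// Sketch of a recorder-style test; Playwright's codegen emits
// getByRole/getByLabel locators much like these.
import { test, expect } from "@playwright/test";

test("create a team", async ({ page }) => {
  await page.goto("/teams");
  await page.getByRole("button", { name: "New Team" }).click();
  await page.getByLabel("Team name").fill("Platform");
  await page.getByRole("button", { name: "Create" }).click();
  await expect(page.getByText("Platform")).toBeVisible();
});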

Once we'd reworked the authentication/setup flow (I'll cover this in a future post!), our process for migrating the remaining Gherkin tests was to:

  1. Stub out a Playwright test file and hit record

  2. Use the Gherkin test as a guide to click around the app

  3. Save the test and move on to the next one

  4. No, really, that's it

For repeated interactions (such as authenticating with the Tempest GitHub app in Recipes), we abstracted them out into Playwright fixtures, and it was easy to do this after the initial tests had been written. The most painful part of the old process, starting from a blank canvas and figuring out which selectors to compose to verify a behavior, was completely eliminated.
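
A minimal sketch of what such a fixture can look like (the githubAuth name and UI flow below are hypothetical, not our actual integration code):

// Sketch: wrapping a repeated interaction in a Playwright fixture.
import { test as base } from "@playwright/test";

export const test = base.extend<{ githubAuth: void }>({
  // Hypothetical fixture; tests opt in by destructuring `githubAuth`.
  githubAuth: async ({ page }, use) => {
    await page.goto("/settings/integrations");
    await page.getByRole("button", { name: "Connect GitHub" }).click();
    await use();
  },
});

Tests that need it just add it to their arguments: test("...", async ({ page, githubAuth }) => { ... }).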

So, how'd it go?

Pretty well! Qualitative feedback from the team is positive, and we've found that engineers run the test suite more frequently during development to catch regressions. People have also been writing more exhaustive tests to accompany their work: rather than testing just the happy path, we're seeing more coverage of edge cases. We also found and fixed several product bugs during the migration itself, which was a good sign.

There's something ironic about our destination on this journey being "just do it the way Playwright suggests," but it's a good reminder to trust the authors of great projects to understand the best way to use their tools.

Next steps

Our goal as an engineering team is to move as quickly as possible without sacrificing quality. By embracing the latest innovations in automated testing with Playwright, we’ve been able to ship faster and with more confidence.

Of course, there’s always more work to do. We’re continuing to invest in visual regression testing and in automating our accessibility checks with Axe, so we can keep delivering the reliable, high-quality software our customers expect from us.
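
A minimal sketch of the kind of accessibility check we're adding, using the @axe-core/playwright package:

// Sketch: an automated accessibility scan with axe-core.
import { test, expect } from "@playwright/test";
import AxeBuilder from "@axe-core/playwright";

test("home page has no detectable a11y violations", async ({ page }) => {
  await page.goto("/");
  const results = await new AxeBuilder({ page }).analyze();
  expect(results.violations).toEqual([]);
});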

If you liked this, follow me on Bluesky, LinkedIn, and GitHub!
