Leveraging AI to transform manual tests into automated scripts


Squeezing AI into automation might not solve your pain points, but it can alleviate problem areas you hadn’t considered

We’re all looking for new and innovative ways of improving software delivery. AI is assisting us in many areas of day-to-day test activities and we’d like to share one way we’re using it to alleviate project team pressures.

We focus primarily on using AI as a force multiplier, whilst recognising the capabilities typically available within teams. Test automation solves many problems, but not every project team has the capacity or the capability to use it. Ideally, you'd already have it in place as part of a shift-left approach with all the bells and whistles, but what if your team hasn't had the chance to get started?

Let's take you behind the scenes of the Ten10 Consultancy project to share our findings from our experiment in AI-driven test automation.

Scenario

We looked at how to lower the barrier to entry into automation testing for a project team that had some capacity but lacked the capability. There are many low/no-code solutions offering plenty of out-of-the-box capabilities but, in the spirit of innovation, we looked at how we could convert existing manual test scripts into automated tests by parsing expected/actual behaviour using GenAI, including whether those tests could later be converted into an existing low/no-code test solution!

Plan

As part of automating existing test scripts, we created an 'Automation Engine' which would consume our test scripts, converse with an LLM, and perform actions in a browser. An action could be as simple as loading a page, interacting with form inputs, or clicking elements and then waiting for content to appear.
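As a rough sketch of what such an engine can look like, the snippet below pairs Playwright with a placeholder call_llm helper; the helper and the JSON action format are illustrative assumptions rather than our exact implementation:

```python
import json
from playwright.sync_api import sync_playwright


def call_llm(prompt: str) -> str:
    """Hypothetical helper: send the prompt to your LLM endpoint and return its reply."""
    raise NotImplementedError("wire this up to your model endpoint")


def run_step(page, step_text: str) -> None:
    # Ask the model to translate one manual test step into a single structured action.
    reply = call_llm(
        "Translate this manual test step into one JSON object with keys "
        f"'action', 'selector' and 'value':\n{step_text}"
    )
    action = json.loads(reply)

    # Dispatch on a small, fixed set of browser actions.
    if action["action"] == "goto":
        page.goto(action["value"])
    elif action["action"] == "fill":
        page.fill(action["selector"], action["value"])
    elif action["action"] == "click":
        page.click(action["selector"])
    elif action["action"] == "wait_for":
        page.wait_for_selector(action["selector"])


with sync_playwright() as p:
    browser = p.chromium.launch()
    page = browser.new_page()
    for step in ["Load the home page", "Using the global search bar, search for Tea"]:
        run_step(page, step)
    browser.close()
```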

The output depended on the quality of the input; Garbage In, Garbage Out (GIGO) applies as always. LLMs are capable of incredible feats but can also fall at the easiest hurdle.

Our proof of concept (PoC) was inspired by web scraping bots. These bots have been around for a long time, functioning as spiders that iterate through links to find relevant content, or extracting information from predetermined formats for later manipulation.

With the infusion of AI, web scrapers have become smarter at handling varied conditions. We employed similar techniques in the PoC, making the automated test generation more autonomous. All we needed to do was combine modern web scraping techniques with our manual test scripts.

Results

We started with the test scripts we had for an accessibility-friendly website (which in turn makes it easier to automate) and found that the LLM could consume them and drive the browser. After minor tweaking, it could do so with minimal inference costs. Fortunately, the scripts it was consuming were detailed with sufficient information. So, what did this actually look like?

We had an input such as:

Step No: n
Step: Using the global search bar, search for Tea
Expected Behaviour:
Autocomplete should not fire until the third character has been entered.
All the autocomplete results should be relevant to the search term.
When completing the search, the user should be redirected to the search results page, clearly showing the search criteria at the top of the page, and all results should be relevant to the search criteria.

This 'simple' test step was loaded with a fair amount of complexity, which the LLM initially did not parse effectively. It successfully created automation code to identify the element, populate it, and search, but it either ignored the autocomplete steps or produced code that did not work.

To resolve this, we used a 'chain of thought' prompting technique so that the model would evaluate what it was going to do, up to a certain number of iterations. This resulted in many attempts at the same test step, which is not ideal if there are single-use data restrictions. We later added a step between the initial test and the automation engine to break the steps down further. This would then enter the first three characters, confirm the autocomplete, and finally complete the search.
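A minimal sketch of that intermediate decomposition step might look like the following; the prompt wording and the call_llm helper (as in the earlier sketch) are assumptions for illustration:

```python
def call_llm(prompt: str) -> str:  # hypothetical LLM helper, as in the earlier sketch
    raise NotImplementedError("wire this up to your model endpoint")


def decompose_step(step: str, expected: str) -> list[str]:
    """Ask the LLM to break one manual test step into small, single-purpose actions."""
    prompt = (
        "Break the following manual test step into an ordered list of atomic "
        "browser actions, one per line. Think through the expected behaviour "
        "step by step before answering.\n"
        f"Step: {step}\n"
        f"Expected behaviour: {expected}"
    )
    # Each non-empty line of the reply becomes one action for the automation engine.
    return [line.strip() for line in call_llm(prompt).splitlines() if line.strip()]
```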

This would transform the test step above into a series of actions, which looked like:

Action
Enter Tea into the global search field
Wait for the dropdown to appear containing items relevant to Tea
Complete the global search
Wait for the search results page to load
Confirm the page contains products related to Tea

At this point, we saw the value in how existing manual tests can be autonomously made automation-friendly and then autonomously automated later. Being able to break down the test into smaller actions, almost as if it were a clear plan, helps with the explainability of the test. However, it could be argued that this is simply a basic capture-and-record framework with extra steps.

Improvements

To make this more robust, we introduced the page object model. On every page event, code scraped possible selectors to assist the LLM's decision-making. The selectors it chose were then added to the page objects for future use, along with additional descriptions to help rank selectors (and further aid the LLM's decision-making).
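A simplified sketch of that selector scraping, assuming a Playwright page and an illustrative way of describing elements, might look like this:

```python
from dataclasses import dataclass, field


@dataclass
class PageObject:
    url: str
    # Selector -> human-readable description, used to help the LLM rank candidates.
    selectors: dict[str, str] = field(default_factory=dict)


def scrape_candidate_selectors(page) -> dict[str, str]:
    """Collect likely-interactable elements and describe them for the LLM."""
    candidates = {}
    for el in page.query_selector_all("a, button, input, select, [role]"):
        el_id = el.get_attribute("id")
        label = el.get_attribute("aria-label") or el.inner_text()[:40]
        tag = el.evaluate("e => e.tagName.toLowerCase()")
        # Illustrative only: a real implementation would build richer, unique selectors.
        selector = f"#{el_id}" if el_id else tag
        candidates[selector] = f"<{tag}> {label!r}"
    return candidates


def record_page(page, registry: dict[str, PageObject]) -> PageObject:
    # On every page event, refresh the page object so future runs can reuse its selectors.
    po = registry.setdefault(page.url, PageObject(url=page.url))
    po.selectors.update(scrape_candidate_selectors(page))
    return po
```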

There were security concerns about code being generated by the LLM and immediately executed without human oversight. To overcome this, we moved away from generating code that would later be fed into an "eval" command and instead generated JSON that could be parsed, containing the LLM's intended action. This, in turn, reduced the amount of hallucination in the model's responses, as the available actions were always in its immediate context, which it could retrieve through one of its agent "tools".
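The shape of that contract might look something like the sketch below: the model returns a structured action drawn from a fixed set, which the engine validates and dispatches instead of executing generated code (the action names and keys here are illustrative):

```python
import json

# The only actions the engine will carry out; anything else is rejected outright.
ALLOWED_ACTIONS = {"goto", "fill", "click", "wait_for", "assert_text"}


def parse_action(llm_reply: str) -> dict:
    """Parse and validate the model's JSON reply rather than eval-ing generated code."""
    action = json.loads(llm_reply)
    if action.get("action") not in ALLOWED_ACTIONS:
        raise ValueError(f"Unsupported action: {action.get('action')!r}")
    return action


# Example of the kind of reply we expect from the model:
example = '{"action": "fill", "selector": "#global-search", "value": "Tea"}'
print(parse_action(example))
```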

Hallucinations are always going to be a problem with GenAI. When we asked the model to return the appropriate locator to interact with, it had a habit of combining two selectors into one, so we had to provide it with separate locator identifiers. Implementing a secondary identifier in the form of a UUID, and asking it to return the selector's UUID, eliminated this hallucination issue.
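A small sketch of that idea: each candidate selector is given an opaque UUID, and the model is asked to return only the UUID, so a merged or invented identifier surfaces as an error instead of a broken selector (helper names are ours for illustration):

```python
import uuid


def index_selectors(selectors: list[str]) -> dict[str, str]:
    """Give each candidate selector an opaque UUID the model can reference."""
    return {str(uuid.uuid4()): sel for sel in selectors}


def resolve_choice(llm_reply: str, indexed: dict[str, str]) -> str:
    """Look up the selector for the UUID the model returned."""
    key = llm_reply.strip()
    if key not in indexed:
        # A combined or made-up identifier simply won't be in the map.
        raise KeyError(f"Model returned an unknown locator id: {key!r}")
    return indexed[key]
```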

Unfortunately, the PoC also had an undesirable tendency to be 'positive'. For example, when it was trying to find text that was not present on the page, it would match similar text and pass the step, believing this to be correct. This results in false negatives, which are among the worst things an automation framework can produce! As such, combining this with a visual testing approach is crucial to increase confidence and reduce risk.

An example of this is when we provided the following prompt:

Then I can see that a message has appeared about rejecting work

This was an edge case to see if it would pass or fail given that the text on the page was as follows:

You’ve rejected analytics cookies. You can change your cookie settings at any time.

In this example, the step was marked as passed because the GenAI thought it was correct. To stop this from occurring, we had to provide strings on which it would do an exact text match to determine whether the text was present or not. This isn't ideal, given it is not as effective at self-healing. There are other approaches to handle this, highlighting the importance of running a visual testing approach alongside.
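A minimal sketch of the exact-match check we fell back to, using the cookie-banner edge case above (the helper name is ours for illustration):

```python
def assert_exact_text_present(page_text: str, expected: str) -> None:
    """Pass only if the expected string appears verbatim in the page text."""
    if expected not in page_text:
        raise AssertionError(f"Expected text not found: {expected!r}")


page_text = "You've rejected analytics cookies. You can change your cookie settings at any time."
try:
    assert_exact_text_present(page_text, "a message has appeared about rejecting work")
except AssertionError as exc:
    print(exc)  # the step is now correctly reported as a failure, not a pass
```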

When it came to regular execution, we had the AI regenerate the test on failure as a form of self-healing. This would be done either for the standalone step or for the full script, and the regeneration would be noted in the report. Combining this with visual testing approaches would reduce the risk of false negatives, especially when regenerating entire tests on the fly. The regeneration increased the cost per execution; however, it would reduce maintenance costs in the long run, as long as an element of visual regression is in place.
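Sketched at the level of a single step, and assuming helpers like the run_step and call_llm functions from the earlier sketches, the self-healing loop might look like this:

```python
def run_with_self_healing(page, step_text: str, attempts: int = 2) -> dict:
    """Execute a step; on failure, ask the model to regenerate it and retry."""
    current = step_text
    for attempt in range(attempts):
        try:
            run_step(page, current)
            # Record whether the step needed regenerating so it shows up in the report.
            return {"step": step_text, "regenerated": attempt > 0, "status": "passed"}
        except Exception as exc:
            current = call_llm(
                f"This test step failed with '{exc}'. Rewrite the step so it can "
                f"be retried against the current page:\n{current}"
            )
    return {"step": step_text, "regenerated": True, "status": "failed"}
```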

[Flowchart: the Ten10 AI script experiment]

Findings

Overall, we were impressed by how effective it was at going from zero to hero. However, the immediate restrictions we encountered concerned large websites and non-accessibility-friendly websites. We found that pages with many HTML elements would exceed the available context window, and a different approach was required. The model we used had a 32k context window, and we were running inference from a secure VM to trial how the approach could be used on sensitive projects with data considerations. Using a public model such as ChatGPT would almost certainly perform better and allow you to be charged on a pay-per-use model instead.

The approach made light work of an accessibility-friendly website, quickly converting the regression tests into automatable tests that could be added to a CI/CD pipeline with ease. However, there will always be an element of technical involvement. Even specifying which environment to run against, and having the automation tests switch from one environment to another, would require some tweaking of the generated suite.

We found that, for the project it was being used in, the gap between the non-technical (prompt-driven scripts) and the technical (a platform to drive prompt-driven scripts with a browser) was significantly larger than with a typical 'manual testers write Cucumber tests' approach. However, the prompt-driven approach allowed tests to be driven that had not previously been defined, which Cucumber would require. The near-term use of this concept may be the middle ground between the two: converting manual regression scripts into Cucumber.

As always, each project is different and this approach will not suit all. Using the above GenAI approach to transfer all your manual tests into a regression pack is likely to leave you with a headache around data requirements and execution times. There are currently too many variables to do a blind transfer of all your manual tests. However, it is well worth considering when it comes to easing the process of introducing automation testing in a project that lacks both capability and capacity.

Although still in its infancy, we see great potential in this approach and are looking to refine our implementation to be more generic and to work effectively across varied projects. Certainly, the tendency for false negatives to slip by shows that there will still be a human in the loop for a while yet!

Interested in how you can leverage AI in testing? Speak to our consultants

We’re always testing and pushing the limits of how AI can be implemented into existing test practices to improve everything from the time it takes to write tests to test coverage and reliability. Speak with our consultants to learn how we can help you adopt AI and implement it into your solutions.