Mutation testing
for task-oriented chatbots
Pablo Gómez-Abajo, Sara Pérez-Soler, Pablo C. Cañizares,
Esther Guerra, Juan de Lara
{Pablo.GomezA, Sara.PerezS, Pablo.Cerro, Esther.Guerra, Juan.deLara}@uam.es
Modelling & Software Engineering Research Group
Universidad Autónoma de Madrid, Spain
18th – 21st June 2024
Motivation
• Conversational agents or chatbots are increasingly used to access
all sorts of services using natural language
• Like any other software, chatbots need to be tested
• Usually by defining test scenarios
• However
• There is currently a lack of methods to assess the quality of such
test scenarios
• The result is a high risk of buggy chatbots
What is a task-oriented chatbot?
• A task-oriented chatbot is a software application that users interact with in natural language
and that is designed to solve a specific task
• e.g., booking a ticket, ordering a pizza, setting a medical appointment
• Via text or speech recognition
• In recent years, the use of chatbots has increased
…and many more
• Since 2022, we also have open-domain chatbots (ChatGPT, etc.) which engage in conversations
on any topic, and which we do not cover in this work

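How do chatbots work?
[Diagram: the user sends an NL phrase to the chatbot, which matches an intent (intent1 … intentn), extracts parameters, may query an external service, builds the response and returns it to the user]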
1. The user sends a natural language
message to the chatbot
Utterances (user says):
• Hi there!
• I need to fly from Madrid to Salerno on Wednesday at 12 PM
• Good bye!
2. The chatbot tries to match the
message with an intention

Intent: Match the user interaction with
an intention
User says | Intent
Hi there! | Greet
User says | Intent
I need to fly from Madrid to Salerno on Wednesday at 12 PM | Book a flight
How is the intent matched? By providing training phrases: a set of examples that users can use to
express an intention. They are required for matching inputs with intents
Training phrases: a set of examples
that users can use to express an intention
● Must be provided with the intent
Training phrase | Intent
Hi there! | Greet
Hello | Greet
Hi | Greet
Hey | Greet

Training phrase | Intent
Airplane ticket from Madrid to Rome tomorrow at 1 pm | Book a flight
Flight from Madrid to Napoli on 17/06/2024 at 11:30 | Book a flight
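The matching step can be pictured as a nearest-training-phrase lookup. The sketch below is only an illustration: it assumes a bag-of-words cosine similarity and a hypothetical threshold of 0.2, whereas real chatbot platforms typically use trained NLU models. The names `bag_of_words`, `cosine` and `match_intent` are made up for this example; the training phrases are the ones from the slides.

```python
import math
import re
from collections import Counter

# Training phrases per intent, copied from the slides.
TRAINING_PHRASES = {
    "Greet": ["Hi there!", "Hello", "Hi", "Hey"],
    "Book a flight": [
        "Airplane ticket from Madrid to Rome tomorrow at 1 pm",
        "Flight from Madrid to Napoli on 17/06/2024 at 11:30",
    ],
}


def bag_of_words(text):
    """Lower-case the text and count its alphanumeric tokens."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two token-count vectors."""
    dot = sum(a[token] * b[token] for token in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def match_intent(utterance, threshold=0.2):
    """Return the intent of the most similar training phrase.

    Returns None when no phrase reaches the (assumed) threshold,
    which is where a fallback intent would normally take over.
    """
    query = bag_of_words(utterance)
    best_intent, best_score = None, 0.0
    for intent, phrases in TRAINING_PHRASES.items():
        for phrase in phrases:
            score = cosine(query, bag_of_words(phrase))
            if score > best_score:
                best_intent, best_score = intent, score
    return best_intent if best_score >= threshold else None
```

With these phrases, `match_intent("Hi there!")` selects Greet and the flight request from the slides selects Book a flight, while an utterance that shares no words with any training phrase returns `None`.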
3. The chatbot extracts information from
the message or asks for missing information
User says | Intent
I need to fly from Madrid to Salerno on Wednesday at 12 PM | Book a flight
At this point, the chatbot extracts key information from the input: parameters
• from: Madrid
• to: Salerno
• when: Wed. at 12 PM
Entities: City, Time
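Parameter extraction can be sketched with an entity made of literals plus a regular expression. This is a simplified illustration under stated assumptions: the City literals, the time pattern and the helper names `extract_parameters` and `missing_parameters` are hypothetical, and real platforms ship their own entity recognisers.

```python
import re

# Hypothetical City entity defined by its literals.
CITY_LITERALS = ["Madrid", "Salerno", "Rome", "Napoli"]
CITY = "|".join(CITY_LITERALS)

# Hypothetical Time entity: a weekday followed by an hour, e.g. "Wednesday at 12 PM".
TIME_PATTERN = (
    r"(?:Monday|Tuesday|Wednesday|Thursday|Friday|Saturday|Sunday)"
    r" at \d{1,2}(?::\d{2})? ?(?:AM|PM)"
)


def extract_parameters(utterance):
    """Extract the from/to/when parameters that appear in the utterance."""
    params = {}
    origin = re.search(rf"\bfrom ({CITY})\b", utterance)
    if origin:
        params["from"] = origin.group(1)
    destination = re.search(rf"\bto ({CITY})\b", utterance)
    if destination:
        params["to"] = destination.group(1)
    when = re.search(TIME_PATTERN, utterance)
    if when:
        params["when"] = when.group(0)
    return params


def missing_parameters(params, required=("from", "to", "when")):
    """List the required parameters the chatbot still has to ask for."""
    return [name for name in required if name not in params]
```

For the utterance on the slide all three parameters are found; for a shorter request such as "I need to fly to Rome", `missing_parameters` reports the ones the chatbot would have to prompt for.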
4. Build the response and send it back
to the user
● Responses to the user:
○ text, images
● External service queries
○ External REST API
○ Database, etc.
User says | Action
I need to fly from Madrid to Salerno on Wednesday at 12 PM | The price of the ticket is $150. Provide a card nº and billing name
Both responses to the user and external service queries are actions
Testing chatbots
Testcase input | Testcase output
Hi there! | Hi! How can I help you?
… complete conversations
Testing chatbots
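We use Botium and Rasa-test as the testing tools for the chatbots
Single test interaction (convo file, one conversation step):
#me Hi there!
#bot What day do you want to come in?
Combination of multiple tests (convo file referring to utterances and responses):
#me GREET_UTTERANCES_USER
#bot GREET_RESPONSES_USER
Multiple user utterances (GREET_UTTERANCES_USER):
Hi there!
Hi
Hello
Hey
Possible responses (GREET_RESPONSES_USER):
Hi! How can I help you?
Hello, what do you need?
Greetings! This is the flight ticket assistant Antony, how can i help you?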
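The idea of combining several user utterances with several accepted responses can be sketched in plain Python. This is not the Botium or Rasa API: `run_greet_test` and the two stub bots are hypothetical stand-ins, and only the utterance and response lists come from the slides.

```python
# Utterances and accepted responses, copied from the slides.
GREET_UTTERANCES_USER = ["Hi there!", "Hi", "Hello", "Hey"]
GREET_RESPONSES_USER = [
    "Hi! How can I help you?",
    "Hello, what do you need?",
    "Greetings! This is the flight ticket assistant Antony, how can i help you?",
]


def run_greet_test(bot):
    """Send every greeting to the bot and collect the unaccepted replies.

    `bot` is any callable mapping an utterance to a reply (an assumption
    made for this sketch). An empty result means the test passes.
    """
    failures = []
    for utterance in GREET_UTTERANCES_USER:
        reply = bot(utterance)
        if reply not in GREET_RESPONSES_USER:
            failures.append((utterance, reply))
    return failures


def working_bot(utterance):
    """Hypothetical bot that always greets back with an accepted response."""
    return "Hello, what do you need?"


def broken_bot(utterance):
    """Hypothetical bot that answers a greeting with an unrelated prompt."""
    return "What day do you want to come in?"
```

Each utterance is checked against the whole set of accepted responses, so one test definition covers every combination of the two lists.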
Testing chatbots
User: Hi there!
Chatbot: Hi! How can I help you?
User: I need to fly from …
Chatbot: The price of the ticket …
User: I lost my baggage
Chatbot: Please, provide the flight ticket id
… and complex conversations
Mutation testing for chatbots
Intent: Order a coffee
User says:
• What kinds of coffee are available?
• What kinds of coffee can I order?
• What can I drink here?
• Tell me what drinks there are
Action: You can take an espresso or an americano
Intent: Order a wine
User says:
• What kinds of wine are available?
• What kinds of wine can I order?
• What can I drink here?
• Tell me what drinks there are
Action: You can take an Italian wine or a French wine
User NL phrase: Tell me what kinds of coffee I can drink here
Mutation operator applied to Order a coffee: K2Pmin (keeps the two most different phrases)
Semantic similarity values shown on the slide: 0.512, 0.538, 0.475, 0.474
Mutant after applying K2Pmin to Order a coffee:
Intent: Order a coffee
User says:
• What can I drink here?
• Tell me what drinks there are
Action: You can take an espresso or an americano
Intent: Order a wine (unchanged)
User NL phrase: Tell me what kinds of coffee I can drink here
[Diagram: the test suite is run against the mutated chatbot]
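The K2Pmin step in this example can be sketched as follows. The slides do not say how semantic similarity is computed or how "most different" is ranked, so this sketch makes two loudly stated assumptions: token-set Jaccard overlap stands in for semantic similarity, and the "most different" phrases are those with the lowest average similarity to the other phrases of the same intent. The function names are hypothetical.

```python
import re


def tokens(phrase):
    """Lower-cased set of alphanumeric tokens in a phrase."""
    return set(re.findall(r"[a-z0-9]+", phrase.lower()))


def similarity(a, b):
    """Jaccard overlap of token sets (a stand-in for semantic similarity)."""
    ta, tb = tokens(a), tokens(b)
    union = ta | tb
    return len(ta & tb) / len(union) if union else 0.0


def k2p_min(phrases):
    """Keep the two phrases that are, on average, least similar to the rest.

    Assumption: "most different" means lowest average similarity to the
    other training phrases of the same intent.
    """
    if len(phrases) <= 2:
        return list(phrases)

    def average_similarity(i):
        others = [p for j, p in enumerate(phrases) if j != i]
        return sum(similarity(phrases[i], p) for p in others) / len(others)

    keep = sorted(range(len(phrases)), key=average_similarity)[:2]
    # Preserve the original order of the kept phrases.
    return [phrases[i] for i in sorted(keep)]


ORDER_A_COFFEE = [
    "What kinds of coffee are available?",
    "What kinds of coffee can I order?",
    "What can I drink here?",
    "Tell me what drinks there are",
]
```

Under these assumptions, `k2p_min(ORDER_A_COFFEE)` keeps "What can I drink here?" and "Tell me what drinks there are", the same two phrases that remain in the mutant on the slides.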
Mutation operators for chatbots
Operators for training phrases
DPmax Deletes the most representative phrase of an intent
DPmin Deletes the most different phrase of an intent
DPWP Deletes training phrases with required parameter
DPWL Deletes training phrases with literal
K2Pmax Keeps the 2 most representative phrases
K2Pmin Keeps the 2 most different phrases
MPmax Moves the most representative phrase to the most similar intent
MPmin Moves the most different phrase to the most different intent
Operators for intents
DIP Deletes intent parameter
DPP Deletes parameter prompt
SPO Sets required parameter to optional
DFI Deletes fallback intent
Operators for entities
CRE Changes regular expression
DLE Deletes literal from entity
Operators for actions
DA Deletes actions
DPR Deletes a parameter used in a response
SO Swaps outputs
Operators for conversation flows
DCS Deletes conversation step
DCB Deletes conversation bifurcation
Emulation of common errors of chatbot developers
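A few of these operators can be sketched as functions that copy a chatbot definition and apply one small change. The dictionary layout, the prompts and the function signatures below are hypothetical and chosen only for illustration; they are not the representation used by the authors' tool.

```python
import copy

# Hypothetical, simplified chatbot definition.
CHATBOT = {
    "intents": {
        "Book a flight": {
            "fallback": False,
            "parameters": {
                "from": {"entity": "City", "required": True,
                         "prompt": "Where do you fly from?"},
                "to": {"entity": "City", "required": True,
                       "prompt": "Where do you fly to?"},
            },
        },
        "Fallback": {"fallback": True, "parameters": {}},
    },
    "entities": {"City": ["Madrid", "Salerno", "Rome", "Napoli"]},
}


def dfi(bot):
    """DFI: return a mutant without fallback intents."""
    mutant = copy.deepcopy(bot)
    mutant["intents"] = {
        name: intent
        for name, intent in mutant["intents"].items()
        if not intent["fallback"]
    }
    return mutant


def spo(bot, intent, parameter):
    """SPO: return a mutant where a required parameter becomes optional."""
    mutant = copy.deepcopy(bot)
    mutant["intents"][intent]["parameters"][parameter]["required"] = False
    return mutant


def dle(bot, entity, literal):
    """DLE: return a mutant where one literal is deleted from an entity."""
    mutant = copy.deepcopy(bot)
    mutant["entities"][entity].remove(literal)
    return mutant
```

Each function works on a deep copy, so the original definition stays intact and every mutant differs from it by exactly one change.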
Mutation testing for chatbots
[Figure: diagram with the labels «conforms to» (twice), test suites and chatbot impl.]
RQ1: How applicable are the defined mutation operators?
RQ2: How effective are the defined mutation operators?
[Figure: mutation score by mutation operator, showing alive and killed mutants]

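The mutation score reported in these charts can be illustrated with a small sketch. It assumes the usual simplified definition (killed mutants divided by all generated mutants, ignoring equivalent mutants) and models each mutant as a hypothetical callable that maps an utterance to the matched intent; neither the test format nor the two example mutants come from the experiment.

```python
def is_killed(mutant, tests):
    """A mutant is killed when at least one test observes a wrong result."""
    return any(mutant(utterance) != expected for utterance, expected in tests)


def mutation_score(mutants, tests):
    """Fraction of mutants killed by the test suite (simplified definition)."""
    if not mutants:
        raise ValueError("at least one mutant is required")
    killed = sum(1 for mutant in mutants if is_killed(mutant, tests))
    return killed / len(mutants)


# One test: the utterance from the slides and the intent expected for it.
TESTS = [("Tell me what kinds of coffee I can drink here", "Order a coffee")]


def mutant_a(utterance):
    """Hypothetical mutant that now matches the wrong intent."""
    return "Order a wine"


def mutant_b(utterance):
    """Hypothetical mutant whose change the test does not notice."""
    return "Order a coffee"
```

Here `mutation_score([mutant_a, mutant_b], TESTS)` is 0.5: one mutant is killed and the other stays alive, which points to a gap in the test suite.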
RQ3: How effective is the mutation testing process?
[Figure: mutation score by test suite kind (Botium automatic, Botium by hand, Rasa test), showing alive and killed mutants]
RQ4: How efficient is the mutation testing process?
[Figure: bar chart with one percentage value per chatbot]
The mutation testing process of 67% of the chatbots was completed in less than 90 minutes

Conclusions
• Technology-independent approach for MuT of chatbots with
• A catalogue of 19 mutation operators for
• Training phrases, intents, entities, chatbot actions and conversation flows
• Support for test scenarios from Botium and Rasa-test
• Experiment with 15 chatbots and 29 test suites
• Positive results regarding applicability, effectiveness and efficiency
• Room for improvement in 86% of the test suites
• Running times of MuT for chatbots are costly but acceptable
• Less than 90 minutes for 67% of the chatbots
Future work
• Automate the detection of semantically equivalent mutants
• e.g., using confidence decrease heuristics
• Automate the synthesis of tests able to kill the alive mutants
• Adapt our approach to LLM-based agents
Thank you!
Wodel-Test
Tool demo

Mutation Testing for Task-Oriented Chatbots

  • 1. www.uam.es Mutation testing for task-oriented chatbots Pablo Gómez-Abajo, Sara Pérez-Soler, Pablo C. Cañizares, Esther Guerra, Juan de Lara {Pablo.GomezA, Sara.PerezS, Pablo.Cerro, Esther.Guerra, Juan.deLara}@uam.es Modelling & Software Engineering Research Group Universidad Autónoma de Madrid, Spain 18th – 21st June 2024
  • 2. Motivation • Conversational agents or chatbots are increasingly used to access all sort of services using natural language • Like any other software, chatbots need to be tested • Usually by defining test scenarios • However • There is currently a lack of methods to assess the quality of such test scenarios • The result is a high risk of buggy chatbots 2/25
  • 4. What is a task-oriented chatbot? • A task-oriented chatbot is a software application used in natural language and designed to solve a specific task • e.g., booking a ticket, ordering a pizza, setting a medical appointment • Via text or speech recognition • In recent years, the use of chatbots has increased …and many more • Since 2022, we also have open-domain chatbots (ChatGPT, etc.) which engage in conversations on any topic, and which we do not cover in this work 3/25
  • 5. How do chatbots work? 4/25 User NL phrase Chatbot chatbot response
  • 6. How do chatbots work? 5/25 User NL phrase intent1 intentn Chatbot match intent … intenti … chatbot response 3 extract params build response external service 1 4 2 3
  • 7. How do chatbots work? 6/25 1. The user sends a natural language message to the chatbot Utterances Utterances (user says) Hi there! I need to fly from Madrid to Salerno on Wednesday at 12 PM Good bye!
  • 8. How do chatbots work? 7/25 1. The user sends a natural language message to the chatbot 2. The chatbot tries to match the message with an intention
  • 11. How do chatbots work? 8/25 Hi there! Intent matched Intent: Match the user interaction with an intention User says Intent Hi there! Greet
  • 15. Book How do chatbots work? 9/25 I need to fly Intent matched User says Intent I need to fly from Madrid to Salerno on Wednesday at 12 PM Book a flight HOW?! Providing training phrases: a set of examples that users can use to express an intention. Required for matching inputs with intents Intent: Match the user interaction with an intention
  • 16. Book How do chatbots work? 10/25 Hi there Intent matched Training phrases: a set of examples that users can use to express an intention ● Must be provided with the intent Training phrase Intent Hi there! Greet Hello Greet Hi Greet Hey Greet
  • 17. Book How do chatbots work? 11/25 Training phrases: a set of examples that users can use to express an intention ● Must be provided with the intent I need to fly Intent matched Training phrase Intent Airplane ticket from Madrid to Rome tomorrow at 1 pm Book a flight Flight from Madrid to Napoli on 17/06/2024 at 11:30 Book a flight
  • 22. to:Salerno How do chatbots work? 13/25 3. Chatbot extracts information from the message or asks for missing information I need to fly User says Intent I need to fly from Madrid to Salerno on Wednesday at 12 PM Book a flight At this point, the chatbot extracts key information from the input: parameters From:Madrid when:Wed. At 12 PM Time City entities
  • 23. How do chatbots work? 14/25 4. Build the response and send back the response to the user I need to fly ● Responses to the user: ○ text, images ● External service queries ○ External API rest ○ Database, etc. User says Action I need to fly from Madrid to Salerno on Wednesday at 12 PM The price of the ticket is 150$. Provide a card nº and billing name Both, user responses and external services queries: actions
  • 24. Testing chatbots 15/25 User Chatbot Testcase input Testcase output Hi there! Hi! How can I help you? Hi there! Hi! How can I Help you? … complete conversations
  • 25. Testing chatbots 16/25 We use Botium and Rasa-test as the test suites to test the chatbots #me Hi there! #bot What day do you want to come in? #me GREET_UTTERANCES_USER #bot GREET_RESPONSES_USER Single test interaction Combination of multiple tests GREET_UTTERANCES_USER Hi there! Hi Hello Hey GREET_RESPONSES_USER Hi! How can I help you? Hello, what do you need? Greetings! This is the flight ticket assistant Antony, how can i help you? Multiple user utterances Possible responses convo file (conversation step) utterances responses
  • 26. Testing chatbots 17/25 Hi there! I need to fly from … Hi! How can I Help you? The price of the ticket … I lost my baggage Please, provide the flight ticket id … and complex conversations
  • 27. Mutation testing for chatbots 18/25 User says Action What kinds of coffee are available? What kinds of coffee can I order? What can I drink here? Tell me what drinks there are You can take an expresso or an americano User NL phrase Order a coffe intentn Chatbot match intent … Order a wine … chatbot response 3 extract params build response external service User says Action What kinds of wine are available? What kinds of wine can I order? What can I drink here? Tell me what drinks there are You can take an Italian wine or a French wine Intent matched Order a coffee Order a wine Tell me what kinds of coffee I can drink here
  • 28. Mutation testing for chatbots 18/25 User says Action What kinds of coffee are available? What kinds of coffee can I order? What can I drink here? Tell me what drinks there are You can take an expresso or an americano User NL phrase Order a coffe intentn Chatbot match intent … Order a wine … chatbot response 3 extract params build response external service User says Action What kinds of wine are available? What kinds of wine can I order? What can I drink here? Tell me what drinks there are You can take an Italian wine or a French wine 0.512 0.538 0.475 0.474 Tell me what kinds of coffee I can drink here Order a coffee: Keeps the two most different phrases Order a wine Semantic similarity
  • 31. Mutation testing for chatbots 18/25 User says Action What can I drink here? Tell me what drinks there are You can take an expresso or an americano User NL phrase Order a coffe intentn Chatbot match intent … Order a wine … chatbot response 3 extract params build response external service User says Action What kinds of wine are available? What kinds of wine can I order? What can I drink here? Tell me what drinks there are You can take an Italian wine or a French wine Order a wine Intent matched Tell me what kinds of coffee I can drink here Order a coffee Test-suite
  • 34. 19/25 Operators for training phrases DPmax Deletes the most representative phrase of an intent DPmin Deletes the most different phrase of an intent DPWP Deletes training phrases with required parameter DPWL Deletes training phrases with literal K2Pmax Keeps the 2 most representative phrases K2Pmin Keeps the 2 most different phrases MPmax Moves the most representative phrase to the most similar intent MPmin Moves the most different phrase to the most different intent Mutation operators for chatbots Operators for intents DIP Deletes intent parameter DPP Deletes parameter prompt SPO Sets required parameter to optional DFI Deletes fallback intent Operators for entities CRE Changes regular expression DLE Deletes literal from entity Operators for actions DA Deletes actions DPR Deletes a parameter used in a response SO Swaps outputs Operators for conversation flows DCS Deletes conversation step DCB Deletes conversation bifurcation Emulation of common errors of chatbot developers
  • 36. RQ1: How applicable are the defined mutation operators? RQ2: How effective are the defined mutation operators? 21/25 39% 48% 67% 60% 77% 73% 78% 80% 67% 0% 0% 40% 50% 76% 14% 89% 87% 96% Alive Killed Mutation score by mutation operator
  • 38. RQ3: How effective is the mutation testing process? 22/25 Botium automatic Botium by hand Rasa test 45% 94% 20% Alive Killed Mutation score by test suite kind
  • 40. RQ4: How efficient is the mutation testing process? 23/25 0,1% 0,2% 0,3% 1,0% 1,2% 1,4% 1,6% 1,6% 1,7% 2,6% 4,9% 8,4% 12,8% 27,5% 34,7% 0% 5% 10% 15% 20% 25% 30% 35% Covid19_tracer bikeShop e2e-bot Spaceonova personal-bot yassinelamarti Rasa-demo 256644 h4h-chatbot diagrams2ai dusbot legal-alien-chatbot Email-WhatsApp-Integration lankbanfinance Data-mining The mutation testing process of 67% of the chatbots was completed in less than 90 minutes
  • 42. Conclusions • Technology-independent approach for MuT of chatbots with • A catalogue of 19 mutation operators for • Training phrases, intents, entities, chatbot actions and conversation flows • Support for test scenarios from botium and rasa-test • Experiment with 15 chatbots and 29 test suites • Positive results regarding applicability, effectiveness and efficiency • Room for improvement in 86% of the test suites • MuT for chatbots running times are costly but acceptable • Less than 90 minutes for 67% of the chatbots 24/25
  • 43. Future work • Automate the detection of semantically equivalent mutants • e.g., using confidence decrease heuristics • Automate the synthesis of tests able to kill the alive mutants • Adapt our approach to LLM-based agents 25/25
  • 44. www.uam.es Pablo Gómez-Abajo, Sara Pérez-Soler, Pablo C. Cañizares, Esther Guerra, Juan de Lara {Pablo.GomezA, Sara.PerezS, Pablo.Cerro, Esther.Guerra, Juan.deLara}@uam.es Mutation testing for task-oriented chatbots Thank you! ./ Wodel-Test Dataset Tool demo