We are just starting to introduce AI and for now rely on simple evals as unit tests that devs run locally to fine-tune prompts and context.
Your idea of simulating agent interactions is interesting, but I want to know: how are you actually evaluating simulation runs?
hello aszen, I work with draismaa. the way we've developed our simulations is by putting a few agents in a loop to simulate the conversation:
- the agent under test
- a user simulator agent, sending messages as a user would
- a judge agent, overseeing the conversation and stopping the simulation with a verdict once it reaches one
it then takes a description of the simulation scenario and a list of criteria for the judge to evaluate against, and that's enough to run the simulation
this lets us TDD our way into building those agents: before adding something to the prompt, we can add a scenario/criteria first, see it fail, then fix the prompt and watch it play out nicely (or keep vibing a bit further) until the test is green
we put this together in a framework called Scenario:
https://github.com/langwatch/scenario
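to give an idea, a basic test ends up looking roughly like this (a simplified sketch — treat the exact class and function names as approximate and check the repo for the current API; `call_my_agent` is just a placeholder for your own agent):

```python
import pytest
import scenario

# model used by the simulated user and the judge
scenario.configure(default_model="openai/gpt-4.1-mini")


class MyAgent(scenario.AgentAdapter):
    async def call(self, input: scenario.AgentInput) -> scenario.AgentReturnTypes:
        # call_my_agent is a placeholder for your own agent taking the message history
        return call_my_agent(input.messages)


@pytest.mark.asyncio
async def test_refund_flow():
    result = await scenario.run(
        name="refund request",
        description="The user bought shoes last week and wants a refund because they don't fit.",
        agents=[
            MyAgent(),                      # the agent under test
            scenario.UserSimulatorAgent(),  # plays the user
            scenario.JudgeAgent(            # watches the conversation and stops with a verdict
                criteria=[
                    "Agent should ask for the order number",
                    "Agent should not promise a refund before checking the return policy",
                ]
            ),
        ],
    )
    assert result.success
```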
the cool thing is that we also built in a way to control the simulation, so you can go as flexible as you like: just let it play out on autopilot, or script what the user says, mock what the agent replies, and so on to set up a specific situation
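for example, a partially scripted run looks roughly like this (same caveat, simplified sketch):

```python
@pytest.mark.asyncio
async def test_scripted_escalation():
    result = await scenario.run(
        name="failed delivery",
        description="The user is frustrated after two failed delivery attempts.",
        agents=[
            MyAgent(),
            scenario.UserSimulatorAgent(),
            scenario.JudgeAgent(criteria=["Agent should apologize and offer a concrete next step"]),
        ],
        script=[
            scenario.user("this is the second time my delivery failed, I want my money back"),  # fix what the user says
            scenario.agent("I'm really sorry about that, let me check your order."),            # mock the agent's reply to set up the situation
            scenario.proceed(),  # from here on, let it play out on autopilot until the judge decides
        ],
    )
    assert result.success
```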
and then in the middle of these turns we can throw in any additional evaluation, for example checking whether a tool was called; it's really just a simple pytest/vitest assertion, and since it's a function callback any other eval can be called too
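e.g. something along these lines — `lookup_order` is just a made-up tool name, and again the exact helper names are approximate:

```python
def check_order_lookup(state: scenario.ScenarioState):
    # plain pytest-style assertion; any other eval could be called here instead
    assert state.has_tool_call("lookup_order")


@pytest.mark.asyncio
async def test_refund_uses_order_lookup():
    result = await scenario.run(
        name="refund with tool check",
        description="The user wants a refund for an order that never arrived.",
        agents=[
            MyAgent(),
            scenario.UserSimulatorAgent(),
            scenario.JudgeAgent(criteria=["Agent should look up the order before answering"]),
        ],
        script=[
            scenario.user(),      # let the simulator write the opening message
            scenario.agent(),     # let the agent under test respond for real
            check_order_lookup,   # custom assertion thrown in between turns
            scenario.proceed(),   # then back to autopilot until the judge gives a verdict
        ],
    )
    assert result.success
```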