We've had a lot of trouble continuously testing each iteration of our conversational agents. Our agents are entertainment-focused, so it's harder to evaluate them with a standard framework, and we haven't come across any benchmarks for our use case. Is that something you're also considering?
It all comes down to making your own dataset. Have you looked at LangSmith or Langfuse? They have UIs for building datasets out of production traces. We're taking it one step further and letting you define mock databases, mock APIs, etc.
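To make the "mock APIs" part concrete, here's a rough sketch in plain Python of the idea — all the names (MockWeatherAPI, run_agent_turn, the dataset entries) are made up for illustration, not our actual product API: you pin the agent's tools to canned responses so every eval run is deterministic, then assert on both the output and the recorded tool calls.

```python
# Minimal sketch of mocking an agent's tool for deterministic evals.
# Hypothetical names throughout — not a real library API.
from dataclasses import dataclass, field


@dataclass
class MockWeatherAPI:
    """Stands in for a live API so eval runs are reproducible."""
    canned_responses: dict = field(default_factory=lambda: {
        "berlin": {"temp_c": 21, "condition": "sunny"},
    })
    calls: list = field(default_factory=list)  # record calls for assertions

    def get_weather(self, city: str) -> dict:
        self.calls.append(city)
        return self.canned_responses.get(city.lower(), {"error": "unknown city"})


def run_agent_turn(user_message: str, api: MockWeatherAPI) -> str:
    """Placeholder for your real agent loop; here it just calls the mock tool."""
    weather = api.get_weather("berlin")
    return f"It's {weather['temp_c']}C and {weather['condition']} in Berlin."


# Hand-built dataset of (input, expected substring) pairs — the kind of thing
# you'd otherwise export from production traces via LangSmith/Langfuse.
dataset = [("what's the weather in berlin?", "sunny")]

for user_msg, expected in dataset:
    api = MockWeatherAPI()
    reply = run_agent_turn(user_msg, api)
    assert expected in reply, f"eval failed: {reply!r}"
    assert api.calls == ["berlin"], "agent should have hit the weather API once"
print("all evals passed")
```

Because the mock records its calls, you can assert on behavior (which tools got hit, with what arguments), not just the final text — which helps for entertainment-style agents where there's no single "correct" output.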