We've had a lot of trouble continuously testing each iteration of our conversational agents. Our agents are entertainment-focused, so it's harder to evaluate them with a standard framework, and we haven't come across any benchmarks for our use case. Is that something you're also considering?
It all comes down to making your own dataset. Have you looked at LangSmith or Langfuse? They have UIs for building datasets out of production traces. We're taking it one step further and letting you define mock databases, mock APIs, etc.
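To make the "mock APIs" part concrete, here's a rough sketch in plain Python of the idea — all the names (MockWeatherAPI, run_agent_turn, the dataset entries) are made up for illustration, not our actual product API: you pin the agent's tools to canned responses so every eval run is deterministic, then assert on both the output and the recorded tool calls.

```python
# Minimal sketch of mocking an agent's tool for deterministic evals.
# Hypothetical names throughout — not a real library API.
from dataclasses import dataclass, field


@dataclass
class MockWeatherAPI:
    """Stands in for a live API so eval runs are reproducible."""
    canned_responses: dict = field(default_factory=lambda: {
        "berlin": {"temp_c": 21, "condition": "sunny"},
    })
    calls: list = field(default_factory=list)  # record calls for assertions

    def get_weather(self, city: str) -> dict:
        self.calls.append(city)
        return self.canned_responses.get(city.lower(), {"error": "unknown city"})


def run_agent_turn(user_message: str, api: MockWeatherAPI) -> str:
    """Placeholder for your real agent loop; here it just calls the mock tool."""
    weather = api.get_weather("berlin")
    return f"It's {weather['temp_c']}C and {weather['condition']} in Berlin."


# Hand-built dataset of (input, expected substring) pairs — the kind of thing
# you'd otherwise export from production traces via LangSmith/Langfuse.
dataset = [("what's the weather in berlin?", "sunny")]

for user_msg, expected in dataset:
    api = MockWeatherAPI()
    reply = run_agent_turn(user_msg, api)
    assert expected in reply, f"eval failed: {reply!r}"
    assert api.calls == ["berlin"], "agent should have hit the weather API once"
print("all evals passed")
```

Because the mock records its calls, you can assert on behavior (which tools got hit, with what arguments), not just the final text — which helps for entertainment-style agents where there's no single "correct" output.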