Generative Benchmarking
Representative Evals for Retrieval
This was the first report I published at Chroma, presenting a method for generating representative evals from custom data. The focus is on retrieval specifically, so each eval consists of a corpus and a set of query-document pairs.
The main motivation behind this was that public benchmarks are not representative of real-world tasks and have mostly been memorized by models at this point. The idea, then, is to generate tasks from the same corpus that will be retrieved over in production, in a way that captures the style of queries that would actually be asked. We do this by:
- Filtering the corpus for chunks that would realistically be queried (i.e. filtering out random news content that a developer using a technical support bot would never ask about).
- Generating queries with few-shot examples to capture the style of realistic user queries (a sketch of both steps follows this list).
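
A minimal sketch of those two steps, assuming an OpenAI-style chat API. The prompts, model name, and helper functions here are illustrative, not the report’s exact implementation:

```python
# Sketch of the two-step pipeline: LLM-based chunk filtering,
# then few-shot query generation. Prompts and model are placeholders.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set
MODEL = "gpt-4o-mini"  # illustrative model choice

FILTER_PROMPT = (
    "You judge whether a document chunk contains information that users of a "
    "technical support bot would realistically ask about. Answer YES or NO.\n\n"
    "Chunk:\n{chunk}"
)

GENERATE_PROMPT = (
    "Write one query a real user might type when looking for the information "
    "in the chunk below. Match the style of these example queries:\n"
    "{examples}\n\nChunk:\n{chunk}\n\nQuery:"
)

def llm(prompt: str) -> str:
    resp = client.chat.completions.create(
        model=MODEL, messages=[{"role": "user", "content": prompt}]
    )
    return resp.choices[0].message.content.strip()

def build_eval(chunks: list[str], example_queries: list[str]) -> list[tuple[str, str]]:
    """Return (query, chunk) pairs for chunks judged realistically queryable."""
    examples = "\n".join(f"- {q}" for q in example_queries)
    pairs = []
    for chunk in chunks:
        if llm(FILTER_PROMPT.format(chunk=chunk)).upper().startswith("YES"):
            query = llm(GENERATE_PROMPT.format(examples=examples, chunk=chunk))
            pairs.append((query, chunk))
    return pairs
```

The few-shot examples come from a small sample of real user queries, which is what anchors the generated queries to a realistic style.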
The idea of generative benchmarking is not new, so we’re not claiming a novel method. The main focus here is representativeness: how closely do results on our generated queries track results on real production queries, compared to public benchmarks? In collaboration with Weights and Biases, we used their corpus and real production queries as the ground truth.
The experiments and results are detailed in the full report, but the TL;DR is that retrieval results on our generated queries tracked results on the real production queries much more closely than results on public benchmarks did. One reason is that we generate from the same corpus, whereas public benchmarks like MedicalQA don’t necessarily capture the domain of Weights and Biases’ technical documentation. We also generate queries in a realistic style: they are more ambiguous and contain fewer exact keyword matches. For instance, users are more likely to ask “artifact versioning not working” than “What is the purpose of artifact versioning in Weights and Biases?”.
Some personal thoughts
One of the key efforts in this report was manually labeling data and aligning our LLM judges. I often see evaluation work use LLM judges without any human alignment, under the assumption that the judge is well-aligned out of the box. On my first iteration of the document-filtering judge, alignment with my manual labels was only 46%, which is very low. Alignment tends to be especially poor for subjective judgments like “does the document contain information that developers using a technical support bot would realistically query?”, and less of a problem for more straightforward output checks like “does [model output] match [gold answer]?”. Regardless of the specific use case, I think this deserves more attention as LLM judges continue to be used more widely in evals.
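
For concreteness, here is a small sketch of how that alignment number can be computed on a hand-labeled sample, assuming the reported figure is raw agreement. Cohen’s kappa, a chance-corrected companion metric, is my addition here, not from the report:

```python
# Sketch: measuring judge-human alignment on a hand-labeled sample.
# Labels are toy data for illustration.
from sklearn.metrics import cohen_kappa_score

human = [1, 0, 1, 1, 0, 1, 0, 0]   # manual labels: 1 = keep chunk, 0 = filter out
judge = [1, 1, 1, 0, 0, 1, 1, 0]   # LLM judge labels on the same chunks

agreement = sum(h == j for h, j in zip(human, judge)) / len(human)
kappa = cohen_kappa_score(human, judge)
print(f"raw agreement: {agreement:.0%}, Cohen's kappa: {kappa:.2f}")
```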
What I think is most valuable here is the emphasis on quantifying representativeness. Our evals are only as valuable as how representative they are of real use cases, and that feels like one of the major gaps in evals today. I’d love to see more work in this direction beyond retrieval.
Another interesting direction is evolving evals. In real use cases, you have production traffic, which is valuable data in itself. We could use it to improve our golden dataset: for instance, by embedding and clustering both golden queries and production queries together, we can identify topical gaps in the golden set and reshape it to match the topical distribution of production queries. The same approach could be applied to the corpus itself: if users are consistently asking about a topic that isn’t covered, you should probably add that in.
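
A sketch of that clustering idea, with sentence-transformers and k-means as illustrative choices; the embedding model, cluster count, and function name are assumptions, not from the report:

```python
# Sketch: comparing the topical distribution of golden vs. production queries.
# Clusters heavy in production traffic but light in the golden set are gaps.
import numpy as np
from sklearn.cluster import KMeans
from sentence_transformers import SentenceTransformer

def topic_gaps(golden: list[str], production: list[str], k: int = 10):
    model = SentenceTransformer("all-MiniLM-L6-v2")  # illustrative embedding model
    emb = model.encode(golden + production)
    labels = KMeans(n_clusters=k, n_init="auto", random_state=0).fit_predict(emb)

    # Per-cluster share of each query set.
    g_dist = np.bincount(labels[: len(golden)], minlength=k) / len(golden)
    p_dist = np.bincount(labels[len(golden):], minlength=k) / len(production)

    gap = p_dist - g_dist  # positive = underrepresented in the golden set
    return np.argsort(gap)[::-1], gap
```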
Featured in
- TWIML podcast
- Jason Liu’s RAG course