How to Perform LLM Evals
Evaluating improvements to LLM pipelines and workflows is key to improving our AI systems. With limited time and resources, the blocker is often overthinking the evals themselves. In this article, I'll walk through a couple of simple evals for benchmarking your improvements, based on work I've done previously.
The Curse of Overthinking
I've learnt that evals really can be as simple as an assert statement. The goal is to run quick "smoke tests" that confirm your pipeline is working as expected, whilst accounting for stochasticity. From there, complexity is earned incrementally by structuring your evals around your most common and important failure modes, as sketched below.
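To make that concrete, here's a minimal sketch of an assert-style smoke test, assuming your pipeline is callable as a single function. The `run_pipeline` name, the query, the substring check, and the pass-rate threshold are all illustrative, not part of any particular framework. Running the same query a handful of times and asserting on the pass rate is one cheap way to account for stochasticity.

```python
from typing import Callable


def smoke_test(
    run_pipeline: Callable[[str], str],  # assumed entry point: query in, text out
    n_trials: int = 5,
    pass_threshold: float = 0.8,
) -> None:
    """Run the same query a few times and assert the pass rate clears a bar,
    so a single unlucky sample doesn't decide the whole eval."""
    query = "What year did Apollo 11 land on the Moon?"
    passes = sum("1969" in run_pipeline(query) for _ in range(n_trials))
    pass_rate = passes / n_trials
    # The assert *is* the eval: it fails loudly if the pipeline regresses.
    assert pass_rate >= pass_threshold, (
        f"Smoke test failed: pass rate {pass_rate:.0%} below {pass_threshold:.0%}"
    )
```

Wiring a test like this into pytest means regressions show up in CI, but even running it ad hoc before shipping a prompt change catches the obvious breakages.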
If these failure modes aren't immediately apparent to you yet, hunt for them first.
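One low-effort way to do that hunt, sketched below under assumptions: dump a sample of recent traces into a spreadsheet, hand-label what went wrong, and count the labels. The file path and the `failure_mode` column name here are hypothetical, stand-ins for whatever your manual review pass produces.

```python
from collections import Counter
import csv


def tally_failure_modes(path: str) -> Counter:
    """Count hand-applied failure-mode labels from a reviewed-traces CSV."""
    counts: Counter = Counter()
    with open(path, newline="") as f:
        for row in csv.DictReader(f):
            # 'failure_mode' is an assumed column name from the manual review
            label = (row.get("failure_mode") or "").strip()
            if label:
                counts[label] += 1
    return counts


# Usage (hypothetical file): tally_failure_modes("reviewed_traces.csv").most_common(5)
```

The most common labels are the failure modes your first evals should target.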