
How to Perform LLM Evals

Evaluating improvements to LLM pipelines and workflows is key to improving our AI systems. With limited time and resources, the most common blocker is overthinking them. In this article, I'll walk through a couple of simple evals for benchmarking your improvements, based on work I've done previously.

The Curse of Overthinking

I've learnt that evals really can be as simple as an assert statement. The goal is to run quick "smoke tests" that confirm your pipeline is working as expected, whilst accounting for stochasticity. From that point, complexity is earned incrementally by structuring your evals around your most common and important failure modes.
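As a minimal sketch of what "as simple as an assert statement" means in practice: the test below calls a hypothetical `answer_support_question` wrapper around your own pipeline a few times, checks cheap deterministic properties of the output, and asserts on a pass rate rather than perfection to absorb stochasticity.

```python
# Minimal assert-style smoke test for an LLM pipeline.
# `answer_support_question` is a hypothetical wrapper around your own pipeline entry point.
from my_pipeline import answer_support_question  # assumption: replace with your own module

def test_refund_query_mentions_policy():
    question = "How do I get a refund for a damaged item?"
    n_runs = 5
    passes = 0
    for _ in range(n_runs):
        answer = answer_support_question(question)
        # Cheap, deterministic checks: non-empty, on-topic, no leaked system prompt.
        ok = (
            answer.strip() != ""
            and "refund" in answer.lower()
            and "system prompt" not in answer.lower()
        )
        passes += int(ok)
    # Require a pass rate instead of 5/5, since outputs vary run to run.
    assert passes / n_runs >= 0.8, f"Only {passes}/{n_runs} runs passed the smoke test"
```

Run it with pytest like any other test; once it's green, you can layer on checks targeted at the failure modes you actually see.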

If these failure modes aren't immediately apparent to you yet, then hunt for them first.

Working with Voice AI is Easy, Try This

Getting started with Voice AI can be easy. It's important to start simple so you progressively build an understanding of what happens under the hood; doing so lays a solid foundation for more complex applications. Starting simple and adding complexity slowly also helps us appreciate the delta between demo-land and production deployments.

Let's explore the following:

  1. Simple Speech-to-Text (STT)
  2. STT + Structured Outputs (e.g. JSON, Pydantic)
  3. Evals for Audio

For starters, let's keep it simple with a basic speech-to-text provider.
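As a rough sketch of step 1, here's what a single transcription call looks like using the OpenAI Python SDK's hosted Whisper endpoint (the provider choice and the audio filename are placeholders; any STT provider follows the same shape: send audio, get text back).

```python
# Minimal speech-to-text sketch using the OpenAI Python SDK (assumed provider choice).
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

with open("meeting_clip.wav", "rb") as audio_file:  # placeholder audio file
    transcript = client.audio.transcriptions.create(
        model="whisper-1",  # hosted Whisper model
        file=audio_file,
    )

print(transcript.text)  # the raw transcription as plain text
```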

My Wild Ride On A Floating Factory

Growing up in Singapore, I knew nothing about offshore oil and gas. My dad used to work in maritime industries, and I'd zone out every time he talked about shipping and charts. Little did I know I'd one day find myself on a massive floating vessel that's basically a city in the middle of the ocean.

The one I went on recently was the FPSO One Guyana, the largest of its kind globally. As you can see here, even the word 'massive' doesn't do it justice when you compare its size to the smaller boats beside it.

One Guyana FPSO