Skip to content

LLMs

Using Language Models to make sense of Chat Data without compromising user privacy

If you're interested in the code for this article, you can find it here where I've implemented a simplified version of CLIO without the PII classifier and most of the original prompts ( to some degree ).

Analysing chat data at scale is a challenging task for 3 main reasons

  1. Privacy - Users don't want their data to be shared with others and we need to respect that. This makes it challenging to do analysis on user data that's specific
  2. Explainability - Unsupervised clustering methods are sometimes difficult to interpret because we don't have a good way to understand what the clusters mean.
  3. Scale - We need to be able to process large amounts of data efficiently.

An ideal solution allows us to understand broad general trends and patterns in user behaviour while preserving user privacy. In this article, we'll explore an approach that addresses this challenge - Claude Language Insights and Observability ( CLIO ) which was recently discussed in a research paper released by Anthropic.

We'll do so in 3 steps

  1. We'll start by understanding on a high level how CLIO works
  2. We'll then implement a simplified version of CLIO in Python
  3. We'll then discuss some of the clusters that we generated and some of the limitations of such an approach

Let's walk through these concepts in detail.

Getting Started with Language Models in 2025

After a year of building AI applications and contributing to projects like Instructor, I've found that getting started with language models is simpler than most people think. You don't need a deep learning background or months of preparation - just a practical approach to learning and building.

Here are three effective ways to get started (and you can pursue all of them at once):

  1. Daily Usage: Put Claude, ChatGPT, or other LLMs to work in your daily tasks. Use them for debugging, code reviews, planning - anything. This gives you immediate value while building intuition for what these models can and can't do well.

  2. Focusing on Implementation: Start with Instructor and basic APIs. Build something simple that solves a real problem, even if it's just a classifier or text analyzer. The goal is getting hands-on experience with development patterns that actually work in production.

  3. Understand the Tech: Write basic evaluations for your specific use cases. Generate synthetic data to test edge cases. Read papers that explain the behaviors you're seeing in practice. This deeper understanding makes you better at both using and building with these tools.

You should and will be able to do all of these at once. Remember that the goal isn't expertise but to discover which aspect of the space you're most interested in.

There's a tremendous amount of possible directions to work on - dataset curation, model architecture, hardware optimisation, etc and other exiciting directions such as Post Transformer Architectures and Multimodal Models that are happening all at the same time.

Simplify your LLM Evals

Although many tasks require subjective evaluations, I've found that starting with simple binary metrics can get you surprisingly far. In this article, I'll share a recent case study of extracting questions from transcripts. We'll walk through a practical process for converting subjective evaluations to measurable metrics:

  1. Using synthetic data for rapid iteration - instead of waiting minutes per test, we'll see how to iterate in seconds
  2. Converting subjective judgments to binary choices - transforming "is this a good question?" into "did we find the right transcript chunks?"
  3. Iteratively improving prompts with fast feedback - using clear metrics to systematically enhance performance

By the end of this article, you'll have concrete techniques for making subjective tasks measurable and iterating quickly on LLM applications.

Is there any value in a wrapper?

I'm writing this as I take a train from Kaohsiung to Taipei, contemplating a question that frequently surfaces in AI discussions: Could anyone clone an LLM application if they had access to all its prompts?

In this article, I'll challenge this perspective by examining how the true value of LLM applications extends far beyond just a simple set of prompts.

We'll explore three critical areas that create sustainable competitive advantages

  1. Robust infrastructure
  2. Thoughtful user experience design
  3. Compounding value of user-generated data.

What it means to look at your data

People always talk about looking at your data but what does it actually mean in practice?

In this post, I'll walk you through a short example. After examining failure patterns, we discovered our query understanding was aggressively filtering out relevant items. We improved the recall of a filtering system that I was working on from 0.86 to 1 by working on prompting the model to be more flexible with its filters.

There really are two things that make debugging these issues much easier

  1. A clear objective metric to optimise for - in this case, I was looking at recall ( whether or not the relevant item was present in the top k results )
  2. A easy way to look at the data - I like using braintrust but you can use whatever you want.

Ultimately debugging these systems is all about asking intelligent questions and systematically hunting for failure modes. By the end of the post, you'll have a better idea of how to think about data debugging as an iterative process.

Taming Your LLM Application

This is an article that sums up a talk I'm giving in Kaoshiung at the Taiwan Hackerhouse Meetup on Dec 9th. If you're interested in attending, you can sign up here

When building LLM applications, teams often jump straight to complex evaluations - using tools like RAGAS or another LLM as a judge. While these sophisticated approaches have their place, I've found that starting with simple, measurable metrics leads to more reliable systems that improve steadily over time.

Five levels of LLM Applications

I think there are five levels that teams seem to progress through as they build more reliable language model applications.

  1. Structured Outputs - Move from raw text to validated data structures
  2. Prioritizing Iteration - Using cheap metrics like recall/mrr to ensure you're nailing down the basics
  3. Fuzzing - Using synthetic data to systmetically test for edge cases
  4. Segmentation - Understanding the weak points of your model
  5. LLM Judges - Using LLM as a judge to evaluate subjective aspects

Let's explore each level in more detail and see how they fit into a progression. We'll use instructor in these examples since that's what I'm most familiar with, but the concepts can be applied to other tools as well.

Are your eval improvements just pure chance?

A step that's often missed when benchmarking retrieval methods is determining if any performance difference is due to random chance. Without this crucial step, you might invest in a new system upgrade that's outperformed your old one by pure chance.

If you're comparing retrieval methods, you'll often want to know if the improvements you're seeing are due to random chance.

In this article, we'll use a simple case study to demonstrate how to answer this question, introducing a new library called indomee (a playful nod to both "in-domain" evaluation and the beloved instant noodle brand in Southeast Asia) that makes this analysis significantly easier.

We'll do so in three steps:

  1. First we'll simulate some fake data using numpy
  2. Then we'll demonstrate how to do bootstrapping using nothing but numpy before visualising the results with matplotlib
  3. Finally we'll perform a paired t-test to determine if the differences are statistically significant

You're probably using LLMs wrongly

In this post, I'll share three simple strategies that have transformed how I work with language models.

  1. Treat working with language models as an iterative process where you improve the quality of your outputs over time.
  2. Collect good examples that you can use as references for future prompts.
  3. Regularly review your prompts and examples to understand what works and what doesn't.

Most complaints I hear about language models are about hallucinations or bad outputs. These aren't issues with the technology itself. It's usually because we're not using these models the right way. Think about the last time you hired someone new. You didn't expect them to nail everything perfectly on day one.

The same principle applies to language models.

Synthetic Data is no Free Lunch

I spent some time playing with a new framework called Dria recently that uses LLMs to generate synthetic data. I couldn't get it to work but I did spend some time digging through their source code, and I thought I'd share some of my thoughts on the topic.

Over the past few weeks, I've generated a few million tokens of synthetic data for some projects. I'm still figuring out the best way to do it but I think it's definitely taught me that it's no free lunch. You do need to spend some time thinking about how to generate the data that you want.

The Premise

An example

When I first started generating synthetic data for question-answering systems, I thought it would be straightforward - all I had to do was to ask a language model to generate a few thousand questions that a user might ask.

You're probably not doing experiments right

I recently started working as a research engineer and it's been a significant mindset shift in how I approach my work. it's tricky to run experiments with LLMs efficiently and accurately and after months of trial and error, I've found that there are three key factors that make the biggest difference

  1. Being clear about what you're varying
  2. Investing time to build out some infrastructure
  3. Doing some simple sensitivity analysis

Let's see how each of these can make a difference in your experimental workflow.