Improving the Notebook Agent with Error Clustering


Charlene Chambliss

Senior Software Engineer

Hex’s Notebook Agent, which we released earlier this year, has been a huge hit. We get a good amount of user feedback and have plenty of observability tooling for individual threads, but we were missing a sense of how frequent certain behaviors were. We knew various issues existed: the agent would sometimes get impatient with long-running cells, or mess up SQL syntax.

Those failures were detectable enough via errors in Datadog, but how the agent adapted to them was less well understood.

I noticed that when I wanted to understand what exactly led to a certain result in an agent thread, I gravitated toward reading the agent’s thinking messages just before that result. That gave me an idea: what if we embedded every errored tool call’s follow-up reasoning, then clustered them to see what patterns emerged?
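The extraction step can be sketched as follows. This is a minimal illustration, not Hex's actual implementation: the message shapes and the `role`, `tool`, and `error` field names are assumptions about what a thread transcript might look like.

```python
def collect_error_followups(messages):
    """Pair each errored tool call with the agent's next thinking message.

    `messages` is an ordered thread transcript; the field names here are
    illustrative, not Hex's actual schema.
    """
    followups = []
    for i, msg in enumerate(messages):
        if msg.get("role") == "tool" and msg.get("error"):
            # Scan forward for the first reasoning message after the error.
            for later in messages[i + 1:]:
                if later.get("role") == "thinking":
                    followups.append({
                        "tool": msg["tool"],
                        "error": msg["error"],
                        "reasoning": later["content"],
                    })
                    break
    return followups
```

The key design choice is to embed the *reasoning that follows* the error rather than the error text itself, since the goal is to cluster how the agent reacts, not what broke.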

That’s exactly what we did. We used OpenAI’s small embedding model, subdivided messages by tool, ran K-Means to form clusters, and had an LLM generate human-readable labels for each cluster. Then we productionized the whole thing as a self-refreshing Hex app.
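The clustering stage might look like the sketch below. It is a self-contained toy, not the production pipeline: a deterministic hash-based embedder stands in for OpenAI's embedding model so the code runs offline, the K-Means loop is a plain implementation of the standard algorithm, and the LLM labeling step is noted as a comment rather than a real API call.

```python
import hashlib
import math
import random

def fake_embed(text, dim=16):
    # Stand-in for a real embedding model (the post used OpenAI's small
    # embedding model): deterministic hash-based unit vectors.
    digest = hashlib.sha256(text.encode()).digest()
    vec = [b / 255.0 for b in digest[:dim]]
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec]

def kmeans(vectors, k, iters=20, seed=0):
    # Plain Lloyd's-algorithm K-Means; returns a cluster index per vector.
    rng = random.Random(seed)
    centroids = rng.sample(vectors, k)
    assignments = [0] * len(vectors)
    for _ in range(iters):
        # Assign each vector to its nearest centroid (squared distance).
        assignments = [
            min(range(k),
                key=lambda c: sum((v - m) ** 2
                                  for v, m in zip(vec, centroids[c])))
            for vec in vectors
        ]
        # Recompute each centroid as the mean of its assigned vectors.
        for c in range(k):
            members = [vec for vec, a in zip(vectors, assignments) if a == c]
            if members:
                centroids[c] = [sum(col) / len(members)
                                for col in zip(*members)]
    return assignments

def cluster_followups(followups, k=2):
    # Subdivide reasoning texts by tool, then cluster within each tool.
    by_tool = {}
    for f in followups:
        by_tool.setdefault(f["tool"], []).append(f["reasoning"])
    clustered = {}
    for tool, texts in by_tool.items():
        if len(texts) < k:
            clustered[tool] = {0: texts}
            continue
        assignments = kmeans([fake_embed(t) for t in texts], k)
        groups = {}
        for text, a in zip(texts, assignments):
            groups.setdefault(a, []).append(text)
        # In the real pipeline, an LLM would read each group's texts here
        # and generate a human-readable cluster label.
        clustered[tool] = groups
    return clustered
```

Subdividing by tool before clustering keeps the clusters interpretable: reasoning after a failed SQL cell and reasoning after a failed Python cell would otherwise dominate the split and hide finer patterns.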

The analysis uncovered some interesting findings:

  • Dataframe vs. Warehouse mode confusion was a huge source of errors. The agent needed better prompts explaining when and how to use Dataframe SQL.
  • Cell timeouts were causing the agent to proliferate simpler versions of the same cell, making a mess. We softened our timeout language and taught the agent to check in with the user before simplifying.
  • Vague error messages caused the agent to completely bail out and switch strategies, rather than trying to recover. Clear, specific errors made a huge difference.

The best part: we built most of this tool using the Notebook Agent itself. It wrote the SQL query to pull the relevant tables and reasoning traces, plus pretty much all the embedding and analysis code. We just came up with the idea and supervised.

Read the full post on the Hex blog →