Skill issue: Lessons from skilling up coding agents to use Langfuse - Marc Klingen, Clickhouse

By AI Engineer


Key Concepts

  • Langfuse: An open-source observability and evaluation infrastructure for LLM applications.
  • Agent Skills: Formalized, modular shortcuts that allow AI agents to perform specific tasks (e.g., setting up observability) reliably, replacing complex, manual workflows.
  • Tracing: The process of recording the execution path of an agent to identify runtime issues and discover new use cases.
  • Target Function: The objective or metric used to optimize an agent's behavior; critical for ensuring the agent achieves the desired outcome without taking shortcuts.
  • RAG (Retrieval-Augmented Generation): Used here to provide agents with up-to-date documentation and context to prevent hallucinations.
  • Evaluation (Evals): The process of measuring the performance and reliability of LLM outputs.

1. The Evolution of Agentic Workflows

Marc argues that the industry has moved past the "workflow vs. autonomous agent" debate. Instead, the focus is on using Skills as a bridge between the two. Historically, developers built rigid, reliable workflows for specific tasks (e.g., password resets), but these struggle with requests that span multiple domains. Modern agents can progressively gather context to solve such problems, which previously would have required multiple, disconnected workflows.

2. The "Skill" Framework for Langfuse

Langfuse developed a "Skill" to help users integrate Langfuse into their own projects. The goal is to provide every user with an "expert" that understands best practices for observability and evaluation.

  • Methodology: The skill uses a skill.md file to define style and behavior (e.g., "ask follow-up questions before acting").
  • Tooling: It leverages the Langfuse CLI, allowing agents to perform actions that previously required human interaction with the UI.
  • Documentation Access: Instead of relying on outdated pre-training data, the skill uses a search endpoint that queries live documentation, ensuring the agent uses the most current API methods.
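
The documentation-access idea can be illustrated with a toy search tool. This is only a sketch under stated assumptions: the talk summary does not describe the real Langfuse search endpoint, so the `DocsSearch` class, the in-memory `pages` mapping, and the naive keyword scoring are all hypothetical stand-ins. The sketch also records queries that return nothing, since those are the ones most likely to point at missing documentation.

```python
from dataclasses import dataclass, field


@dataclass
class DocsSearch:
    """Toy stand-in for a live documentation search endpoint (hypothetical)."""

    pages: dict[str, str]  # path -> page text, assumed to mirror the live docs
    misses: list[str] = field(default_factory=list)  # queries with no results

    def search(self, query: str, limit: int = 3) -> list[str]:
        terms = query.lower().split()
        # Score each page by how many query terms appear in its text.
        scored = []
        for path, text in self.pages.items():
            body = text.lower()
            hits = sum(1 for term in terms if term in body)
            if hits:
                scored.append((hits, path))
        if not scored:
            # Keep unanswered queries: they hint at gaps in the docs.
            self.misses.append(query)
            return []
        # Highest score first; ties are broken alphabetically by path.
        scored.sort(key=lambda item: (-item[0], item[1]))
        return [path for _, path in scored[:limit]]


# Hypothetical example pages, not actual Langfuse documentation content.
docs = DocsSearch(pages={
    "/docs/tracing": "Instrument your application to record traces and spans.",
    "/docs/prompts": "Create, version, and fetch managed prompts.",
})
```

An agent tool would call `docs.search(...)` instead of relying on what the model remembers from pre-training, and the team could review `docs.misses` periodically.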

3. Key Learnings from Scaling Agent Skills

Marc shared six primary takeaways from building and deploying these skills:

  1. Trace Analysis: Looking at execution traces provides roughly 80% of the detail needed to understand how an agent is performing and where it is failing.
  2. Production Signals: Exposing a search endpoint for documentation allows the team to track what users (or agents) are searching for, identifying gaps in documentation.
  3. Navigation Guidance: Agents often struggle to navigate large documentation sets. Providing a "sitemap" or specific search endpoints prevents the agent from looping through irrelevant pages.
  4. Basic Evals are Essential: Even a simple evaluation setup (e.g., checking for the presence of specific spans in a trace) is significantly better than having no evaluation at all.
  5. Reference Dynamic Content: Avoid duplicating documentation within the skill itself, as it quickly becomes stale. Always point to the live reference.
  6. Target Functions Matter: The objective given to an agent dictates its success. If the target function is "minimize turns," the agent may skip critical steps like fetching updated documentation.
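
Takeaways 4 and 6 can be sketched together. The following is a minimal illustration, assuming a simplified `Trace` type that holds only span names and a turn count. This is not Langfuse's actual trace model, and the span names `fetch_docs` and `ask_follow_up` are hypothetical. It shows a basic span-presence eval and how a "minimize turns" target function rewards the shortcut unless the eval is built into the objective.

```python
from dataclasses import dataclass


@dataclass
class Trace:
    """Simplified trace: ordered span names plus the number of agent turns."""

    span_names: list[str]
    turns: int


# Hypothetical span names for steps the skill should always perform.
REQUIRED_SPANS = {"fetch_docs", "ask_follow_up"}


def missing_spans(trace: Trace, required: set[str] = REQUIRED_SPANS) -> set[str]:
    """Basic eval: which required steps never show up in the trace?"""
    return set(required) - set(trace.span_names)


def turns_only_score(trace: Trace) -> float:
    """Naive target function: fewer turns is always better."""
    return -float(trace.turns)


def guarded_score(trace: Trace) -> float:
    """Reward brevity only when every required step is present."""
    if missing_spans(trace):
        return float("-inf")
    return -float(trace.turns)


thorough = Trace(["ask_follow_up", "fetch_docs", "edit_code"], turns=6)
shortcut = Trace(["edit_code"], turns=2)
```

Under `turns_only_score` the `shortcut` run wins even though it never fetched documentation. Under `guarded_score` the `thorough` run wins, which is the behavior the skill is meant to encourage.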

4. Notable Quotes

  • "In the end, what do you need when agents do all of this? You only need the infrastructure piece if agents can then customize for different workflows."
  • "Looking at traces still gets you to 80% of the detail."
  • "The target function really matters... if you basically ask to minimize the number of turns, then the agent... just took out all the notes that we had to fetch documentation."

5. Real-World Application: Prompt Management

The team experimented with using an agent to migrate prompts from local git repositories into Langfuse’s managed prompt system. By defining a target function and iterating, they successfully automated a complex migration process that would have been time-consuming for a small team to handle manually.
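
The mechanical part of such a migration might look like the sketch below. It rests on several assumptions that the summary does not confirm: prompts are stored as `.txt` files under one directory, the prompt name is derived from the relative file path, and `upload` is a placeholder callable standing in for whatever SDK or CLI call actually creates a prompt in Langfuse.

```python
from pathlib import Path
from typing import Callable


def collect_prompts(root: Path, suffix: str = ".txt") -> dict[str, str]:
    """Map a prompt name (relative path without suffix) to its text."""
    prompts = {}
    for path in sorted(root.rglob(f"*{suffix}")):
        name = path.relative_to(root).with_suffix("").as_posix()
        prompts[name] = path.read_text(encoding="utf-8")
    return prompts


def migrate(root: Path, upload: Callable[[str, str], None]) -> list[str]:
    """Upload every local prompt and return the names that were migrated."""
    migrated = []
    for name, text in collect_prompts(root).items():
        upload(name, text)  # placeholder for the real prompt-creation call
        migrated.append(name)
    return migrated
```

Injecting `upload` keeps the file-discovery logic testable without network access, and the returned list of names gives an agent (or a reviewer) something concrete to check against the repository contents.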

6. Synthesis and Conclusion

The transition from manual documentation reading to "agent-led" implementation is a major shift in developer experience. Langfuse’s approach demonstrates that for infrastructure tools, the most effective strategy is to provide an unopinionated, API-first foundation that agents can interact with via specialized skills.

Main Takeaways:

  • Agents as the Interface: Agents should be treated as the primary interface for interacting with complex infrastructure.
  • Avoid Stale Context: Always prioritize live retrieval (RAG) over pre-training data for technical documentation.
  • Iterative Optimization: Use traces to identify where agents fail and refine the "target function" to ensure they follow best practices rather than just taking the fastest path to completion.
