How We Built LangSmith Engine, Our Agent for Improving Agents

Last week we launched LangSmith Engine. Engine is an agent that sits on top of your agent traces, spots recurring issues, and suggests what to do next.

This post goes into the technical details of how we built it: why we built Engine, what inputs and outputs it works with, and the architecture decisions that let it analyze large volumes of traces.

Why we built Engine

LangSmith is the home of the agent improvement loop. The loop has four pillars that drive agent development: build, test, deploy, and monitor.

As the number of agents you deploy grows, the number of traces they generate grows as well. As a result, you spend more and more time sorting through traces and figuring out where your agent went wrong.

Basic tool errors are relatively easy to catch. Overall trajectories are also visible from the trace view. But many agent issues are much harder to detect unless you inspect each trace at a granular level:

the agent loops through the same tool calls
it uses incorrect tool arguments
it executes inefficiently
it misses a tool it should have used
it fails the same kind of request repeatedly across different runs

After running into this problem internally at LangChain, we set out to build LangSmith Engine.

Engine has three jobs:

Find recurring failures in traces.
Turn those failures into actionable issues.
Convert those issues into durable improvements: evaluators, dataset examples, and fixes.

Engine is itself an agent: an orchestrator that uses specialized components to run the improvement loop end to end. It pulls traces, reads code when a repository is connected, groups failures into issues, proposes evaluators and dataset examples, and updates its understanding of your agent over time.
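To make the first job concrete, here is a minimal sketch of what "recurring" could mean in code. It is not Engine's implementation: the `recurring_failures` function, the `(trace_id, signature)` input shape, the signature strings, and the `min_traces` threshold are all assumptions made for illustration.

```python
from collections import defaultdict


def recurring_failures(observations, min_traces=2):
    """Group failure observations by signature and keep the recurring ones.

    `observations` is an iterable of (trace_id, signature) pairs. Both the
    pair shape and the signature strings are hypothetical, chosen only for
    this sketch. A signature counts as recurring when it appears in at
    least `min_traces` distinct traces.
    """
    traces_by_signature = defaultdict(set)
    for trace_id, signature in observations:
        traces_by_signature[signature].add(trace_id)
    return {
        signature: sorted(trace_ids)
        for signature, trace_ids in traces_by_signature.items()
        if len(trace_ids) >= min_traces
    }


# Hypothetical observations: one failure shows up in two different traces,
# another appears only once.
observations = [
    ("trace-1", "repeated_tool_call:search_db"),
    ("trace-2", "repeated_tool_call:search_db"),
    ("trace-2", "repeated_tool_call:search_db"),  # same trace, counted once
    ("trace-3", "bad_tool_arguments:send_email"),
]

recurring = recurring_failures(observations)
```

Counting distinct traces rather than raw occurrences keeps a single noisy run from looking like a pattern across runs.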

What Engine produces: issues

At its core, Engine identifies issues.

An issue is a recurring failure pattern, backed by evidence traces, with proposed follow-up actions. Issues are presented to the user in an Issue Board: a list of problems Engine has found in the tracing project.

An issue consists of:

Name: title of the issue
Description: paragraph description of the issue
Category: one of a predefined set of agent failure categories
Severity: low, medium, or high
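The fields above map naturally onto a small record type. The following is a sketch under stated assumptions, not Engine's actual schema: the `Issue` and `Severity` names, the `evidence_trace_ids` field, and the example category string are hypothetical, since the post does not list the predefined categories.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List


class Severity(Enum):
    # The three levels named in the post.
    LOW = "low"
    MEDIUM = "medium"
    HIGH = "high"


@dataclass
class Issue:
    name: str          # title of the issue
    description: str   # paragraph description of the issue
    category: str      # one of a predefined set; the real set is not listed here
    severity: Severity
    # Assumption: evidence is stored as a list of trace IDs.
    evidence_trace_ids: List[str] = field(default_factory=list)


# Hypothetical example; the category value is invented for illustration.
example = Issue(
    name="Repeated search_db calls",
    description="The agent calls search_db several times in a row.",
    category="tool_call_loop",
    severity=Severity.HIGH,
    evidence_trace_ids=["trace-1", "trace-2"],
)
```

Modeling severity as an enum rather than a free string means an unexpected value fails at construction time instead of surfacing later on the Issue Board.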

Design: a pre-screen that explains itself

We designed Sapien APPROVE to pre-screen, not to "approve." The goal is for staff to get an overview of a proposal within 5 minutes instead of 38, while the system itself makes no final decision.

The output of each pre-screen is a one-page summary document containing:

A list of documents submitted and missing, with references to the relevant clauses of the regulations
Budget: a check of format and of consistency with the proposed scope
Flagged sections: passages that may conflict with the grant conditions, with reasons and references
A confidence level for each flag: high, medium, or low

Four pilots that used this framework

Between 2024 and 2025 we used this framework to design 4 pilot projects in organizations of different kinds: a university, a hospital, a government office, and a private company. All of them began with a one-day workshop to sort work into the three tiers.

4 pilot projects

12 weeks average deploy time

68% of work moved between tiers in the first 90 days

python

# Compact per-message summary of one trace, written as a valid Python list of dicts.
trace_messages = [
    {"role": "human", "chars": 142},
    {"role": "ai", "latency_ms": 1820, "chars": 89},
    {"role": "tool", "tool_name": "search_db", "latency_ms": 340, "chars": 2100},
    {"role": "tool", "tool_name": "search_db", "latency_ms": 312, "chars": 1980},
    {"role": "tool", "tool_name": "search_db", "latency_ms": 298, "chars": 2040},
    {"role": "ai", "latency_ms": 2100, "chars": 210},
]
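The three consecutive search_db calls in the summary above resemble the "loops through the same tool calls" pattern described earlier. As a rough sketch, assuming each message is a dict shaped like those rows, a check for such runs could look like the following. The `repeated_tool_runs` function and the `min_repeats` threshold are hypothetical, and because the summary carries no tool arguments, a flagged run is only a candidate loop, not proof of one.

```python
from itertools import groupby

# The message summary from above, expressed as Python dicts.
messages = [
    {"role": "human", "chars": 142},
    {"role": "ai", "latency_ms": 1820, "chars": 89},
    {"role": "tool", "tool_name": "search_db", "latency_ms": 340, "chars": 2100},
    {"role": "tool", "tool_name": "search_db", "latency_ms": 312, "chars": 1980},
    {"role": "tool", "tool_name": "search_db", "latency_ms": 298, "chars": 2040},
    {"role": "ai", "latency_ms": 2100, "chars": 210},
]


def repeated_tool_runs(messages, min_repeats=3):
    """Return (tool_name, run_length) for each run of consecutive calls
    to the same tool that is at least `min_repeats` long."""
    runs = []
    # groupby clusters adjacent messages sharing a key; non-tool messages
    # have no "tool_name", so their key is None and they are skipped.
    for tool_name, group in groupby(messages, key=lambda m: m.get("tool_name")):
        run_length = len(list(group))
        if tool_name is not None and run_length >= min_repeats:
            runs.append((tool_name, run_length))
    return runs


flagged = repeated_tool_runs(messages)
```

A heuristic like this only catches back-to-back repeats; calls interleaved with model turns would need a wider window.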
