
AIE Singapore Day 2 ft. Google DeepMind, OpenClaw, Adaption, Arize, Cloudflare, Robot Company & more

By AI Engineer



Key Concepts

Agentic Harnesses: The environment, toolchains, and guardrails surrounding an AI agent that ensure reliability and success.
Context Management: Strategic handling of data (truncation, compression, and memory) to prevent context window overflow while maintaining task focus.
Operational Design Domains (ODD): Defining the specific constraints and boundaries within which an AI system is tested and expected to perform.
Code Mode: A paradigm where models write code to interact with tools rather than making sequential, costly API calls, reducing latency and token usage.
V8 Isolates: A secure, lightweight, serverless runtime environment used for executing untrusted AI-generated code.
Synthetic Data Generation: Using AI to augment test sets, particularly for edge cases in mission-critical scenarios.
Agentic Telemetry/Observability: Real-time monitoring of agent behavior to identify root causes of failures (e.g., "agent-shaped problems").
Company Brain: A centralized, synthesized source of truth (context graph) that integrates disparate data sources (Slack, Notion, GitHub) for agents to operate reliably.

1. Building and Scaling AI Agents

The conference emphasized that building reliable agents requires moving beyond simple prompts.

Staying on Task: Agents often fail due to "attention problems" rather than hallucination. Planning is the solution: agents must explicitly define a to-do list (states: pending, blocked, in progress, completed) before executing actions.
Context Management: Strategies include "compressing values, not structure" (keeping JSON structure but truncating large strings) and using "large JSON" abstractions where data is stored in memory and referenced by ID.
Debugging: The feedback loop is critical. Teams are moving toward "Agentic IDEs" where agents read telemetry data (traces) to iterate on their own fixes.
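The to-do list described under Staying on Task can be sketched as a small state tracker. This is a minimal illustration, not the speakers' implementation: the `Task` and `Plan` classes, the dependency field, and the rule that a task is blocked while any dependency is unfinished are all assumptions made for the example. Only the four states come from the talk.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from enum import Enum


class Status(Enum):
    PENDING = "pending"
    BLOCKED = "blocked"
    IN_PROGRESS = "in progress"
    COMPLETED = "completed"


@dataclass
class Task:
    name: str
    depends_on: list[str] = field(default_factory=list)  # hypothetical field
    status: Status = Status.PENDING


class Plan:
    """A to-do list the agent writes before acting and updates as it goes."""

    def __init__(self, tasks: list[Task]) -> None:
        self.tasks = {t.name: t for t in tasks}
        self._refresh_blocked()

    def _refresh_blocked(self) -> None:
        # A task that has not started is blocked while any dependency is unfinished.
        for task in self.tasks.values():
            if task.status in (Status.PENDING, Status.BLOCKED):
                ready = all(
                    self.tasks[d].status is Status.COMPLETED for d in task.depends_on
                )
                task.status = Status.PENDING if ready else Status.BLOCKED

    def next_task(self) -> Task | None:
        # Hand out the first pending task and mark it in progress.
        for task in self.tasks.values():
            if task.status is Status.PENDING:
                task.status = Status.IN_PROGRESS
                return task
        return None

    def complete(self, name: str) -> None:
        self.tasks[name].status = Status.COMPLETED
        self._refresh_blocked()
```

In use, the agent would build the `Plan` first, then loop on `next_task()` and `complete()`. Re-rendering the plan into the prompt at each step is one way to keep the remaining work in front of the model.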
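The two context-management strategies mentioned above, "compressing values, not structure" and referencing large JSON by ID, might look roughly like this. The function name `compress_values`, the `BlobStore` class, the ID format, and the 40-character default are hypothetical choices for the sketch, not details from the talk.

```python
import json


def compress_values(value, max_chars=40):
    """Keep keys and nesting intact; shorten only long string leaves."""
    if isinstance(value, dict):
        return {k: compress_values(v, max_chars) for k, v in value.items()}
    if isinstance(value, list):
        return [compress_values(v, max_chars) for v in value]
    if isinstance(value, str) and len(value) > max_chars:
        dropped = len(value) - max_chars
        return f"{value[:max_chars]}... [{dropped} chars truncated]"
    return value


class BlobStore:
    """Holds large payloads outside the prompt; the agent sees only an ID."""

    def __init__(self):
        self._blobs = {}

    def put(self, payload) -> str:
        # Sequential IDs keep the example deterministic.
        blob_id = f"blob-{len(self._blobs) + 1}"
        self._blobs[blob_id] = payload
        return blob_id

    def get(self, blob_id):
        return self._blobs[blob_id]


if __name__ == "__main__":
    tool_result = {"id": 7, "body": "x" * 500, "tags": ["urgent"]}
    store = BlobStore()
    ref = store.put(tool_result)
    # Only the compact view and the reference go into the model's context.
    context_entry = {"ref": ref, "preview": compress_values(tool_result)}
    print(json.dumps(context_entry, indent=2))
```

Because the keys and nesting survive, the model can still see which fields exist and ask for the full payload by ID when it needs one.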

2. Evaluation and Testing Frameworks

A major theme was the shift from "vibe checking" (manual testing) to structured, automated evaluation.

Production Traces as Ground Truth: Using real-world logs to create test cases is more effective than hand-writing golden answers.
Trajectory Tests: Using an LLM as a judge to assess an agent's trajectory step by step, rather than only its final output.
Scaling Evals: The challenge is avoiding "benchmaxing" (gaming benchmarks). The solution is defining specific ODDs and using smaller, deterministic models to verify data quality and identify artifacts in synthetic outputs.
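A trajectory test over a recorded production trace could be structured along these lines. The `Step` shape, the `ALLOWED_TOOLS` set, and `rule_based_judge` are assumptions for illustration. In particular, the judge here is a deterministic stand-in so the example runs offline; in practice it would be replaced by a call to an LLM judge with the same signature.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Step:
    tool: str
    args: dict
    observation: str


# A judge takes the task goal and one step, and returns pass/fail.
Judge = Callable[[str, Step], bool]

ALLOWED_TOOLS = {"search", "reply"}  # hypothetical policy for this example


def rule_based_judge(goal: str, step: Step) -> bool:
    # Stand-in for an LLM judge: pass if the tool is allowed and did not error.
    return step.tool in ALLOWED_TOOLS and not step.observation.lower().startswith("error")


def run_trajectory_test(goal: str, steps: list, judge: Judge) -> dict:
    """Judge every step, not just the final answer."""
    verdicts = [judge(goal, step) for step in steps]
    first_failure: Optional[int] = next(
        (i for i, ok in enumerate(verdicts) if not ok), None
    )
    return {
        "passed": all(verdicts),
        "verdicts": verdicts,
        "first_failure": first_failure,
    }
```

Reporting the index of the first failing step is a design choice: it points at where the trajectory went wrong, which a single end-to-end score would hide.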

3. Tool Calling and "Code Mode"

A significant technical shift discussed was the move from standard tool calling to Code Mode.

The Problem: Sequential tool calls are costly, slow, and bloat the context window.
The Solution: Models are trained on large amounts of code and tend to be better at writing it than at orchestrating long chains of tool calls. By providing TypeScript types as a library, models can write a single script that handles branching, loops, and parallelization, reducing token usage by up to 99.9% according to the talk.
Execution: Untrusted code generated by models should be run in secure, lightweight sandboxes like V8 Isolates (e.g., Cloudflare Workers) to eliminate cold starts and ensure security.
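To make the contrast concrete, here is a rough sketch of what a model-written script might look like. The talk described TypeScript types; Python is used here purely for illustration. The `get_orders` tool, the `FAKE_ORDERS` data, and the script itself are invented for the example and do not correspond to any real API.

```python
from __future__ import annotations

import asyncio

# Made-up data standing in for a real backend.
FAKE_ORDERS = {
    "c1": [
        {"amount": 10, "status": "paid"},
        {"amount": 20, "status": "paid"},
        {"amount": 99, "status": "refunded"},
    ],
    "c2": [{"amount": 5, "status": "paid"}],
    "c3": [{"amount": 7, "status": "refunded"}],
}


async def get_orders(customer_id: str) -> list[dict]:
    """Hypothetical tool binding exposed to the model as a typed function."""
    await asyncio.sleep(0)  # stands in for network latency
    return FAKE_ORDERS.get(customer_id, [])


# Everything below is the kind of script the model would write in code mode.
async def model_written_script(customer_ids: list[str]) -> dict[str, int]:
    # Parallelization: fetch all customers concurrently instead of one call per turn.
    order_lists = await asyncio.gather(*(get_orders(c) for c in customer_ids))
    totals: dict[str, int] = {}
    # Loop and branch locally; raw order data never enters the model's context.
    for cid, orders in zip(customer_ids, order_lists):
        paid = [o["amount"] for o in orders if o["status"] == "paid"]
        if paid:
            totals[cid] = sum(paid)
    return totals


if __name__ == "__main__":
    print(asyncio.run(model_written_script(["c1", "c2", "c3"])))
```

Only the small `totals` dictionary would be returned to the model, whereas sequential tool calling would place every raw order list in the context window. This sketch runs the script in-process for simplicity; as the Execution point above says, generated code in a real system should run inside a sandbox.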

4. The "Company Brain" and Data Infrastructure

The Problem: Documents are "lagging indicators" and often conflict. Connectors provide access, but not understanding.
The Solution: A Company Brain acts as a single source of truth. It ingests messy, real-time data (Slack, meetings, emails), normalizes it, resolves conflicts, and serves it to agents via a file-system-like interface. This allows organizations to learn recursively, where every agentic trace improves future performance.
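The ingest, normalize, resolve, and serve steps can be sketched in a few lines. Everything named here (`Fact`, `CompanyBrain`, the path scheme, the sample values) is hypothetical. The summary does not say how conflicts are resolved, so the rule used below, where the most recently observed value wins, is an assumption chosen only to keep the example small.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class Fact:
    path: str         # file-system-like key, e.g. "projects/atlas/owner"
    value: str
    source: str       # where it was observed, e.g. "slack"
    observed_at: int  # any monotonically increasing timestamp


def normalize(path: str) -> str:
    # Lowercase, trim, and replace spaces so variants map to one key.
    parts = [p.strip().lower().replace(" ", "_") for p in path.split("/")]
    return "/".join(p for p in parts if p)


class CompanyBrain:
    def __init__(self) -> None:
        self._current: dict[str, Fact] = {}
        self._history: dict[str, list[Fact]] = {}

    def ingest(self, fact: Fact) -> None:
        key = normalize(fact.path)
        self._history.setdefault(key, []).append(fact)
        existing = self._current.get(key)
        # Assumed conflict rule: the most recent observation wins.
        if existing is None or fact.observed_at >= existing.observed_at:
            self._current[key] = fact

    def read(self, path: str) -> str | None:
        fact = self._current.get(normalize(path))
        return fact.value if fact else None

    def ls(self, prefix: str) -> list[str]:
        root = normalize(prefix) + "/"
        return sorted(k for k in self._current if k.startswith(root))

    def history(self, path: str) -> list[Fact]:
        return list(self._history.get(normalize(path), []))
```

Keeping the full history next to the resolved value lets an agent, or a human reviewer, see which sources disagreed and why one value was chosen.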

5. Robotics and Physical AI

The Autonomy Gap: Robots perform well in labs but fail in the real world due to "out-of-distribution" scenarios.
Data Refinement: The industry is moving toward "data refineries" that use simulation to generate edge cases (e.g., glare, dynamic obstacles) to test models before real-world deployment.
Sensorimotor Learning: To achieve human-level physical intelligence, robots need to learn from human sensory data, including the missing modality of touch. Open Grab Labs is working on standardizing tactile data collection to bridge this gap.

6. Design and Human-AI Interaction

Design Systems: In the AI era, design systems are crucial for maintaining brand consistency. They provide the "guardrails" that prevent AI from generating "soulless" or off-brand content.
The IKEA Effect: Users are more invested in AI-generated work when they are collaborators rather than passive consumers.
Craft vs. Automation: AI is a "magic pencil": it speeds up execution, but the "craft" (the tiny, intentional decisions) remains a human responsibility.

Synthesis/Conclusion

The conference marked a transition from the "hype" phase of AI to a "builder" phase. The consensus is that scaling model size is hitting a plateau; the future lies in adaptability, efficiency, and deterministic boundaries. Whether through agentic harnesses, code-based tool execution, or company-wide knowledge graphs, the focus has shifted to building reliable, repeatable, and secure systems that treat AI as a tool for human augmentation rather than a replacement for human judgment.
