Breaking Data Team Silos Is the Key to Getting AI to Production

Key Concepts

Observability: Monitoring and understanding the internal state of systems (applications, infrastructure, models) to identify and resolve issues. It goes beyond traditional monitoring by focusing on why something happened, not just that it happened.
AI Agents: Autonomous entities powered by Large Language Models (LLMs) capable of performing tasks and making decisions.
OpenTelemetry: An open-source observability framework for collecting and exporting telemetry data (metrics, logs, traces).
LLMs (Large Language Models): Powerful AI models capable of understanding and generating human language.
Siloed Teams: Lack of communication and collaboration between data science/ML teams and DevOps/operations teams.
Hallucinations/Drift (in AI models): Hallucinations refer to the model generating incorrect or nonsensical information. Drift refers to the degradation of model performance over time due to changes in input data.
SLOs (Service Level Objectives): Targets for service performance (e.g., latency, error rate).
KPIs (Key Performance Indicators): Measurable values that demonstrate how effectively a company is achieving key business objectives.
UEM (User Experience Management): Monitoring and optimizing the end-user experience with applications.
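
To make the drift definition above concrete, here is a minimal sketch of one common way to quantify a shift in input data: the Population Stability Index (PSI). This technique was not named in the discussion; it is swapped in purely for illustration. The function name, bin count, and smoothing constant are assumptions, and the often-quoted thresholds (roughly 0.1 for "watch" and 0.25 for "investigate") are rules of thumb rather than standards.

```python
import math

def psi(expected, actual, bins=10, eps=1e-6):
    """Population Stability Index between a baseline sample and a recent sample.

    Bin edges are derived from the baseline (expected) values; recent values
    outside that range are clamped into the first or last bin.
    """
    lo, hi = min(expected), max(expected)
    width = (hi - lo) / bins or 1.0  # fall back to 1.0 if the baseline is constant

    def shares(values):
        counts = [0] * bins
        for v in values:
            idx = int((v - lo) / width)
            counts[min(max(idx, 0), bins - 1)] += 1
        # eps keeps empty bins from producing log(0) or division by zero
        return [max(c / len(values), eps) for c in counts]

    e, a = shares(expected), shares(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))

# Hypothetical feature: prompt length in tokens, last month vs. this week.
baseline = list(range(100))
recent = [v + 50 for v in baseline]
drift_score = psi(baseline, recent)
```

A score of zero means the two samples fall into the baseline bins in identical proportions; larger values indicate a larger shift in the inputs the model is seeing.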

From Pilot to Production: Navigating AI Observability Challenges

This discussion centers on the challenges of moving AI applications, particularly those built on AI agents and LLMs, from experimental pilots to robust production environments. The core theme is that observability must be a foundational element of AI deployment rather than an afterthought.

The Current Landscape & Roadblocks

The speakers, Thanos and Martin from IBM, highlight that AI implementation is still in its early stages for most organizations. A significant roadblock is the historical separation between data science/ML teams and DevOps/operations teams. Data teams traditionally operate in silos, focusing on model development and evaluation, while operations teams manage infrastructure and application performance. This division creates friction when AI applications move to production: operations teams lack visibility into the model's internal workings, and their tooling has to support newer platforms such as Amazon Bedrock and Amazon SageMaker.

A key issue is the difficulty of measuring what is happening inside the AI model itself. Traditional metrics like latency are insufficient; assessing hallucinations or model drift requires more sophisticated techniques, often using other models to analyze the AI's output. In practice, observability tends to be added late, which makes it harder to demonstrate its value when the stakes are high.
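
The idea of using a second model to check the first one's output can be sketched as follows. This is an illustration under stated assumptions, not the speakers' method: `evaluate`, `Verdict`, and the threshold are hypothetical names and values, and `overlap_judge` is a deliberately crude stand-in (word overlap with the source context) for where a real judge model would be called.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Verdict:
    score: float   # 0.0 (unsupported by the context) to 1.0 (fully supported)
    flagged: bool  # True when the score falls below the threshold

def _words(text: str) -> set:
    """Lowercased words with surrounding punctuation stripped."""
    cleaned = {w.strip(".,!?;:").lower() for w in text.split()}
    cleaned.discard("")
    return cleaned

def overlap_judge(answer: str, context: str) -> float:
    """Stand-in judge: fraction of answer words that also appear in the context.

    In a real system this function would call a separate evaluation model.
    """
    answer_words = _words(answer)
    if not answer_words:
        return 0.0
    return len(answer_words & _words(context)) / len(answer_words)

def evaluate(answer: str, context: str,
             judge: Callable[[str, str], float] = overlap_judge,
             threshold: float = 0.6) -> Verdict:
    """Score an answer against its source context and flag likely hallucinations."""
    score = judge(answer, context)
    return Verdict(score=score, flagged=score < threshold)

context = "The order shipped on Monday and arrives Friday."
grounded = evaluate("The order shipped on Monday.", context)
ungrounded = evaluate("Your refund was approved yesterday", context)
```

Keeping the judge behind a plain callable means the scoring logic and the alerting threshold can be tested without any model in the loop, and the flagged verdicts can be exported as an ordinary metric alongside latency.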

The Cultural Shift & Historical Parallels

The conversation draws parallels to the adoption of tracing in traditional observability. Initially, developers and operations teams clashed over access to and control of tracing data. However, the rise of APM (Application Performance Management) and unified observability platforms fostered collaboration and shared data access. The speakers believe a similar cultural shift is needed for AI, breaking down silos and encouraging cross-team collaboration. Their core argument is that AI is forcing this change: models now directly affect revenue and customer experience, which increases the pressure for operational visibility. As Martin states, “AI is trying to break that down, right? And it's always going to be hard and it's always going to take time.”

Best Practices for Getting Started

For companies just beginning with AI, the speakers recommend focusing on the fundamentals of observability. Don't abandon established practices. Specifically:

Establish Baseline Metrics & KPIs: Define clear metrics and KPIs to measure the performance of AI applications and their impact on business objectives.
Monitor the Supporting Infrastructure: Continue to monitor databases, APIs, and other infrastructure components that support the AI application.
Leverage OpenTelemetry: The speakers emphasize the positive impact of AI services (such as Amazon Bedrock and Amazon SageMaker) embracing OpenTelemetry, which provides a standardized way to collect and export telemetry data. This simplifies integration with existing observability tools.
Focus on Trust: Building trust in AI is paramount. Observability plays a crucial role in demonstrating the reliability and accuracy of AI models.
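
As a minimal sketch of the first recommendation above, baseline metrics can be checked against SLO targets with a few lines of code. The record shape, function names, and target values (a 2000 ms p95 latency and a 1% error rate) are assumptions chosen for illustration, not figures from the discussion.

```python
import math
from dataclasses import dataclass

@dataclass
class RequestRecord:
    latency_ms: float
    ok: bool  # False when the request failed or returned an error

def percentile(values, pct):
    """Nearest-rank percentile; pct is in (0, 100]."""
    if not values:
        raise ValueError("no values")
    ordered = sorted(values)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]

def check_slos(records, p95_target_ms=2000.0, max_error_rate=0.01):
    """Compare observed p95 latency and error rate with their SLO targets."""
    p95 = percentile([r.latency_ms for r in records], 95)
    error_rate = sum(1 for r in records if not r.ok) / len(records)
    return {
        "p95_latency_ms": p95,
        "error_rate": error_rate,
        "latency_slo_met": p95 <= p95_target_ms,
        "error_slo_met": error_rate <= max_error_rate,
    }

# Twenty hypothetical requests: latencies of 100..2000 ms, the slowest one failed.
records = [RequestRecord(latency_ms=100.0 * i, ok=(i != 20)) for i in range(1, 21)]
result = check_slos(records)
```

The same check applies whether the records come from a database, an API gateway, or a model endpoint, which is the point of keeping established practices in place for AI workloads.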

The Unique Challenges of AI Observability

While traditional observability principles apply, AI introduces unique challenges:

Non-Determinism: Unlike most traditional applications, AI models are not deterministic. Their outputs can vary even for identical inputs, which makes issues harder to reproduce and diagnose.
User Feedback Reliance: Assessing the quality of AI outputs often requires human feedback, adding complexity to the monitoring process.
Business Value Measurement: Determining the business value generated by AI models is challenging, requiring new metrics and approaches.
Security & Compliance: Data privacy and security are critical concerns, requiring robust access controls and audit logging.
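
One way to put a number on the non-determinism described above is to send the same prompt several times and measure how often the answers agree. The sketch below assumes a hypothetical `model` callable that takes a prompt and returns text; `make_flaky_model` is a stand-in that cycles through canned answers so the example runs without a real LLM client.

```python
import itertools
from collections import Counter

def consistency(outputs):
    """Share of runs that match the most common output (1.0 = fully repeatable)."""
    if not outputs:
        raise ValueError("no outputs")
    normalized = [o.strip().lower() for o in outputs]
    top_count = Counter(normalized).most_common(1)[0][1]
    return top_count / len(normalized)

def sample_outputs(model, prompt, runs=5):
    """Call the model repeatedly with the same prompt and collect the answers."""
    return [model(prompt) for _ in range(runs)]

def make_flaky_model(answers):
    """Hypothetical stand-in for an LLM client: cycles through canned answers."""
    cycle = itertools.cycle(answers)
    return lambda prompt: next(cycle)

model = make_flaky_model(["Paris", "paris ", "Lyon"])
outputs = sample_outputs(model, "What is the capital of France?", runs=6)
score = consistency(outputs)
```

Exact-match comparison after normalization is the simplest possible agreement measure; free-form answers would need a looser comparison, which is where human feedback or a judge model comes in.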

Emerging Trends & Future Outlook

The discussion highlights several emerging trends:

AI-Powered Observability: Using AI to automate root cause analysis, summarize incidents, and provide insights from observability data. IBM is already utilizing AI for automated root cause analysis and incident summarization.
Business-Focused Observability: Translating technical observability data into business-relevant insights for stakeholders. The ideal scenario, as described by Martin, is a daily report summarizing failures, resolution times, revenue impact, and customer impact.
The Need for New Skills: The observability landscape is evolving, requiring professionals to develop new skills in AI, machine learning, and data analysis.
The Rise of AI SR Agents: AWS’s AI Service Representative (SR) agent is an example of a tool that can help automate tasks and improve efficiency.
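
The daily report Martin describes could be assembled from incident records along these lines. This is a sketch under assumptions: the `Incident` fields and the summary keys are hypothetical, and the revenue figures are taken to be estimates supplied by whatever system tracks the incidents.

```python
from dataclasses import dataclass

@dataclass
class Incident:
    service: str
    minutes_to_resolve: float
    revenue_lost: float       # estimated, in the account's currency
    customers_affected: int

def daily_report(incidents):
    """Roll up one day's incidents into business-facing totals."""
    if not incidents:
        return {"failures": 0, "mean_minutes_to_resolve": 0.0,
                "revenue_impact": 0.0, "customers_affected": 0}
    total = len(incidents)
    return {
        "failures": total,
        "mean_minutes_to_resolve":
            sum(i.minutes_to_resolve for i in incidents) / total,
        "revenue_impact": sum(i.revenue_lost for i in incidents),
        # Simple sum: a customer hit by two incidents is counted twice.
        "customers_affected": sum(i.customers_affected for i in incidents),
    }

incidents = [
    Incident("chatbot", minutes_to_resolve=30, revenue_lost=1200.0, customers_affected=40),
    Incident("search", minutes_to_resolve=90, revenue_lost=300.0, customers_affected=10),
]
report = daily_report(incidents)
```

Returning a plain dictionary keeps the rollup independent of how it is delivered, whether as an email, a dashboard tile, or input to an AI-generated summary.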

Notable Quotes

Thanos: “Unfortunately, it’s an afterthought at this point…which makes our life a little bit harder to be able to showcase our value.” (Regarding observability being implemented late in the AI lifecycle)
Martin: “It always comes down to culture at the end of the day, it seems like.” (Highlighting the importance of breaking down silos)
Martin: “If you don't trust the AI and you don't trust the underlying infrastructure, it's dead on the water, right?” (Emphasizing the importance of trust in both the AI model and the supporting infrastructure)

Data & Statistics (Implied)

While no specific numbers were cited, the conversation implies:

Rapid Growth in AI Adoption: The increasing demand for AI solutions is driving the need for robust observability practices.
High Cost of AI Infrastructure: The expense of GPUs and TPUs suggests that organizations will be more cautious about AI adoption and prioritize careful planning and monitoring.

Conclusion

The conversation underscores that successful AI deployment requires a proactive and holistic approach to observability. Breaking down silos between data science and operations teams, focusing on fundamental observability principles, and embracing emerging AI-powered observability tools are crucial steps. While challenges remain, the speakers express optimism that the industry will learn from past experiences and develop best practices for navigating the complexities of AI observability, ultimately leading to more reliable, trustworthy, and valuable AI applications. The key takeaway is that observability isn’t just about monitoring; it’s about building trust and ensuring that AI delivers on its promise.