From Group Science Project to Enterprise Service: Rethinking OpenTelemetry

By The New Stack

Key Concepts

  • Proactive Operations: A shift from reactive incident response to AI-driven, automated change control based on real-time telemetry.
  • OpenTelemetry (OTel): An open-source observability framework for generating, collecting, and exporting telemetry data (metrics, logs, and traces).
  • Observability: The ability to understand the internal state of a system by examining its external outputs. It goes beyond identifying that something is broken to understanding why.
  • Mean Time to Resolution (MTTR): A key metric in incident management, representing the average time taken to resolve an incident.
  • OTel (OpenTelemetry) as a Service: Utilizing OpenTelemetry as an ETL (Extract, Transform, Load) pipeline for telemetry data, either on the wire or within applications.
  • Platform Engineering: A team focused on building and maintaining internal developer platforms, including observability infrastructure.
  • If-This-Then-That (ITT): A fundamental logic construct used by My Decisive AI to automate responses to telemetry data changes.
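
Since MTTR is defined above as an average, a small worked example may help. The sketch below is only an illustration: the function name `mttr` and the (opened, resolved) timestamp pairs are assumptions for this example, not part of any tool discussed in the video.

```python
from datetime import datetime, timedelta


def mttr(incidents: list[tuple[datetime, datetime]]) -> timedelta:
    """Mean Time to Resolution: average of (resolved - opened) over all incidents.

    `incidents` is a hypothetical list of (opened, resolved) timestamp pairs.
    """
    if not incidents:
        raise ValueError("at least one incident is required")
    # Sum the individual resolution times, starting from a zero timedelta.
    total = sum((resolved - opened for opened, resolved in incidents), timedelta())
    return total / len(incidents)


# Two made-up incidents: one resolved in 30 minutes, one in 90 minutes.
example_incidents = [
    (datetime(2024, 1, 1, 10, 0), datetime(2024, 1, 1, 10, 30)),
    (datetime(2024, 1, 2, 14, 0), datetime(2024, 1, 2, 15, 30)),
]
example_mttr = mttr(example_incidents)
```

With these two made-up incidents the average works out to one hour.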

The Shift from Reactive to Proactive Operations & the Role of My Decisive AI

The conversation centers on Ari Zilka of My Decisive AI and the company’s approach to observability: moving beyond traditional reactive incident management toward proactive operations. Zilka argues that current observability tools provide dashboards and alerts but fail to significantly reduce MTTR, because a human must still step in at the critical decision point: whether to roll a change back or forward. He adds that numerous CIOs have expressed frustration with high observability costs and a lack of tangible MTTR reduction.

My Decisive AI aims to solve this by acting as a “robot” that automates change control, making decisions based on real-time telemetry. Zilka emphasizes that the core problem isn’t identifying that something is broken (as current tools attempt), but rather identifying broken changes and automating the response. He states, “Nobody was reducing MTTR significantly…by the time you’re broken, it’s too late. What you want to do is make sure that the humans aren’t in the loop making the changes and that you make every production change staring at your telemetry.”

Underlying Technology & OpenTelemetry Integration

My Decisive AI is built on a foundation of open-source technologies, including OpenTelemetry, Kubernetes, Prometheus, NATS, and HAProxy. This is a deliberate design choice, allowing the software to run on-premises, be self-managed by the user, and even be used without direct payment (the core service is open source). Zilka describes the service as “OTel as a service for on-prem.”

OTel as a Service is explained as having two primary use cases: acting as an ETL box to transform telemetry data before it reaches observability vendors, and functioning as a replacement for legacy language agents within applications to collect telemetry. My Decisive AI leverages the “gateway mode” of OpenTelemetry, providing an enterprise-hardened, simplified interface for controlling telemetry data flow. Users can filter data, route it to different vendors, and even define complex “if-this-then-that” logic on the wire.
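
To make the idea of filtering and routing "on the wire" concrete, here is a minimal sketch of first-match if-this-then-that rules applied to telemetry records. It is not My Decisive AI's interface or the OpenTelemetry Collector's configuration format: the `Rule` class, the `route` function, the record fields (`type`, `severity`), and the destination names `vendor_a` and `vendor_b` are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Callable

# A telemetry record is modeled as a plain dict for this sketch;
# real OpenTelemetry data carries far more structure.
Record = dict


@dataclass
class Rule:
    """Hypothetical rule: IF `condition(record)` THEN send to `destinations`."""
    name: str
    condition: Callable[[Record], bool]
    destinations: list[str]  # an empty list means "drop the record"


def route(record: Record, rules: list[Rule], default: list[str]) -> list[str]:
    """Return the destinations of the first matching rule, else the default."""
    for rule in rules:
        if rule.condition(record):
            return rule.destinations
    return default


# Illustrative rules: drop debug logs, fan errors out to two vendors.
RULES = [
    Rule(
        name="drop-debug-logs",
        condition=lambda r: r.get("type") == "log" and r.get("severity") == "DEBUG",
        destinations=[],
    ),
    Rule(
        name="errors-to-both-vendors",
        condition=lambda r: r.get("severity") == "ERROR",
        destinations=["vendor_a", "vendor_b"],
    ),
]

DEFAULT_DESTINATIONS = ["vendor_a"]
```

Under these assumptions, a debug log is dropped, an error is sent to both vendors, and everything else falls through to the default destination. Keeping the rules as data rather than application code is the point of the sketch: they can change without touching the services that emit the telemetry.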

The Evolving Role of Platform Engineers & Observability Tools

Zilka discusses the changing role of platform engineers, highlighting a shift from simply managing vendor contracts (like New Relic and Datadog) to actively programming and owning the observability infrastructure. He recounts an experience with a large streaming media company that built a 40-50 person team to manage their OpenTelemetry stack, demonstrating the growing importance of this role.

He argues that existing telemetry tools often reduce platform engineers to vendor management, while OpenTelemetry gives them greater control and ownership. However, he acknowledges the complexity of OpenTelemetry setup and maintenance, stating, “OTel requires so much setup, programming and maintenance that can only be done in the application code.” My Decisive AI aims to bridge this gap by decoupling the complexity of OpenTelemetry from application code, so platform teams can use its power without extensive developer involvement.

Automating Change Control with "If-This-Then-That" Logic

The core innovation of My Decisive AI lies in its “if-this-then-that” interface for OpenTelemetry. Instead of requiring developers to write complex pipelines for data filtering and routing, My Decisive AI discovers patterns and suggests automation opportunities.

Zilka explains, “It's got the right hooks, the right interface where a human doesn't have to join the loop.” The system learns from existing telemetry data and can automatically filter data, roll back changes based on error rate increases, or stop filtering data during incident investigations – all without requiring manual intervention. This automation is designed to reduce toil for platform engineers and empower developers without requiring them to become OpenTelemetry experts. He specifically notes a desire to move away from simply presenting data on a dashboard and instead automating responses to that data.
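
The rollback behavior described above can be sketched as a simple decision rule that compares the error rate after a change with the rate before it. This is an illustration under stated assumptions, not the product's actual logic: the `Window` class, the `decide` function, and the thresholds `max_increase` and `min_requests` are hypothetical names and values.

```python
from dataclasses import dataclass


@dataclass
class Window:
    """Request and error counts observed over some time window."""
    requests: int
    errors: int

    @property
    def error_rate(self) -> float:
        # Guard against division by zero when no traffic was observed.
        return self.errors / self.requests if self.requests else 0.0


def decide(
    baseline: Window,
    current: Window,
    max_increase: float = 0.02,  # assumed tolerance: 2 percentage points
    min_requests: int = 100,     # assumed minimum traffic before judging
) -> str:
    """Return 'wait', 'roll_back', or 'keep' for a freshly deployed change."""
    if current.requests < min_requests:
        return "wait"  # not enough post-change traffic to judge
    if current.error_rate - baseline.error_rate > max_increase:
        return "roll_back"  # IF error rate rose too much THEN roll back
    return "keep"


before = Window(requests=1000, errors=10)       # 1% errors before the change
after_bad = Window(requests=500, errors=50)     # 10% errors after the change
after_ok = Window(requests=500, errors=6)       # 1.2% errors after the change
after_quiet = Window(requests=20, errors=10)    # too little traffic so far
```

Here `decide(before, after_bad)` yields a rollback, `decide(before, after_ok)` keeps the change, and `decide(before, after_quiet)` waits for more data. The minimum-traffic guard is a design choice in this sketch to avoid reacting to a handful of requests.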

Business Model & Open Source Strategy

My Decisive AI’s core service is open source and will be submitted to the CNCF (Cloud Native Computing Foundation). The license is described as a permissive GPL-based license, requiring attribution for commercial use but offering unrestricted use for end-user companies.

Currently, the company’s revenue model is based on subscription support. Zilka acknowledges a desire to eventually adopt a more common open-source/commercial add-on model, but states they are not there yet. The focus is currently on building a strong community and providing support to users.

Scalability & Future Considerations

My Decisive AI is designed to scale to large environments, with the ability to manage millions of OpenTelemetry instances as if they were a single, centralized system. Zilka also touches on the potential impact of emerging AI technologies within cloud-native environments, noting that My Decisive AI can filter and analyze data from these sources as well. He emphasizes the importance of being “on the wire” to capture data from all layers of the stack, from infrastructure to client applications.

Conclusion

My Decisive AI represents a significant shift in the observability landscape, moving beyond reactive monitoring to proactive, automated change control. By leveraging OpenTelemetry and a simple “if-this-then-that” interface, the company aims to empower platform engineers, reduce MTTR, and lower the cost of observability. The open-source nature of the core service and the focus on enterprise-grade scalability position My Decisive AI as a potentially disruptive force in the rapidly evolving world of cloud-native observability. The key takeaway is a move towards automated action based on telemetry, rather than simply visualization of telemetry data.
