Human Cognition Can’t Keep Up with Modern Networks. What’s Next?

Key Concepts

Agentic AI: AI systems designed to act autonomously on behalf of users, particularly in operational contexts.
Time Series Foundation Models: AI models specifically trained on time-stamped data, crucial for understanding network behavior.
Knowledge Graphs: Structured representations of knowledge, used to encode institutional expertise and provide context for AI reasoning.
Ontologies: Formal descriptions of concepts and relationships within a domain, defining rules and constraints for AI agents.
Hybrid Cloud, Data, AI, and Automation: IBM’s overarching strategy, emphasizing the integration of these elements.
Data Silos & Fragmentation: The challenge of disparate data sources and inconsistent data models hindering AI effectiveness.
Trust in AI Operations: A critical barrier to adoption, stemming from the potential for costly errors and lack of transparency.
LLM Scaffolding: The process of augmenting Large Language Models with specialized components (like time series models) to improve accuracy and reliability.

The Challenges Facing Network Operations Teams

Network operations teams are facing increasing complexity due to modern, distributed, and software-defined networks. This complexity leads to a low signal-to-noise ratio in data, making it difficult to identify and resolve issues. Specifically, Sanil Nambier identified five key challenges:

Rising Complexity: Modern networks have outstripped human cognitive abilities.
Low Signal-to-Noise Ratio: The sheer volume of data makes it difficult to discern meaningful information.
Data Silos & Fragmentation: Data resides in disparate systems with inconsistent models (e.g., SNMP using MIBs, configuration data using YANG models), hindering holistic understanding.
Expertise Barrier: Sophisticated monitoring tools require specialized knowledge, and valuable tribal knowledge held by senior engineers is difficult to transfer.
Lack of Trust in AI: The high cost of errors in production environments creates reluctance to rely on AI recommendations. This trust issue is rooted in data silos, fragmentation, and lack of real-time data access.
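
To make the fragmentation point concrete, here is a minimal sketch of the same interface state arriving in two shapes and being normalized into one record. The integer codes follow the standard IF-MIB ifOperStatus values and the YANG field names follow the ietf-interfaces model. The functions from_snmp and from_yang and the common record layout are hypothetical, chosen for illustration only.

```python
# Subset of IF-MIB ifOperStatus integer codes.
SNMP_OPER_STATUS = {1: "up", 2: "down", 3: "testing"}


def from_snmp(row):
    """Map an SNMP IF-MIB style row to a common record (hypothetical layout)."""
    return {
        "interface": row["ifDescr"],
        "oper_status": SNMP_OPER_STATUS.get(row["ifOperStatus"], "unknown"),
    }


def from_yang(node):
    """Map an ietf-interfaces style node to the same common record."""
    return {
        "interface": node["name"],
        "oper_status": node["oper-status"],
    }


# The same fact, expressed by two different data models.
snmp_row = {"ifDescr": "GigabitEthernet0/1", "ifOperStatus": 2}
yang_node = {"name": "GigabitEthernet0/1", "oper-status": "down"}

snmp_record = from_snmp(snmp_row)
yang_record = from_yang(yang_node)
print(snmp_record == yang_record)
```

Until both records share one shape, neither a human nor an AI agent can easily tell that they describe the same interface in the same state.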

IBM’s Strategy and Network Intelligence

IBM’s strategy centers on the integration of hybrid cloud, data, AI, and automation. It is underpinned by three foundational platforms: Red Hat (consistent runtime environment), HashiCorp (lifecycle control and policy-driven automation), and Confluent (real-time data access), which IBM plans to acquire.

IBM Network Intelligence, launched in September, is a product built on this strategy. It leverages time series foundation models and agentic AI to address the challenges faced by network operations teams. The core principle is to provide AI with access to real-time data with context, rather than relying on isolated data points.

How Agentic AI Addresses Key Issues

Agentic AI, as implemented in IBM Network Intelligence, can address the identified challenges in several ways:

Data Silos: AI agents can act as “context stitchers,” pulling data from various sources (metrics, alarms, topology, configuration, tickets) to create a coherent picture. This is most effective when grounded in structured data and knowledge graphs.
Data Fragmentation: Knowledge graphs and ontologies are crucial for aligning disparate data models. Ontologies define rules and constraints, allowing agents to reason deterministically based on domain expertise. For example, an ontology for a telco network would specify that a cell site must have a frequency band.
Expertise Barrier: By encoding institutional knowledge into knowledge graphs, AI agents can replicate the reasoning of experienced engineers, making that expertise accessible to a wider range of personnel.
Accelerating Resolution Times: Agents can automate the initial stages of incident response by assembling timelines, identifying impacted services, and suggesting potential solutions based on historical data. This frees up senior engineers to focus on more complex issues.
Shifting Left & Repurposing Engineers: Agentic AI can enable junior engineers to handle more complex tasks, while allowing senior engineers to focus on planning, design, and building better systems.
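
The "context stitching" and timeline assembly described above can be sketched roughly as follows. This is an illustrative sketch, not IBM Network Intelligence code: the Event type, the stitch_timeline function, the source names, and the sample data are all assumptions.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class Event:
    timestamp: datetime
    source: str   # e.g. "metrics", "alarms", "tickets" (hypothetical names)
    device: str
    detail: str


def stitch_timeline(sources, device, center, window=timedelta(minutes=15)):
    """Merge events about one device from several sources into one timeline.

    Keeps only events within `window` of `center`, sorted by time.
    """
    merged = [
        event
        for events in sources.values()
        for event in events
        if event.device == device and abs(event.timestamp - center) <= window
    ]
    return sorted(merged, key=lambda event: event.timestamp)


t0 = datetime(2025, 1, 1, 12, 0)
sources = {
    "metrics": [Event(t0 - timedelta(minutes=5), "metrics", "router-1", "latency rising")],
    "alarms": [
        Event(t0, "alarms", "router-1", "link down"),
        Event(t0, "alarms", "router-2", "fan failure"),
    ],
    "tickets": [Event(t0 + timedelta(minutes=3), "tickets", "router-1", "packet loss reported")],
    "configuration": [Event(t0 - timedelta(hours=2), "configuration", "router-1", "ACL updated")],
}

timeline = stitch_timeline(sources, "router-1", t0)
for event in timeline:
    print(event.timestamp.isoformat(), event.source, event.detail)
```

A production system would presumably resolve device identity through the knowledge graph rather than by exact string match, which is where the structured grounding mentioned above comes in.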

The Role of Knowledge Graphs and Ontologies

Knowledge graphs are central to IBM’s approach. They provide a structured representation of relationships between network elements, allowing AI agents to understand why things are connected, not just that they are connected. Ontologies define the rules and constraints within the knowledge graph, ensuring that the AI reasoning is grounded in domain expertise. This is critical for building trust and avoiding erroneous recommendations.
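
As a rough illustration of how an ontology rule can be checked deterministically, the sketch below encodes the telco example from earlier (a cell site must have a frequency band) as a required-property constraint. The REQUIRED_PROPERTIES table, the validate function, and the sample entities are hypothetical. Real ontologies are usually expressed in dedicated languages rather than Python dictionaries.

```python
# Hypothetical ontology fragment: required properties per entity type.
REQUIRED_PROPERTIES = {
    "CellSite": {"frequency_band"},
    "Router": {"management_ip"},
}


def validate(entity):
    """Return the constraint violations for one knowledge-graph entity."""
    required = REQUIRED_PROPERTIES.get(entity["type"], set())
    missing = sorted(prop for prop in required if not entity.get(prop))
    return [f"{entity['id']}: {entity['type']} must have {prop}" for prop in missing]


entities = [
    {"id": "site-001", "type": "CellSite", "frequency_band": "n78"},
    {"id": "site-002", "type": "CellSite"},  # violates the rule
]

violations = [v for entity in entities for v in validate(entity)]
print(violations)
```

Because the check is a plain rule rather than a model prediction, it gives the same answer every time, which is the property that makes this kind of grounding useful for trust.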

Limits of Agentic AI and the Need for Human-in-the-Loop

Despite its potential, agentic AI has limitations. Sanil Nambier emphasized the importance of avoiding over-selling autonomy. Humans are still needed in the loop for:

High-Risk Actions: Traffic rerouting, configuration changes, and failovers require human approval.
Ambiguous Situations: When multiple root causes are plausible or data is incomplete, human judgment is essential.
Complex Scenarios: Issues spanning technical and business domains often require human oversight.

Furthermore, deep observability of the AI agents themselves is crucial for monitoring performance, identifying biases, and continuously improving the system.
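
One way to picture the human-in-the-loop rules above is a small gate that every proposed agent action passes through, with each decision recorded for later review. This is a sketch under stated assumptions: the action names, the 0.8 confidence threshold, the decide function, and the audit_log structure are all invented for illustration.

```python
# Hypothetical action names for the high-risk categories listed above.
HIGH_RISK_ACTIONS = {"reroute_traffic", "change_configuration", "failover"}

audit_log = []  # every decision is recorded so the agent can be observed


def decide(action, confidence, approved_by=None, min_confidence=0.8):
    """Return 'execute', 'needs_approval', or 'escalate' for a proposed action."""
    if confidence < min_confidence:
        outcome = "escalate"        # ambiguous case: hand over to a human
    elif action in HIGH_RISK_ACTIONS and approved_by is None:
        outcome = "needs_approval"  # high-risk: wait for human sign-off
    else:
        outcome = "execute"
    audit_log.append(
        {"action": action, "confidence": confidence,
         "approved_by": approved_by, "outcome": outcome}
    )
    return outcome


print(decide("collect_logs", 0.95))
print(decide("failover", 0.95))
print(decide("failover", 0.95, approved_by="alice"))
print(decide("collect_logs", 0.50))
```

The audit log is a stand-in for the observability point: a record of what the agent proposed, how confident it was, and who approved it gives the team something to review when tuning thresholds or investigating a bad recommendation.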

IBM Network Intelligence in Action

IBM Network Intelligence applies the principles discussed above. It combines time series foundation models (for understanding temporal patterns) with reasoning LLMs and agentic AI to provide actionable insights. The architecture is designed for accuracy, trust, and scalability.

Data, Research Findings, and Statistics

While specific statistics weren’t provided, the conversation highlighted the following:

The increasing complexity of networks is outpacing human cognitive abilities.
Most major outages are preceded by subtle operational patterns that are often missed by traditional monitoring tools.
The cost of being wrong with AI recommendations in production environments is very high (potential for outages, SLA breaches).
Burnout is a significant operational risk due to the concentration of expertise in a small number of engineers.

Conclusion

Agentic AI holds significant promise for transforming network operations, but its success hinges on a well-architected approach that prioritizes trust, data integration, and human oversight. IBM’s strategy, centered on hybrid cloud, data, AI, and automation, and its implementation in IBM Network Intelligence, aims to address the key challenges facing network operations teams by providing AI with the context and knowledge it needs to operate effectively. The key takeaway is that AI is not a replacement for human expertise, but a tool to augment it, enabling engineers to focus on higher-value tasks and build more resilient and efficient networks.