Meet Gravitino, a geo-distributed, federated metadata lake

By The New Stack

Key Concepts

  • Apache Gravitino: A high-performance, geo-distributed, federated metadata lake designed to unify metadata, access control, and governance across multi-engine and multi-cloud environments.
  • Catalog of Catalogs: Gravitino’s approach to metadata management, aggregating metadata from various data platforms instead of replacing existing catalogs.
  • Technical Metadata: Metadata focused on engine and agent consumption, including schema information and access details, as opposed to business-focused metadata.
  • Agentic AI: The emerging paradigm of AI systems that proactively discover and utilize data, requiring a different approach to metadata management than traditional BI.
  • Multimodality Lakehouse: A data architecture supporting diverse data formats (structured, unstructured, vector data) and multiple compute engines.
  • Control Plane: Gravitino functions as a centralized control plane for metadata, enabling consistent governance and access control.

Datastrato & Apache Gravitino: A Deep Dive into Unified Metadata Management

Introduction

This discussion centers on Apache Gravitino, a high-performance federated metadata lake, and its role in modern data and AI infrastructure. Junping Du, founder and CEO of Datastrato (the company behind Gravitino), explains the project’s origins, functionality, and future direction, particularly in the context of rapidly evolving AI technologies. Gravitino graduated to an Apache top-level project in June, released its first stable version (1.1) in December, and joined the Agentic AI Foundation in early 2026.

The Problem Gravitino Solves

The core problem Gravitino addresses is the increasing complexity of managing metadata across diverse data platforms. Before Gravitino, data catalogs (such as Apache Atlas and vendor offerings from Databricks and Snowflake) were designed primarily for Business Intelligence (BI) use cases and lacked the engine-friendly, agent-centric metadata that modern AI workloads require. Specifically, the challenges include:

  • Data Silos: Data residing in multiple engines (Spark, Trino, Ray, PyTorch) and cloud environments, leading to inconsistent metadata and governance.
  • Catalog Silos: Existing data catalogs being tied to specific platforms, hindering unified access control and governance.
  • Unstructured Data Governance: Difficulty managing and governing unstructured data, particularly as it becomes increasingly important for AI applications.
  • Lack of a Centralized Control Plane: The absence of a single, authoritative source for defining data existence, access permissions, and governance policies.

As Junping Du stated, “before Gravitino…metadata or sometimes we call it a catalog, it’s lived in a siloed catalog instead of a unified catalog to know everything.”

How Gravitino Works: The "Catalog of Catalogs" Approach

Gravitino differentiates itself through its “catalog of catalogs” approach. Instead of replacing existing data catalogs, it integrates with them, creating a unified metadata layer. This is achieved by:

  1. Hooking into Data Systems: Gravitino connects directly to various data systems (databases, data warehouses, data lakes, file systems) and sub-catalogs.
  2. Metadata Collection: It actively or reactively collects metadata (schema information, access controls, etc.) from these sources.
  3. Unified Data Catalog: It builds a unified catalog that encompasses structured, semi-structured, and unstructured data formats (including Lance and vector data).
  4. Engine Access: Multiple engines can access this unified metadata, eliminating data silos and enabling seamless data consumption by AI agents.

This architecture allows Gravitino to provide a single, engine-neutral control plane for metadata and governance.
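
The four steps above can be pictured with a small, purely illustrative sketch. It is not Gravitino's API: `TableMeta`, `InMemoryCatalog`, and `FederatedCatalog` are hypothetical names, and the in-memory catalogs stand in for real systems such as a warehouse or an object store. The point is only that the unified layer registers existing catalogs and delegates to them rather than copying or replacing them.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TableMeta:
    """Technical metadata for one table-like object (hypothetical shape)."""
    name: str
    columns: tuple
    location: str


class InMemoryCatalog:
    """Stand-in for an existing catalog such as a warehouse or a file system."""

    def __init__(self, tables):
        self._tables = {t.name: t for t in tables}

    def list_tables(self):
        return sorted(self._tables)

    def load_table(self, name):
        return self._tables[name]


class FederatedCatalog:
    """A 'catalog of catalogs': registers sub-catalogs and delegates to them."""

    def __init__(self):
        self._catalogs = {}

    def register(self, catalog_name, catalog):
        # Step 1: hook into an existing system instead of replacing it.
        self._catalogs[catalog_name] = catalog

    def list_tables(self):
        # Steps 2-3: collect metadata and expose one namespace,
        # addressing every table as "<catalog>.<table>".
        return sorted(
            f"{cname}.{tname}"
            for cname, cat in self._catalogs.items()
            for tname in cat.list_tables()
        )

    def load_table(self, qualified_name):
        # Step 4: any engine resolves a table through the same entry point.
        cname, _, tname = qualified_name.partition(".")
        return self._catalogs[cname].load_table(tname)


lake = FederatedCatalog()
lake.register("warehouse", InMemoryCatalog(
    [TableMeta("orders", ("id", "amount"), "jdbc://warehouse/orders")]))
lake.register("files", InMemoryCatalog(
    [TableMeta("images", ("path", "embedding"), "s3://bucket/images/")]))

print(lake.list_tables())
print(lake.load_table("files.images").location)
```

Because lookups are delegated, the underlying catalogs remain the systems of record; the federated layer only adds a common naming scheme on top.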

Gravitino’s Origin Story

Gravitino originated from practical, recurring problems that the founders (Jerry and Junping Du) faced during their combined 30+ years of experience building data infrastructure. They observed that as data and AI systems grew in complexity, teams consistently struggled with data silos, inconsistent metadata, and difficulties in applying governance policies. The need for a unified metadata solution became increasingly apparent, leading to the development of Gravitino.

Technical Implementation & Architecture

Gravitino is primarily built in Java, leveraging the Java ecosystem’s maturity and performance characteristics. However, it also provides Python clients for ease of integration with AI frameworks like PyTorch, Ray, and Daft. The core of Gravitino is a high-performance server designed for concurrent access and multi-tenancy.

Key technical aspects include:

  • Apache Gravitino: The core open-source project.
  • Apache Iceberg Support: Gravitino provides a REST catalog for Iceberg, addressing the need for centralized access control and operational management.
  • Vector Data Support: Native support for vector databases and vector data formats.
  • Physical Metadata: Beyond logical metadata (schemas), Gravitino also manages physical metadata related to indexing, caching, and table maintenance.
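
To make the Iceberg REST catalog bullet more concrete: an engine talks to such a catalog over plain HTTP routes defined by the Iceberg REST catalog specification, for example `/v1/namespaces` and `/v1/namespaces/{namespace}/tables`. The sketch below only builds those URLs and sends no requests. The base URL is a hypothetical placeholder, the helper names are invented for illustration, and the path under which a given Gravitino deployment serves its Iceberg endpoint is an assumption to check against that deployment.

```python
from urllib.parse import quote

# The Iceberg REST spec joins multi-level namespace parts with the
# unit separator character (0x1F) before URL-encoding them.
UNIT_SEPARATOR = "\x1f"


def _root(base_url, prefix=""):
    """Return '<base>/v1' or '<base>/v1/<prefix>' when a prefix is configured."""
    root = base_url.rstrip("/") + "/v1"
    return f"{root}/{quote(prefix, safe='')}" if prefix else root


def namespaces_url(base_url, prefix=""):
    """URL for listing namespaces (an HTTP GET in the spec)."""
    return _root(base_url, prefix) + "/namespaces"


def tables_url(base_url, namespace_levels, prefix=""):
    """URL for listing the tables in one namespace (an HTTP GET in the spec)."""
    encoded = quote(UNIT_SEPARATOR.join(namespace_levels), safe="")
    return f"{_root(base_url, prefix)}/namespaces/{encoded}/tables"


BASE = "http://catalog.example.com/iceberg"  # hypothetical endpoint

print(namespaces_url(BASE))
print(tables_url(BASE, ["sales", "eu"]))
```

Since every engine uses the same HTTP surface, access control and operational management can be applied at the catalog server rather than separately in each engine.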

Use Cases & Applications

Gravitino is particularly valuable for:

  • Cloud Data Consolidation: Unifying metadata across multiple clouds and regions, enabling consistent data access for distributed compute resources. A large US internet technology company is using Gravitino for this purpose.
  • Lakehouse Service Building: Providing a control plane for Apache Iceberg, addressing its limitations in access control, multi-tenancy, and operational management.
  • Multimodality Lakehouse: Managing diverse data formats (structured, unstructured, vector data) and multiple compute engines in a unified environment.
  • Agentic AI Workloads: Facilitating data discovery and access for AI agents, enabling proactive data utilization.

Gravitino vs. Existing Solutions

Gravitino differentiates itself from existing solutions (platform-native catalogs like Snowflake Polaris and open-source projects like DataHub and Amundsen) in several key ways:

  • Vendor Neutrality: Unlike platform-native catalogs, Gravitino is engine-neutral and vendor-neutral, providing a unified view across diverse environments.
  • Engine-Centric Metadata: Gravitino focuses on providing technical metadata that is directly consumable by engines and agents, unlike many existing catalogs that prioritize business-focused metadata.
  • Multimodality Support: Gravitino is designed to natively support diverse data formats, including unstructured and vector data, while many existing solutions primarily focus on structured data.
  • Runtime Enforcement: Gravitino provides runtime enforcement of governance policies across different engines.
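
One way to picture runtime enforcement across engines is that every engine consults the same policy store before it touches data, so a grant is defined once and applies everywhere. The sketch below is a toy model under that assumption: `Grant`, `PolicyStore`, and `Engine` are hypothetical names and do not reflect Gravitino's actual authorization model or API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Grant:
    """One permission: who may do what on which object (hypothetical shape)."""
    principal: str
    securable: str   # e.g. "warehouse.orders"
    privilege: str   # e.g. "SELECT"


class PolicyStore:
    """Single source of truth for access decisions."""

    def __init__(self, grants):
        self._grants = set(grants)

    def is_allowed(self, principal, securable, privilege):
        return Grant(principal, securable, privilege) in self._grants


class Engine:
    """Stand-in for a compute engine that checks the shared store before reading."""

    def __init__(self, name, policies):
        self.name = name
        self._policies = policies

    def read(self, principal, table):
        if not self._policies.is_allowed(principal, table, "SELECT"):
            raise PermissionError(f"{self.name}: {principal} may not read {table}")
        return f"{self.name} read {table} for {principal}"


policies = PolicyStore([Grant("analyst", "warehouse.orders", "SELECT")])
engines = [Engine("spark", policies), Engine("trino", policies)]

for engine in engines:
    print(engine.read("analyst", "warehouse.orders"))
    try:
        engine.read("intern", "warehouse.orders")
    except PermissionError as err:
        print("denied:", err)
```

Both stand-in engines reach the same decision for the same principal because neither keeps its own copy of the rules.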

As Junping Du explained, “Gravitino is a technical metadata center which is engine friendly or engine centric.”

Recent Developments & Future Roadmap (Version 1.1 & Beyond)

The 1.1 release of Gravitino included:

  • Enhanced Multimodality Support: Native Lance REST catalog support and integration with the Daft engine.
  • Broader Lakehouse Support: A generic lakehouse catalog for extensibility with new table formats and engines.
  • Improved Security: Enhanced enterprise security features.
  • Operational Optimizations: Improvements for scalability and multi-cluster deployments.

The future roadmap for Gravitino and Datastrato focuses on:

  • Expanding Multimodality Support: Further enhancing support for diverse data formats and engines.
  • Agentic AI Integration: Optimizing Gravitino for agentic AI workloads, enabling easier data discovery and access.
  • Increased Engine Support: Adding support for a wider range of open-source and commercial engines.
  • Joining the Agentic AI Foundation: Collaborating with other organizations to develop open standards and solutions for agentic AI.

The Agentic AI Foundation & the Future of Metadata

Datastrato’s decision to bring Gravitino to the Agentic AI Foundation reflects a belief that the future of AI requires a fundamental shift in how metadata is managed. Traditional BI-focused metadata approaches are insufficient for the proactive data discovery and utilization needs of AI agents. A centralized, intelligent metadata layer is crucial for enabling secure, governed, and efficient data access for agentic AI systems. As Junping Du put it, data governance is “not optional anymore.”

Conclusion

Apache Gravitino represents a significant step forward in metadata management for modern data and AI infrastructure. Its “catalog of catalogs” approach, engine-centric focus, and commitment to open standards position it as a key enabler for organizations seeking to unlock the full potential of their data in the age of AI. The project’s ongoing development and its involvement in the Agentic AI Foundation signal a commitment to shaping the future of data management for the next generation of AI applications.