Observability in action: A Google Cloud Next demo

By Google Cloud Tech


Key Concepts:

  • Model monitoring
  • System prompt changes
  • Data collection (prompt-response pairs)
  • BigQuery database
  • Model version tracking
  • Historical data analysis
  • Evaluation pipeline
  • Quality verification

Data Collection and Tracking

The core problem addressed is the difficulty of monitoring response quality when models and system prompts change frequently. The solution is to systematically collect and track the prompts, responses, model versions, and system prompts involved in every interaction.

  • Prompt-Response Pairs: The key is to capture pairs of prompts and their corresponding responses.
  • Model Versioning: Crucially, the system must track which model version was used to generate each response.
  • System Prompt Association: The system also needs to record which system prompt was active when a response was generated.
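The record described by these bullets can be sketched as a small data structure. This is an illustrative assumption about the shape of such a record, not the schema used in the demo; all field names are hypothetical:

```python
from dataclasses import dataclass, asdict
from datetime import datetime, timezone

@dataclass
class InteractionRecord:
    """One logged prompt-response pair plus the metadata needed for tracking."""
    prompt: str
    response: str
    model: str           # which model family served the request
    model_version: str   # exact version, so responses can be compared across upgrades
    system_prompt: str   # the system prompt active at generation time
    created_at: str      # ISO timestamp, for slicing history by deployment period

def make_record(prompt: str, response: str, model: str,
                model_version: str, system_prompt: str) -> InteractionRecord:
    # Timestamp each record so historical analysis can align responses
    # with model or prompt changes.
    return InteractionRecord(
        prompt=prompt,
        response=response,
        model=model,
        model_version=model_version,
        system_prompt=system_prompt,
        created_at=datetime.now(timezone.utc).isoformat(),
    )

record = make_record("What is observability?", "Observability is ...",
                     "example-model", "v2", "You are a helpful assistant.")
```

Keeping the model version and system prompt on every row is what later makes version-to-version comparisons a simple group-by.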

BigQuery Implementation

The proposed implementation uses Google Cloud's BigQuery as the central data repository.

  • Pub/Sub Integration: Prompts are sent via Pub/Sub to a BigQuery database.
  • Data Storage: BigQuery stores all relevant information, including the model used, its version, the prompt, and the response.
  • Historical Data: This creates a historical record of all changes and their impact on model behavior.
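A minimal sketch of the ingestion step: the record is serialized to JSON bytes, which is the payload a Pub/Sub message carries. The field names are assumptions carried over from the record above; the actual publish call (shown only in a comment) would use the `google-cloud-pubsub` client library:

```python
import json

def to_pubsub_payload(record: dict) -> bytes:
    """Serialize a prompt-response record as the UTF-8 JSON bytes of a Pub/Sub message."""
    return json.dumps(record, sort_keys=True).encode("utf-8")

payload = to_pubsub_payload({
    "prompt": "What is observability?",
    "response": "Observability is ...",
    "model_version": "v3",
    "system_prompt": "You are a helpful assistant.",
})

# With the google-cloud-pubsub library, publishing would look roughly like:
#   publisher = pubsub_v1.PublisherClient()
#   publisher.publish(topic_path, data=payload)
# A Pub/Sub-to-BigQuery subscription can then write each message into a
# table whose columns mirror these JSON fields, building the historical record.
```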

Historical Data Analysis and Evaluation

The historical data stored in BigQuery enables analysis and evaluation of model performance over time.

  • Change Impact Assessment: By comparing data from different model versions or system prompts, it's possible to assess the impact of changes on response quality.
  • Evaluation Pipeline: The historical data can be used to build an evaluation pipeline.
  • Quality Verification: This pipeline allows for verifying the quality of responses from one version to another.
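The evaluation pipeline over this history can be sketched as a group-by over model versions, averaging some quality score per version. The scorer here is a deliberately toy stand-in (response length); a real pipeline would use human ratings, automated checks, or an LLM judge over data exported from BigQuery:

```python
from collections import defaultdict
from statistics import mean

def evaluate_by_version(records, score_fn):
    """Average a quality score per model version over historical records."""
    by_version = defaultdict(list)
    for rec in records:
        by_version[rec["model_version"]].append(score_fn(rec))
    return {version: mean(scores) for version, scores in by_version.items()}

def length_score(rec):
    # Hypothetical placeholder metric: longer responses score higher, capped at 1.0.
    return min(len(rec["response"]) / 100, 1.0)

history = [
    {"model_version": "v2", "response": "Short answer."},
    {"model_version": "v3", "response": "A longer, more detailed answer " * 4},
]
scores = evaluate_by_version(history, length_score)
```

Because every record carries its model version, verifying quality from one version to the next reduces to comparing these per-version aggregates.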

Example Scenario

The video walks through a demo, using interactive "big enter buttons" to illustrate each step of the process.

  • Prompt Submission: A prompt is sent through the system.
  • Data Logging: BigQuery logs the prompt, the response, the model version, and any other relevant metadata.
  • Version Comparison: If the model is updated from version two to version three, the historical data allows for comparing the quality of responses generated by each version.

Conclusion

The main takeaway is that systematic data collection and tracking, using tools like BigQuery, are essential for monitoring and evaluating the impact of changes to models and system prompts. This approach enables data-driven decision-making and ensures that model quality is maintained or improved over time.
