This pattern showcases a real-time conversational RAG agent powered by Google Gemini. The agent handles audio, video, and text interactions while leveraging tool calling with a vector DB for grounded responses.
Key components:
- Python Backend (in the `app/` folder): A production-ready server built with FastAPI and google-genai that features:
  - Real-time bidirectional communication via WebSockets between the frontend and the Gemini model
  - Integrated tool calling with vector database support for contextual document retrieval
  - Production-grade reliability with retry logic and automatic reconnection capabilities
  - Deployment flexibility supporting both AI Studio and Vertex AI endpoints
  - A feedback logging endpoint for collecting user interactions (a minimal sketch follows this list)
- React Frontend (in the `frontend/` folder): Extends the Multimodal Live API Web Console with added features such as custom URLs and feedback collection.
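The feedback logging endpoint mentioned above can be illustrated with a minimal sketch. The `/feedback` route, the payload fields, and the use of plain logging as the sink are assumptions for illustration; the actual server in `app/` may name and structure these differently:

```python
import logging

from fastapi import FastAPI
from pydantic import BaseModel

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("feedback")

app = FastAPI()


class Feedback(BaseModel):
    """Hypothetical payload shape for feedback sent by the frontend."""

    run_id: str               # identifies the conversation being rated
    score: int                # e.g. 1 = thumbs up, 0 = thumbs down
    text: str | None = None   # optional free-form comment


@app.post("/feedback")
async def collect_feedback(feedback: Feedback) -> dict:
    # A production deployment would typically forward this to a durable sink
    # (Cloud Logging, BigQuery, ...); plain logging keeps the sketch small.
    logger.info("feedback received: %s", feedback)
    return {"status": "ok"}
```

Served with uvicorn, the frontend can POST JSON to `/feedback` whenever a user rates a response.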
Once both the backend and frontend are running, click the play button in the frontend UI to establish a connection with the backend. You can now interact with the Multimodal Live Agent! Try asking a question such as "Using the tool you have, define Governance in the context of MLOps" to prompt the agent to ground its answer in the documentation it was provided with.
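To make the tool-calling path concrete, here is a minimal sketch of a vector-store-backed retrieval function that the model could invoke to ground an answer like the one above. The document chunks, the `embed_text` placeholder, and the in-memory cosine-similarity search are illustrative assumptions; the actual backend performs retrieval through google-genai tool calling against a real vector database:

```python
import numpy as np

# Hypothetical corpus: chunks of the MLOps documentation given to the agent.
DOC_CHUNKS = [
    "Governance in MLOps covers policies, auditability, and access control ...",
    "Continuous training pipelines retrain models when the data drifts ...",
]


def embed_text(text: str) -> np.ndarray:
    """Placeholder embedding function (illustrative only).

    The real backend would call an embedding model and store the vectors
    in a vector database rather than deriving them from a hash.
    """
    rng = np.random.default_rng(abs(hash(text)) % (2**32))
    return rng.random(768)


# Precompute chunk embeddings: an in-memory stand-in for the vector DB.
DOC_EMBEDDINGS = np.stack([embed_text(chunk) for chunk in DOC_CHUNKS])


def retrieve_docs(query: str, top_k: int = 2) -> list[str]:
    """Return the chunks most similar to the query by cosine similarity.

    This is the kind of function the Gemini model invokes via tool calling
    so its answers stay grounded in the provided documentation.
    """
    q = embed_text(query)
    sims = DOC_EMBEDDINGS @ q / (
        np.linalg.norm(DOC_EMBEDDINGS, axis=1) * np.linalg.norm(q)
    )
    top = np.argsort(sims)[::-1][:top_k]
    return [DOC_CHUNKS[i] for i in top]
```

In the pattern itself, an equivalent retrieval function is exposed to the model through tool calling, so the model decides when to call it and receives the retrieved chunks as grounding context for its reply.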
Explore these resources to learn more about the Multimodal Live API and see examples of its usage:
- Project Pastra: A comprehensive developer guide for the Gemini Multimodal Live API.
- Google Cloud Multimodal Live API demos and samples: A collection of code samples and demo applications leveraging the Multimodal Live API in Vertex AI.
- Gemini 2 Cookbook: Practical examples and tutorials for working with Gemini 2.
- Multimodal Live API Web Console: An interactive React-based web interface for testing and experimenting with the Gemini Multimodal Live API.
This pattern is under active development. Key areas planned for future enhancement include:
- Observability: Implementing comprehensive monitoring and tracing features.
- Load Testing: Integrating load testing capabilities.