Welcome to the Gemma-4 DevOps Agents workspace. This repository contains nine specialized, self-hosted AI-driven DevOps/SRE agents powered by Google's Gemma 4 model. These agents are packaged as Model Context Protocol (MCP) servers to analyze, monitor, and troubleshoot infrastructure components.
This workspace is organized into nine distinct sub-agents, each tailored to a specific environment, model configuration, and serving stack:
| Sub-Agent | Purpose | Serving Engine | Target Infrastructure |
|---|---|---|---|
| Local DevOps Agent | CPU/GPU local analysis & prototyping | Ollama / vLLM | Local Docker / Workstations |
| GPU DevOps Agent (4B L4) | Serverless cloud SRE (4B model on L4 GPU) | vLLM | Google Cloud Run (us-east4) |
| GPU DevOps Agent (4B 6000) | Serverless cloud SRE (4B model on RTX 6000 GPU) | vLLM | Google Cloud Run (us-central1) |
| GPU DevOps Agent (26B 6000) | Serverless cloud SRE (26B model on RTX 6000 GPU) | vLLM | Google Cloud Run (us-central1) |
| GPU DevOps Agent (31B 6000) | Serverless cloud SRE (31B model on RTX 6000 GPU) | vLLM | Google Cloud Run (us-central1) |
| GPU DevOps Agent (6000) | Serverless cloud SRE (RTX 6000 GPU configuration) | vLLM | Google Cloud Run (us-central1) |
| GPU DevOps Agent (vLLM) | Serverless cloud SRE (L4 GPU configuration) | vLLM | Google Cloud Run (us-east4) |
| GPU DevOps Agent (31B QAT L4) | Serverless cloud SRE (31B QAT model on L4 GPU) | vLLM | Google Cloud Run (us-east4) |
| TPU DevOps Agent (26B) | Ultra-high performance TPU SRE (26B configuration) | vLLM | Google Cloud TPUs (v6e Trillium) |
| TPU DevOps Agent (31B) | Ultra-high performance TPU SRE (31B configuration) | vLLM | Google Cloud TPUs (v6e Trillium) |
| TPU DevOps Agent (12B v6e-1) | Ultra-high performance TPU SRE (12B configuration) | vLLM | Google Cloud TPUs (v6e Trillium) |
- Automated SRE Diagnostics: Fetches and reviews system, container, and Cloud Logging entries using Gemma 4 to identify root causes and generate 3-step remediation plans.
- Serving Stack Control: Built-in tools to provision, start, stop, restart, and scale your vLLM and Ollama containers or Cloud TPU Queued Resources.
- Observability Dashboards: Real-time dashboards monitoring HBM usage, Tensor Core pressure, Prometheus metrics, and service latencies.
- Model Benchmarking: Tools to run load tests and vLLM's internal benchmark suites, returning performance metrics (TTFT, throughput, P95 latency).
- Gemini CLI Integration: Custom setup instructions using a LiteLLM Proxy to route standard Gemini CLI commands directly to your private, self-hosted Gemma 4 instance.
A root Makefile is provided to manage the sub-agents collectively:
- Help / Display commands:
make all
- Install dependencies in all subdirectories:
make install
- Run tests across all agents:
make test - Lint all Python directories:
make lint
- Clean build/cache folders:
make clean
- Role: Specialized SRE for local containerized workloads.
- Inference Stack: Runs
gemma4:e2borgoogle/gemma-4-E2B-itvia local Docker (ollama/ollamaor CPU/GPU vLLM). - Documentation: See local-devops-agent/README.md and local-devops-agent/GEMINI.md.
- Role: SRE for serverless GPU-accelerated Cloud Run endpoints running the 4B configuration on L4 GPU.
- Inference Stack: Runs
google/gemma-4-E4B-itvia vLLM on Cloud Run. - Documentation: See gpu-4B-L4-devops-agent/README.md.
- Role: SRE for serverless GPU-accelerated Cloud Run endpoints running the 4B configuration on RTX 6000 GPU.
- Inference Stack: Runs
google/gemma-4-E4B-itvia vLLM on Cloud Run. - Documentation: See gpu-4B-6000-devops-agent/README.md.
- Role: SRE for serverless GPU-accelerated Cloud Run endpoints running the 26B configuration on RTX 6000 GPU.
- Inference Stack: Runs
google/gemma-4-26B-itvia vLLM on Cloud Run. - Documentation: See gpu-26B-6000-devops-agent/README.md.
- Role: SRE for serverless GPU-accelerated Cloud Run endpoints running the 31B configuration on RTX 6000 GPU.
- Inference Stack: Runs
google/gemma-4-26B-A4B-itvia vLLM on Cloud Run. - Documentation: See gpu-31B-6000-devops-agent/README.md.
- Role: Cloud-based SRE managing GPU-accelerated serverless endpoints (RTX 6000 GPU configuration).
- Inference Stack: Runs
google/gemma-4-E4B-itvia vLLM on Cloud Run. - Documentation: See gpu-6000-devops-agent/README.md.
- Role: Cloud-based SRE managing GPU-accelerated serverless endpoints (L4 GPU configuration).
- Inference Stack: Runs
google/gemma-4-E4B-itvia vLLM on Cloud Run. - Documentation: See gpu-vllm-devops-agent/README.md.
- Role: High-performance TPU SRE/DevOps managing large-scale private clusters (26B configuration).
- Inference Stack: Runs
google/gemma-4-31B-itvia vLLM on Google Cloud TPUs (v6e Trillium). - Documentation: See tpu-26B-devops-agent/README.md.
- Role: High-performance TPU SRE/DevOps managing large-scale private clusters (31B configuration).
- Inference Stack: Runs
google/gemma-4-31B-itvia vLLM on Google Cloud TPUs (v6e Trillium). - Documentation: See tpu-31B-devops-agent/README.md and tpu-31B-devops-agent/GEMINI.md.
- Role: Serverless cloud SRE leveraging the 31B QAT configuration on L4 GPU.
- Inference Stack: Runs
google/gemma-4-31B-it-qat-w4a16-ctvia vLLM on Cloud Run. - Documentation: See gpu-31B-qat-L4-devops-agent/README.md.
- Role: High-performance TPU SRE/DevOps managing clusters (12B configuration).
- Inference Stack: Runs
google/gemma-4-12B-itvia vLLM on Google Cloud TPUs (v6e Trillium). - Documentation: See tpu-12B-v6e1-devops-agent/README.md and tpu-12B-v6e1-devops-agent/GEMINI.md.
When deploying to Google Cloud or Hugging Face, secure credentials using:
- Hugging Face Access Token: Saved locally or to Google Secret Manager.
- Application Default Credentials (ADC): Set up using GCP credentials helper scripts.
Google Cloud credits are provided for this project.
#AgenticArchitect #GoogleAntigravity