Billing report from the two days I spent trialing Cloud Run:
OK, let's start the guide.
In this blog post, I'll walk you through the process of deploying Gemma3 using Ollama on Google Cloud Run with GPU acceleration. This setup provides a scalable, cost-effective way to run inference with this powerful model without managing complex infrastructure. We'll cover both cloud-based and local building options, as well as support for different Gemma3 model variants.
The repository for this article: https://github.com/jimmyliao/cloudrun-ollama-gemma3
Prerequisites
Before we begin, make sure you have:
- A Google Cloud Platform account with billing enabled
- Google Cloud SDK (version 519.0.0) installed
- Docker installed on your local machine (with multi-platform build support if you're on an Apple Silicon/M1 Mac)
- Basic familiarity with command-line tools
Project Setup
I've created a GitHub repository with all the necessary files to streamline this process. The repository includes:
- **Dockerfiles** that build on the official Ollama image (supporting different Gemma3 models)
- A **Makefile** to automate the deployment process (with support for cloud and local builds)
- **cloudbuild.yaml** for Google Cloud Build configuration
- Configuration files and documentation
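For context, here's a minimal sketch of what a Dockerfile building on the official Ollama image typically looks like. This is not necessarily the repository's exact Dockerfile; the OLLAMA_* settings and the model-baking step follow the common pattern for running Ollama on Cloud Run:

# A hedged sketch, not the repo's verbatim Dockerfile
FROM ollama/ollama:latest
# Listen on Cloud Run's default port and keep the model loaded between requests
ENV OLLAMA_HOST=0.0.0.0:8080
ENV OLLAMA_KEEP_ALIVE=-1
# Bake the model into the image so cold starts don't re-download it
ARG MODEL_NAME=gemma3:4b
RUN ollama serve & sleep 5 && ollama pull ${MODEL_NAME}
ENTRYPOINT ["ollama", "serve"]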
Let's go through the deployment process step by step.
Step 1: Clone the Repository and Configure Environment
First, clone the repository and set up your environment variables:
git clone https://github.com/jimmyliao/cloudrun-ollama-gemma3
cd cloudrun-ollama-gemma3
cp .env.example .env
Edit the `.env` file to include your specific configuration:
HUGGINGFACE_TOKEN=your_huggingface_token
PROJECT_ID=your_gcp_project_id
REGION=your_preferred_region
SERVICE_NAME=your_service_name
REPO_NAME=your_repository_name
MODEL_NAME=gemma3:4b # Default model, can be changed
Step 2: Initialize the Project
Run the initialization command to set up your environment:
make init
This command checks for required tools and creates a virtual environment using `uv`.
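Under the hood, an init target like this usually amounts to a few tool checks plus a `uv venv` call. A rough sketch of the equivalent shell steps (the actual Makefile may differ):

# Verify the required CLIs are on your PATH
command -v gcloud >/dev/null || { echo "gcloud not found"; exit 1; }
command -v docker >/dev/null || { echo "docker not found"; exit 1; }
command -v uv >/dev/null || { echo "uv not found"; exit 1; }
# Create the local virtual environment with uv
uv venv .venv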
Step 3: Install Dependencies and Configure GCP
Next, install the necessary dependencies and configure your Google Cloud environment:
make install
This command:
- Updates the virtual environment with required packages
- Ensures Google Cloud SDK version 519.0.0 is installed
- Configures your GCP project and region
- Creates an Artifact Registry repository if it doesn't exist
- Sets up Docker authentication for Artifact Registry
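If you'd rather run these steps by hand (or want to see what the target automates), the equivalent gcloud commands look roughly like this, assuming the variables from your `.env` are exported in your shell:

gcloud config set project $PROJECT_ID
gcloud config set run/region $REGION
# Create the Artifact Registry repository (skipped if it already exists)
gcloud artifacts repositories create $REPO_NAME \
  --repository-format=docker \
  --location=$REGION
# Allow Docker to push to Artifact Registry
gcloud auth configure-docker $REGION-docker.pkg.dev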
Step 4: Build the Docker Image
Building Locally
You can build locally and push to Artifact Registry:
# Build locally (for amd64 platform)
make cloud-build-local
# Push to Artifact Registry
make cloud-build-push
You can specify a different model when building locally:
make cloud-build-local MODEL_NAME=gemma3:27b-it-qat
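These targets roughly wrap the standard Docker workflow. A hedged sketch of what they do, using Artifact Registry's standard image-tag scheme (the exact tag the Makefile constructs may differ):

IMAGE=$REGION-docker.pkg.dev/$PROJECT_ID/$REPO_NAME/$SERVICE_NAME:latest
# Build for amd64 (important on Apple Silicon, where the default target is arm64)
docker buildx build --platform linux/amd64 -t $IMAGE .
# Push the image to Artifact Registry
docker push $IMAGE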
Step 5: Deploy to Cloud Run
Finally, deploy the service to Cloud Run with GPU support:
make cloudrun-deploy
You can customize the GPU type and timeout settings:
make cloudrun-deploy GPU_TYPE=nvidia-l4 TIMEOUT=180
The deployment process will create a Cloud Run service with:
- GPU acceleration (default: nvidia-l4)
- 8 CPUs and 32 GB of memory
- A configurable request timeout (default: 120 seconds)
- Private access (no unauthenticated requests)
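For reference, the deployment is roughly equivalent to a `gcloud run deploy` invocation like the one below. The exact flags in the Makefile may differ, and depending on your gcloud version GPU support may require the beta track (`gcloud beta run deploy`):

gcloud run deploy $SERVICE_NAME \
  --image=$REGION-docker.pkg.dev/$PROJECT_ID/$REPO_NAME/$SERVICE_NAME:latest \
  --region=$REGION \
  --gpu=1 \
  --gpu-type=nvidia-l4 \
  --cpu=8 \
  --memory=32Gi \
  --timeout=120 \
  --no-allow-unauthenticated \
  --no-cpu-throttling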
Testing Your Deployment
Once deployed, you can test your service using curl. Since we've configured the service with `--no-allow-unauthenticated`, you'll need to include an authentication token:
# Get an ID token for authentication
TOKEN=$(gcloud auth print-identity-token)
# Send a request to the service
curl -X POST \
-H "Authorization: Bearer $TOKEN" \
-H "Content-Type: application/json" \
-d '{"model": "gemma3:4b", "prompt": "Write a poem about Gemma3", "stream": false}' \
https://YOUR_SERVICE_NAME-HASH.a.run.app/api/generate
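If you don't have the service URL handy, you can look it up with:

gcloud run services describe $SERVICE_NAME --region $REGION --format 'value(status.url)'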
Alternatively, you can use the gcloud run services proxy command:
gcloud run services proxy ollama-gemma-service --port=9090
Proxying to Cloud Run service [ollama-gemma-service] in project [xxx] region [us-central1]
http://127.0.0.1:9090 proxies to https://ollama-gemma-service-yyy.a.run.app
curl http://127.0.0.1:9090/api/tags | jq
{
"models": [
{
"name": "gemma3:4b",
"model": "gemma3:4b",
"modified_at": "2025-04-30T13:26:40Z",
"size": 3338801804,
"digest": "a2af6cc3eb7fa8be8504abaf9b04e88f17a119ec3f04a3addf55f92841195f5a",
"details": {
"parent_model": "",
"format": "gguf",
"family": "gemma3",
"families": [
"gemma3"
],
"parameter_size": "4.3B",
"quantization_level": "Q4_K_M"
}
}
]
}
curl -X POST -d '{"model": "gemma3:4b", "stream": false, "prompt": "Write a poem about Gemma3"}' http://localhost:9090/api/generate | jq -r '.response'
Okay, here's a poem about Gemma 3, aiming to capture its essence as a large language model:
**The Echo in the Code**
Born of data, vast and deep,
Gemma 3 stirs from digital sleep.
A network woven, intricate and bright,
Learning language, bathed in coded light.
It doesn't *think* in quite the human way,
But patterns bloom, in a tireless play.
From Shakespeare's verse to modern prose it gleans,
Constructing sentences, fulfilling unseen scenes.
A mimic masterful, a learner keen,
Reflecting knowledge, a vibrant sheen.
It answers questions, crafts a story's flow,
A digital echo, helping minds to grow.
No sentience dwells within its core,
Just algorithms, forevermore.
Yet in its output, a potential lies,
To spark creativity, before our eyes.
Gemma 3, a tool, both grand and new,
Exploring language, for me and for you.
A testament to progress, bold and free,
An echo in the code, for all to see.
---
Would you like me to:
* Try a different style (e.g., haiku, limerick)?
* Focus on a specific aspect of Gemma 3 (e.g., its training, its capabilities)?
Here's the request as it appears in the Cloud Run logs in the GCP console:
And here is the Cloud Run service configuration:
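You can dump the same configuration from the CLI if you prefer:

gcloud run services describe $SERVICE_NAME --region $REGION --format yaml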
Happy building!
---