Installation on GCP Cloud VMs#
Caution
This entire document is only intended for System Administrators or Infrastructure Engineers. Do not attempt to use this information without proper knowledge and understanding of the GCP tenancy. If you need assistance with cloud infrastructure deployment, please consult your internal Infrastructure team before contacting biomodal support.
Danger
The Terraform configurations provided in this documentation are examples only and must not be applied to production environments without thorough review and customization by an experienced Infrastructure Engineer. These examples may not meet your organization’s security, compliance, networking, or operational requirements. Always review and adapt the infrastructure code to your specific needs before deployment.
Minimal GCP Terraform Configuration#
This section provides a minimal Terraform configuration for setting up basic GCP infrastructure with Java and Docker. This configuration focuses purely on infrastructure provisioning and does not include any application-specific logic.
What This Creates#
Infrastructure#
GCP VM: Ubuntu 22.04 LTS
Storage Bucket: Optional GCS bucket for data storage
Artifact Registry: Docker container registry for biomodal images
Static IP: External IP address for the VM
IAM: Service account with necessary permissions
Firewall: SSH access via Identity-Aware Proxy (IAP)
Software Installed#
Java 21: OpenJDK 21 for running Java applications
Docker: Container runtime for running containerized applications
Basic utilities: ca-certificates, curl, gnupg, lsb-release, wget, unzip
The installation script follows a straightforward approach, installing all required packages and software directly to ensure a complete and consistent environment setup.
Download Complete Configuration#
You can download all the Terraform configuration files as a single ZIP archive:
Download GCP Terraform Configuration (./zip_files/gcp_terraform.zip)
This ZIP file contains all necessary files to aid you in deploying the GCP infrastructure.
Caution
Please ensure you review and understand the Terraform configuration files before deploying to your environment.
Configuration Files#
The Terraform configuration consists of the following files:
main.tf - Core infrastructure configuration including VM, storage, and artifact registry
variables.tf - Input variables
outputs.tf - Output values including the artifact registry URL
scripts/install_script.sh - VM startup script to install software and configure the environment
terraform.tfvars.example - Example variables file
Security Considerations#
IAP / SSH Access#
The VM is accessed via Identity-Aware Proxy (IAP) tunnelling rather than a direct public SSH port. This reduces exposure, but you must ensure:
Required IAP APIs are enabled
The service account or user identities have IAP TCP forwarding permissions
Firewall rules do not inadvertently expose port 22 publicly
Docker Socket Permissions#
The VM startup script in scripts/install_script.sh includes the following command:
sudo chmod 666 /var/run/docker.sock
Warning
Security Impact: Setting permissions to 666 on the Docker socket grants world-readable and world-writable access, which is a significant security risk. Any user or process on the system can interact with Docker, potentially leading to privilege escalation and container breakouts.
Recommendation: For production environments, consider removing this chmod command and rely exclusively on Docker group membership to control access. Users in the docker group will be able to interact with Docker after logging out and back in, or by running newgrp docker. Only use broader permissions if you have specific requirements that necessitate immediate Docker access without re-authentication, and document why this is necessary for your use case.
For more information on Docker security, see Docker security best practices.
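As a concrete sketch of the group-based approach recommended above, the following checks whether the current user already has docker-group access before resorting to broader socket permissions:

```shell
# Sketch: prefer docker group membership over chmod 666 on the socket.
# Check whether the current user is already in the docker group.
if id -nG "${USER:-$(whoami)}" | tr ' ' '\n' | grep -qx docker; then
  echo "docker group membership OK"
else
  echo "not in docker group; run: sudo usermod -aG docker \$USER, then re-login"
fi
```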
General Cloud installation requirements#
Cloud-native software will be utilised on each cloud platform to set up the complete pipeline environment. You will also need to install the Google Cloud CLI to administer GCP resources, unless it is already provided.
Cloud permissions#
We recommend that a least privilege approach is taken when providing users with permissions to create cloud resources.
The cloud specific examples below demonstrate the minimum required permissions to bootstrap resources for GCP environments.
We recommend using GCP’s predefined roles. These roles are created and maintained by Google, so you do not need to create custom roles.
Below are the recommended predefined GCP roles required to create the cloud resources for running this CLI.
| Role name | Purpose | Required |
|---|---|---|
| roles/storage.admin | If a new bucket is required | No (unless a new bucket is specified) |
| roles/artifactregistry.writer | Create the required artifact registry | Yes |
| roles/iam.serviceAccountCreator | Create the required service account(s) | Yes |
| roles/compute.admin | Create the required compute resources and instance template | Yes |
| roles/iap.admin | Allow IAP access to VMs | Yes |
| roles/resourcemanager.projectIamAdmin | Assign project-wide roles to service accounts | Yes |
| roles/storage.legacyBucketWriter | Allow read/write permission on an existing bucket | No (required if using an existing bucket) |
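As an illustration, any of the roles listed above can be granted with gcloud. The project ID, member, and role below are placeholder values you must replace; the sketch guards against gcloud being absent rather than assuming a configured environment:

```shell
# Sketch: grant a predefined role to a user or service account (placeholder values).
PROJECT_ID="your-gcp-project-id"   # replace with your project ID
MEMBER="user:admin@example.com"    # replace with the user or service account
ROLE="roles/compute.admin"         # one of the roles from the table above
if command -v gcloud >/dev/null 2>&1; then
  gcloud projects add-iam-policy-binding "$PROJECT_ID" \
    --member="$MEMBER" --role="$ROLE" || echo "binding failed; check auth and project"
else
  echo "gcloud not installed; see the Google Cloud CLI documentation"
fi
```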
Usage of this Terraform Configuration#
Prerequisites#
Terraform >= 1.0
GCP credentials configured (gcloud auth application-default login)
GCP project with billing enabled
VPC network and subnet (can use default)
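A quick pre-flight check for these prerequisites might look like the following (a sketch: it only verifies the tools are on PATH, not their versions or authentication state):

```shell
# Sketch: check that the required CLI tools are installed.
for tool in terraform gcloud; do
  if command -v "$tool" >/dev/null 2>&1; then
    echo "$tool: found"
  else
    echo "$tool: MISSING"
  fi
done
```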
Setup#
# Download and extract the Terraform configuration
# (if using the ZIP file from this documentation)
unzip gcp_terraform.zip
cd gcp_terraform/
# Copy example variables
cp terraform.tfvars.example terraform.tfvars
# Edit terraform.tfvars with your values
vim terraform.tfvars
# Initialize Terraform
terraform init
# Review the plan (Ensure it looks correct, and no errors or changes you don't expect)
terraform plan -var-file=terraform.tfvars -out=tfplan
# Apply the configuration using the saved plan (creates resources, you will be prompted to confirm)
terraform apply tfplan
Warning
Destroying Infrastructure
The commands below will permanently delete all resources created by Terraform, including:
GCP VM instances and their data
GCS buckets and all stored data (if bucket_force_destroy is enabled)
Artifact Registry repositories and container images
Service accounts and IAM bindings
Firewall rules and network configurations
Static IP addresses
This action is irreversible. Always:
Backup any important data before destroying resources
Carefully review the destroy plan output to confirm which resources will be deleted
Ensure you are working in the correct GCP project
Consider commenting out or removing the bucket_force_destroy setting to prevent accidental data loss
# Destroy the configuration when no longer needed
terraform plan -destroy -var-file=terraform.tfvars -out=destroyplan
# Be sure to review the plan output carefully to ensure you understand which resources will be destroyed.
terraform apply destroyplan
Terraform Outputs#
After terraform apply, you’ll see the following output (example values using the europe-west2 region):
+ artifact_registry_url = "europe-west2-docker.pkg.dev/your-gcp-project-id/your-gcp-project-vm"
+ bucket_name = "your-gcp-project-vm"
+ bucket_url = (known after apply)
+ service_account_name = "your-gcp-project-vm@your-gcp-project-id.iam.gserviceaccount.com"
+ ssh_command = "gcloud compute ssh --zone=europe-west2-a --tunnel-through-iap your-gcp-project-vm --project=your-gcp-project-id"
+ vm_external_ip = (known after apply)
+ vm_hostname = "your-gcp-project-vm"
+ vm_name = "your-gcp-project-vm"
Important
Please save the ssh_command output as you will need this to connect to the VM.
The service account email may also be needed for configuring other services.
The artifact registry URL is needed for configuring where to store the biomodal container images using the biomodal init command for the biomodal CLI.
The bucket URL is needed for configuring the biomodal CLI if using this bucket for data storage. Please remember to add the gs:// prefix if using bucket URLs.
Warning
After you have run biomodal init, you have to make a small change to the nextflow_override.config file. Please manually edit this file on the VM after connecting, and change the following placeholder value for the VM name <<your-gcp-project-vm>>:
serviceAccountEmail = "<<your-gcp-project-vm>>@example-gcp-project-id.iam.gserviceaccount.com"
Please ensure you change the <<your-gcp-project-vm>> placeholder to your actual GCP VM name from the Terraform outputs you recorded above.
You must also ensure that the referenced network and subnet exist in your GCP project. The initial configuration assumes the default network and subnet exist.
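If you prefer not to edit the file by hand, the placeholder substitution can be scripted with sed. The snippet below is a sketch demonstrated on a sample line (the VM name example-vm is a placeholder); on the VM you would run the same sed expression with -i against nextflow_override.config:

```shell
# Sketch: replace the <<your-gcp-project-vm>> placeholder with your VM name.
# On the VM, run: sed -i "s/<<your-gcp-project-vm>>/${VM_NAME}/" nextflow_override.config
VM_NAME="example-vm"   # replace with your VM name from the Terraform outputs
line='serviceAccountEmail = "<<your-gcp-project-vm>>@example-gcp-project-id.iam.gserviceaccount.com"'
echo "$line" | sed "s/<<your-gcp-project-vm>>/${VM_NAME}/"
# → serviceAccountEmail = "example-vm@example-gcp-project-id.iam.gserviceaccount.com"
```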
Connect to VM#
After successfully applying the Terraform configuration, please allow a few minutes for the VM to complete its startup script installation, then you can connect to the VM using SSH via IAP:
# SSH via IAP (use output from terraform apply, this is just an example)
gcloud compute ssh --zone=your-gcp-region-zone --tunnel-through-iap your-gcp-project-vm --project=your-gcp-project-id
Configuration Variables#
Required Variables#
project_id = "your-gcp-project-id" # GCP project ID
region = "europe-west2" # GCP region
network_name = "default" # VPC network name
subnet_name = "default" # Subnet name
vm_name = "your-gcp-project-vm" # VM name
Optional Variables#
machine_type = "n2-standard-2" # VM machine type
disk_size_gb = "100" # Boot disk size
label_key = "biomodal" # Resource label key
label_value = "cli-production" # Resource label value
use_existing_bucket_url = "gs://..." # Use existing bucket
bucket_force_destroy = false # Allow bucket destruction
Outputs#
After terraform apply, you’ll get:
VM external/internal IP addresses
Storage bucket URL and name
Artifact registry URL for container images
Service account email
SSH command for connecting
What’s NOT Included#
This minimal configuration intentionally excludes:
Application-specific software installation
Pipeline or workflow management tools
Custom configuration files or templates
File copying or deployment logic
Version-specific software management
Design Philosophy#
This configuration follows the principle of separation of concerns:
Infrastructure: Terraform handles VM, storage, and networking
Platform: Basic runtime dependencies (Java, Docker)
Applications: Should be deployed separately after infrastructure is ready
This approach makes the infrastructure:
Reusable: Can be used for different applications
Maintainable: Clear separation between infrastructure and application concerns
Testable: Infrastructure can be validated independently
Flexible: Applications can be deployed using different methods (Docker, packages, etc.)
Next Steps#
After the infrastructure is ready:
Verify base runtime (Java & Docker):
java -version
docker --version
sudo systemctl status docker
Install the biomodal CLI.
(Optional) Configure monitoring and logging (e.g. Cloud Logging, Cloud Monitoring alerts).
Note: The terraform installation script follows a direct installation approach. Package managers like apt-get handle duplicate installations gracefully, so the script can be run multiple times safely with minimal overhead.
Cost Optimization Strategies#
Instance Cost Optimization#
Use smaller machine types (e.g. e2-standard-2 or n2-standard-2) for the orchestrator if load is light
Stop the VM when not launching pipelines; restart only when needed
Consider committed use discounts for long-term predictable usage
Evaluate preemptible VMs for non-critical orchestration (not recommended for persistent state)
Storage Cost Optimization#
Use lifecycle management on GCS buckets to transition older objects to colder storage classes
Delete temporary Nextflow work directories after successful pipeline completion
Consolidate log files and compress large text outputs
Avoid storing large intermediate artifacts long term; regenerate if cheaper
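As an illustrative example of lifecycle management, the following policy (hypothetical values) transitions Standard-class objects older than 90 days to Coldline; it can be applied with `gsutil lifecycle set lifecycle.json gs://your-bucket`. Adjust the age and storage class to your retention needs:

```json
{
  "rule": [
    {
      "action": {"type": "SetStorageClass", "storageClass": "COLDLINE"},
      "condition": {"age": 90, "matchesStorageClass": ["STANDARD"]}
    }
  ]
}
```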
Monitoring and Cost Alerts#
Set up Cloud Billing budgets with alert thresholds
Use Cloud Monitoring dashboards to track CPU, memory, and storage usage
Tag / label resources (e.g. environment, owner) for cost attribution
Periodically review idle resources (static IPs, unused buckets, Artifact Registry storage)
Low Quotas / Service Limits#
Request quota increases early for CPU, persistent disk, and Batch (if used) in growth scenarios
In trial accounts, plan runs within current quotas or request increases via the console
Contact support@biomodal.com for guidance on resource profiles under constrained quotas
Installation Script Technical Details#
The VM startup script (scripts/install_script.sh) implements a straightforward installation approach to ensure a complete and consistent environment setup.
Direct Package Installation#
The installation script installs all required packages directly using a simple, reliable approach:
Installation Process
System update: Updates package repositories with apt-get update
Direct installation: Installs all required packages in a single apt-get install command
Reliable execution: Uses apt-get’s built-in handling of already-installed packages
Simple approach: No complex checking logic, ensuring consistent behavior
Required System Packages
# Packages installed by the script:
ca-certificates # SSL/TLS certificates for secure connections
curl # Command line tool for data transfer
gnupg # GNU Privacy Guard for encryption/signing
lsb-release # Linux Standard Base release information
openjdk-21-jdk # Java Development Kit version 21
wget # Network downloader
unzip # Archive extraction utility
Direct Docker Installation#
Docker installation follows the same straightforward approach:
Docker Installation Process
Direct installation: Installs Docker using apt-get install docker.io
Service configuration: Starts and enables the Docker service
User permissions: Adds the current user to docker group for non-root access
Session permissions: Sets appropriate socket permissions for immediate access
Docker Configuration (Automatically Applied)
The script configures Docker for proper operation:
# Service management
sudo systemctl start docker # Start Docker service
sudo systemctl enable docker # Enable Docker on boot
# User permissions (username detection with fallback)
ACTUAL_USER=$(who am i | awk '{print $1}')
ACTUAL_USER=${ACTUAL_USER:-${USER:-$(whoami)}}  # fall back to $USER, then whoami
sudo usermod -aG docker "$ACTUAL_USER"  # Add user to docker group
sudo chmod 666 /var/run/docker.sock  # Socket access for current session
Reliable User Detection
The script uses multiple fallback methods to identify the correct username:
Primary: who am i command to get the actual logged-in user
Fallback 1: $USER environment variable
Fallback 2: whoami command output
Safety check: Skips the root user to avoid security issues
Installation Feedback
The script provides clear logging throughout the process:
Reports the start of package installation
Confirms when package installation is completed
Shows Docker installation progress
Displays user permission configuration
Confirms when Docker installation is completed
Benefits of Direct Installation
Simplicity: Straightforward, easy-to-understand process
Reliability: Uses standard package manager behavior for duplicate handling
Consistency: Ensures the same installation process every time
Terraform compatibility: Simple script structure works well with Terraform templatefile()
Minimal complexity: No conditional logic reduces potential failure points
Troubleshooting#
Common Issues and Solutions#
VM Creation Failures
If VM creation fails, check:
GCP APIs: Ensure all required APIs are enabled in your project
Quotas: Verify you have sufficient compute quotas in the target region
Network: Confirm the specified VPC network and subnet exist
Permissions: Validate your service account has necessary IAM roles
Installation Script Issues
If the startup script fails:
# Check startup script logs
sudo journalctl -u google-startup-scripts.service
# View cloud-init logs
sudo cat /var/log/cloud-init-output.log
Docker Permission Issues
If Docker commands fail with permission errors or you see user-related errors during installation:
# Check if user is in docker group
groups $USER | grep docker
# If not in docker group, add manually
sudo usermod -aG docker $USER
# Check Docker socket permissions
ls -la /var/run/docker.sock
# Fix socket permissions if needed
sudo chmod 666 /var/run/docker.sock
# Re-login or start a new session to apply group membership
exit
# SSH back in, or:
newgrp docker
# Alternative: restart the session
sudo su - $USER
# Test Docker access
docker run hello-world
Common Docker Installation Issues
“user does not exist” errors: The installation script now uses reliable user detection methods
Permission denied on socket: The script sets appropriate socket permissions (666)
Group membership not applied: May require logout/login or using newgrp docker
Package Installation Failures
If specific packages fail to install:
# Update package lists
sudo apt-get update
# Try installing individual packages
sudo apt-get install -y package-name
# Check for held packages
sudo apt-mark showhold
Terraform State Issues
If Terraform operations fail:
# Refresh state
terraform refresh
# Import existing resources if needed
terraform import google_compute_instance.vm projects/PROJECT/zones/ZONE/instances/INSTANCE
# Plan with detailed output
terraform plan -detailed-exitcode
Terraform Plan File Best Practices
Always use the -out option when planning to ensure consistent deployments:
# Create a plan file to guarantee exact execution
terraform plan -var-file=terraform.tfvars -out=tfplan
# Apply the exact planned changes
terraform apply tfplan
# For destroy operations, also use plan files for safety
terraform plan -destroy -var-file=terraform.tfvars -out=destroyplan
# Apply the exact destruction plan
terraform apply destroyplan
This approach prevents drift between what you reviewed in the plan and what gets applied, which is especially important in production environments where infrastructure may change between the plan and apply steps. For destroy operations, it ensures you know exactly which resources will be deleted before proceeding.
GCP Services and Resources#
Project Services#
The Terraform configuration automatically enables the following GCP services required for the biomodal pipeline infrastructure:
compute.googleapis.com - Compute Engine API for VM creation and management
batch.googleapis.com - Batch API for workload processing
artifactregistry.googleapis.com - Artifact Registry API for container image storage
storage-api.googleapis.com - Cloud Storage API for data storage
iap.googleapis.com - Identity-Aware Proxy for secure SSH access
iam.googleapis.com - Identity and Access Management API
oslogin.googleapis.com - OS Login API for SSH key management
cloudresourcemanager.googleapis.com - Resource Manager API for project management
These services are enabled automatically during the Terraform deployment, ensuring all necessary APIs are available for the biomodal infrastructure to function properly.
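If you want to pre-enable these APIs outside Terraform (for example when validating a project before deployment), the same list can be enabled with gcloud. This is a sketch assuming an authenticated Google Cloud CLI; the guard simply avoids a hard failure where gcloud is absent:

```shell
# Sketch: enable the required GCP service APIs manually.
APIS="compute.googleapis.com batch.googleapis.com artifactregistry.googleapis.com storage-api.googleapis.com iap.googleapis.com iam.googleapis.com oslogin.googleapis.com cloudresourcemanager.googleapis.com"
if command -v gcloud >/dev/null 2>&1; then
  # $APIS is intentionally unquoted so it expands to one argument per API
  gcloud services enable $APIS || echo "enable failed; check auth and project"
else
  echo "gcloud not installed; Terraform will enable these APIs during apply"
fi
```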
Artifact Registry#
This resource creates a container registry in GCP using Artifact Registry (Google Container Registry is now deprecated), which will store biomodal container images. These containers are Docker images containing open-source and proprietary software required for the biomodal pipeline.
Registry Configuration
Repository Name: Uses the vm_name variable to create a uniquely named repository
Format: Docker repository for container images
Location: Created in the same region as the VM for optimal performance
Mode: Standard repository mode for general container storage
Container Image Management
The artifact registry serves as the central repository for:
Pipeline containers: Core biomodal pipeline processing images
Tool containers: Supporting bioinformatics tools and utilities
Custom images: Any custom-built containers for specific workflows
Integration with biomodal CLI
The registry URL is automatically configured and made available to the biomodal CLI through:
Terraform outputs: Registry URL provided as artifact_registry_url
CLI configuration: Can be referenced during biomodal init setup
Pipeline execution: Nextflow automatically pulls containers from this registry
Access and Permissions
The service account created by Terraform has the necessary permissions to:
Push container images to the registry
Pull container images during pipeline execution
Manage repository contents and metadata
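Before pushing or pulling images manually, Docker needs credentials for the registry endpoint. A sketch, assuming the europe-west2 region used in the earlier example outputs:

```shell
# Sketch: configure Docker credentials for the Artifact Registry endpoint.
REGION_HOST="europe-west2-docker.pkg.dev"   # match your registry region
if command -v gcloud >/dev/null 2>&1; then
  gcloud auth configure-docker "$REGION_HOST" --quiet \
    || echo "configure-docker failed; check gcloud authentication"
else
  echo "gcloud not installed; install the Google Cloud CLI first"
fi
```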
Storage Bucket#
The Terraform configuration creates a Google Cloud Storage (GCS) bucket for data storage. This bucket serves as:
Input storage: Store input files such as FASTQ files generated from sequencing
Working directory: Pipeline intermediate files and temporary data
Output storage: Final analysis results and generated reports
Bucket Configuration
Naming: Automatically named based on the vm_name variable with a unique suffix
Location: Created in the same region as the VM for optimal performance
Access control: Configured with uniform bucket-level access for simplified permissions
Optional: Can be disabled by setting use_existing_bucket_url to use an existing bucket instead
Bucket URL Access
The bucket URL is provided as a Terraform output, making it available for application configuration after the VM is deployed. The installation script focuses on system setup only, with bucket configuration handled separately during application setup.
Orchestrator VM#
Resource to create a virtual machine (VM) on GCP. This VM is where the biomodal duet pipeline will run, with Nextflow acting as an orchestrator, and it provides the session persistence that a laptop does not. Your orchestration VM only needs to run when launching pipelines. You can stop it from the cloud provider's console when inactive and start it again when you need to launch another analysis run.
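Stopping and starting the orchestrator VM can also be done from the command line instead of the console. A sketch using the placeholder names from the earlier examples:

```shell
# Sketch: stop the orchestrator VM when idle, start it for the next run.
VM="your-gcp-project-vm"        # replace with your VM name
ZONE="europe-west2-a"           # replace with your zone
PROJECT="your-gcp-project-id"   # replace with your project ID
if command -v gcloud >/dev/null 2>&1; then
  gcloud compute instances stop "$VM" --zone="$ZONE" --project="$PROJECT" || echo "stop failed"
  gcloud compute instances start "$VM" --zone="$ZONE" --project="$PROJECT" || echo "start failed"
else
  echo "gcloud not installed"
fi
```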
Service Account#
A service account is an account used by a non-human process or application to authenticate with GCP and, in turn, with other services or APIs. Service accounts are often used with services such as Google Cloud Storage and Compute Engine virtual machines to provide secure access to your resources.
Note
The cloud VM will be created with a public IP to allow for data ingress and egress. Please update this as required after software install, authentication, configuration, and testing.
Low GCP CPU quotas#
If using GCP with limited CPU quotas (e.g. free trial accounts), please contact support@biomodal.com as a custom resource profile may be needed. You may also need to request quota increases from GCP support via the GCP console.