duet software installation and running guide#

For analysis of libraries sequenced on an Illumina platform and prepared using:

  • biomodal duet multiomics solution +modC

  • biomodal duet multiomics solution evoC

This documentation is compatible with duet pipeline v1.4

Warning

For Research Use Only. Not for use in diagnostic procedures.

Getting started with the biomodal duet pipeline#

Overview#

The biomodal duet multiomics pipeline is a bioinformatics workflow designed to analyse FASTQ files produced by next generation sequencing of the libraries generated using biomodal duet multiomics solutions. Before running the pipeline, you need to install the biomodal Command Line Interface (CLI) on a cloud platform or High-Performance Computing (HPC) cluster.

About this guide#

This guide walks you through:

  • Downloading the required files

  • Installing the CLI

  • Preparing your compute environment

  • Using the CLI

  • Importing the pipeline

  • Updating the pipeline

Important

The biomodal duet multiomics pipeline runs in a Linux environment and we recommend the use of a recent version of Ubuntu or CentOS/Rocky/RHEL. It is recommended that setup and installation be performed by your system administrator (i.e., the person responsible for managing your organisation’s IT systems.)

Sections which need to be completed by a system administrator will be shown by the following:

Caution

This section contains information intended for system administrators

Once the setup is complete, you’ll be able to run your analyses with minimal technical involvement. If you need support with any of the steps or installation, please contact us at support@biomodal.com.

What’s included in the biomodal duet pipeline#

Pipeline components

  • Standard and bespoke trimming of duet +modC and duet evoC FASTQ files

  • FASTQ resolution to convert raw paired-end duet +modC and duet evoC reads into resolved, error-suppressed 4-base genomic reads with accompanying epigenomic information

  • Reference genome and control alignment using BWA_MEM

  • Lane-merging of aligned BAM files

  • Forking of aligned genome reads and control reads into separate BAM files and analysis pathways

  • Deduplication of the genome and long control BAM files

Pipeline outputs

  • Modified cytosine quantification in CpG context

  • Optional modified cytosine quantification in CHG and CHH contexts

  • Germline variant calling with optional joint variant calling, somatic variant calling, and allele-specific methylation calling

  • Genetic accuracy analysis using controls

  • Analysis and generation of quality metrics associated with the genome-aligned reads and the control-aligned reads

  • Generation of summary reports in Excel and HTML formats

About the pipeline and Command Line Interface (CLI)#

Important

The biomodal CLI and duet pipeline is not supported on Windows platforms. We recommend you use a recent version of Ubuntu or CentOS/Rocky/RHEL.

The biomodal pipeline and CLI utilise Nextflow as the orchestration tool that will leverage the capabilities of your compute platform to process your data through the duet pipeline. Nextflow acts as an orchestrator, performing the following tasks:

  • Launching Virtual Machines or HPC nodes

  • Copying input files to the virtual machines or HPC nodes

  • Downloading and running container images with appropriate software dependencies

  • Executing analyses on the virtual machines or HPC nodes

  • Transferring outputs from analyses to local or cloud storage

  • Organising output files into a convenient directory structure

  • Coordinating the channelling of outputs from one stage of analysis to the next

The following instructions are intended to be executed at the command line with Bash on a Linux platform, or via a Cloud Shell in a browser. Bash commands follow the structure command [OPTIONS] arguments. If you prefer to use the graphical user interface (GUI) of your cloud provider console in the browser, searching for the equivalent command line instructions will lead you to the platform-specific documentation describing the steps to take in the relevant cloud console.
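For example, a typical invocation of the biomodal CLI follows this structure (the paths here are placeholders; the analyse subcommand and its options are described later in this guide):

biomodal analyse --input-path /data/my_run --output-path /data/my_run --mode 5bp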

Sharing event & metric data with biomodal#

To help ensure the best possible user experience and fast technical support, we highly recommend enabling event and metrics sharing. This opt-in feature allows biomodal to better understand how the CLI is being used in real-world environments and to proactively offer support, improvements, and bug fixes. By sharing usage and performance metrics, you’re helping our support and development teams:

  • Identify and troubleshoot issues more quickly

  • Deliver targeted improvements and updates

  • Monitor pipeline health and performance across environments

  • Benchmark and tune future versions of the software

If you opt in, the CLI will send non-sensitive metadata back to biomodal. This includes:

  • Installation and usage events e.g. when the CLI is downloaded, installed, or run

  • Pipeline run status and configuration, such as success/failure, pipeline version, resource usage, and execution times

What we do NOT collect:

  • FASTQ, BAM, VCF, or any input/output files

  • Personally identifiable information related to the pipeline sample data

We do not share any data with third parties and all event information is registered against your user account within biomodal’s internal systems, accessible by our internal support team. For more information, please see Event codes shared with biomodal.

Enabling event and metric sharing helps us to help you — the more we understand about how the pipeline is used in different environments, the faster we can respond to issues and improve future releases. Please let your system administrator know whether to enable event sharing as part of the installation.

How to opt in

During the CLI installation process (biomodal init), you will be asked:

Would you like to share the pipeline metrics report at
the end of all successful analysis runs with biomodal? (1-2)

1) Yes
2) No
#?

Type 1 to opt in. The CLI will automatically handle event registration for future runs.

How to opt out

You can decline metrics sharing at two points:

  1. During biomodal init, select “2” when prompted

  2. Manually, by editing your config.yaml file and updating the following line:

share_metrics: false

If you have any questions or concerns about event and metric sharing, please contact support@biomodal.com.

Support#

Feel free to contact us at support@biomodal.com. If your inquiry is related to the CLI or duet pipeline software, please include the output of the biomodal info command in your inquiry.

Downloading the installation script#

Before installing the CLI, you need to download the biomodal installation package. This section helps you:

  • Choose the correct script for your system (HPC or Cloud)

  • Confirm software requirements

  • Download and unzip the installation files

Important

The following section involves cloud infrastructure setup or requires HPC admin privileges so must be performed by a system administrator. Once the setup is complete, the analyses can be run with minimal technical involvement.

If you need help with any of the steps or installation, please contact us at support@biomodal.com.

Choose your environment#

Caution

This section contains information intended for system administrators

The biomodal duet pipeline supports two environments:

  • HPC clusters (Ubuntu, CentOS/RHEL supported)

  • Cloud platforms (AWS, GCP, Azure)

Use the flowchart below to help you decide which installation method is right for you:

Flowchart to help you choose the right installation method

Check software requirements & permissions#

Caution

This section contains information intended for system administrators

HPC installation requirements#

Make sure your HPC system has the following tools installed:

  • bash: Command execution. Version 4+.

  • Java (more information here): Required by Nextflow. Version 8+; please check that this is supported by your version of Nextflow.

  • Nextflow (more information here): Pipeline orchestrator. Version 21.06 onwards.

  • Singularity or Apptainer: Container runtime. Please check that this is supported by your version of Nextflow. The biomodal CLI expects to find Singularity in $PATH, so please make sure that Apptainer installations include a symlink to the Apptainer binary named singularity.

  • Google Cloud CLI (more information here): Software download and authentication. This should be a recent version that supports the gcloud storage command, including the required gcloud-crc32c libraries.

  • tar, unzip: File compression tools.

  • rsync: File transfer.

  • curl: Software download.

  • jq: JSON parsing.

  • Other OS support packages: gnupg, lsb-release, apt-transport-https, ca-certificates.

Some local Conda installations may require additional Google SDK components to ensure the integrity of data being downloaded. These can be installed using:

gcloud components install gcloud-crc32c
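As a quick, hedged sanity check before installing, the following sketch confirms the key dependencies are on $PATH and, if only Apptainer is present, creates the singularity symlink the CLI expects (the symlink location /usr/local/bin is an assumption; adjust to a directory on $PATH that you control):

# Quick sanity check: confirm the required tools are visible on $PATH
for tool in bash java nextflow singularity gcloud tar unzip rsync curl jq; do
  command -v "$tool" >/dev/null 2>&1 && echo "found: $tool" || echo "MISSING: $tool"
done

# If only Apptainer is installed, expose it to the CLI as 'singularity'
# (the target directory is an example; use any directory on $PATH you control)
command -v singularity >/dev/null 2>&1 || \
  sudo ln -s "$(command -v apptainer)" /usr/local/bin/singularity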

Cloud installation requirements#

Cloud native software will be utilised on each respective platform to set up the complete pipeline environment. You will also need to install the Google Cloud CLI.

  • Google Cloud CLI (more information here): Required for all cloud installations. This should be a recent version that supports the gcloud storage command, including the required gcloud-crc32c libraries.

Cloud permissions (if using Cloud VMs)#

Caution

This section contains information intended for system administrators

We recommend that a least privilege approach is taken when providing users with permissions to create cloud resources.

The cloud specific examples below demonstrate the minimum required permissions to bootstrap resources for AWS, GCP, and Azure environments.

AWS#
  • Run the following script to create a complete AWS pipeline environment:

    ./biomodal-cloud-utils create aws
    
  • Ensure GitHub repository access is allowed, and public IP assignment is enabled for AWS Batch. This feature can be manually enabled in the console, via the Auto-assign public IPv4 address option in the subnet’s configuration.

  • This will use an additional Terraform module we provide for resource setup (located here)

GCP#

We recommend using GCP’s predefined roles. These roles are created and maintained by Google, so you do not need to create custom roles.

Below are the recommended predefined GCP roles required to create the cloud resources for running this CLI.

  • roles/storage.admin: Create a new bucket, if required. Required: No (unless a new bucket is specified).

  • roles/artifactregistry.writer: Create the required artifact registry. Required: Yes.

  • roles/iam.serviceAccountCreator: Create the required service account(s). Required: Yes.

  • roles/compute.admin: Create the required compute resources and instance template. Required: Yes.

  • roles/iap.admin: Allow IAP access to VMs. Required: Yes.

  • roles/resourcemanager.projectIamAdmin: Assign project-wide roles to service accounts. Required: Yes.

  • roles/storage.legacyBucketWriter: Allow read/write permission on an existing bucket. Required: No (required if you wish to use an existing bucket).

Azure#

If you are planning on using Azure as your cloud environment, please get in contact with us at support@biomodal.com for tailored setup instructions.

Download the installation files#

Caution

This section contains information intended for system administrators

We provide a simple tool to download all the required scripts directly to your local computer, HPC node, or cloud VM.

  1. Please run the following command from a Linux terminal or cloud shell:

    bash <(curl -s https://app.biomodal.com/cli/download)
    
  2. You’ll be prompted to authenticate with your existing biomodal username and password.

  3. After the download, unzip the file before you start using the CLI scripts. You can use unzip (as shown below) or another tool, such as jar xvf, to uncompress the file:

    unzip biomodal.1.1.3.zip
    

In the unzipped biomodal.1.1.3 folder, you will see the following files:

  • biomodal-cloud-utils: Script to create an AWS, GCP, or Azure cloud environment

  • biomodal-hpc-utils-admin: System administrator installation script

  • biomodal-hpc-utils-conda: Conda-based installation script

  • cli/_biomodal_functions.sh: Helper functions for the biomodal CLI

  • cli/_biomodal_validate.sh: Validator functions for the biomodal CLI

  • cli/biomodal: Main biomodal CLI script

  • clientConfig.json: Customer authentication configuration settings

  • terraform/: Folder containing Terraform scripts for each supported cloud platform

  • docs/: Folder containing a copy of all biomodal CLI documentation

  • _images/: Images used by the documentation

Authentication & configuration#

Caution

This section contains information intended for system administrators

  • Upon login, a ~/.biomodal-auth.json file will be created to control the tokens required to authorise access to the biomodal resources

  • The default location of this file is in the user's $HOME directory

  • Each user needs their own token file for running the CLI if event and metrics sharing is enabled (for more information, please see Sharing event & metric data with biomodal)

About software updates and sharing usage data#

Caution

This section contains information intended for system administrators

By default, the biomodal pipeline checks for software updates via Nextflow. To disable this feature:

export NXF_DISABLE_CHECK_LATEST=true
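To make this setting persistent across sessions, it can be appended to the relevant shell start-up file (assuming Bash and a per-user ~/.bashrc):

echo 'export NXF_DISABLE_CHECK_LATEST=true' >> ~/.bashrc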

The CLI can send runtime event logs to biomodal to help support teams troubleshoot. For more information, please see Sharing event & metric data with biomodal.

Installing the biomodal CLI#

Important

The following section involves cloud infrastructure setup or requires HPC admin privileges so must be performed by a system administrator. Once the setup is complete, the analyses can be run with minimal technical involvement.

If you need help with any of the steps or installation, please contact us at support@biomodal.com.

Preventing timeout#

Caution

This section contains information intended for system administrators

Because some analyses can run for several hours or even days, we recommend running installation and pipeline commands inside a terminal multiplexer like tmux or screen. This ensures your job won’t be interrupted if your connection drops or you close your terminal window.
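For example, with tmux (the session name is arbitrary):

# Start a named session and run the long-lived command inside it, then detach with Ctrl-b d
tmux new -s biomodal

# Reattach later to check on progress
tmux attach -t biomodal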

Installation on Cloud VMs (AWS / GCP / Azure)#

Caution

This section contains information intended for system administrators

We have created the biomodal-cloud-utils script to assist with the creation of cloud resources required to run the biomodal duet pipeline with Nextflow. The currently supported clouds are GCP, AWS, and Azure.

Please follow these steps from a Linux-compatible computer:

  1. Install Terraform: see this link for OS-specific instructions

  2. Ensure you have permissions to create cloud resources

  3. Authenticate using your cloud provider CLI:

    • GCP: gcloud auth login

    • AWS: aws configure

    • Azure: az login

  4. Run the create command for your cloud:

    ./biomodal-cloud-utils create gcp
    
    ./biomodal-cloud-utils create aws
    
    ./biomodal-cloud-utils create azure
    
  5. Connect to the new environment e.g.:

    ./biomodal-cloud-utils connect gcp
    
  6. When complete, you can choose to remove the newly created resources e.g.:

    ./biomodal-cloud-utils destroy gcp
    

Warning

IMPORTANT INFORMATION

WARNING: Removing your resources is a permanent action; only run this command when you are sure you no longer need them. Buckets to which you have added additional files should not be deleted.

Note

The cloud VM will be created with a public IP to allow for data ingress and egress. Please update this as required after software install, authentication, configuration, and testing.

Installing on HPC clusters#

Caution

This section contains information intended for system administrators

We have created two scripts to assist with installing required software and downloading the pipeline on an HPC cluster, or standalone server:

  • biomodal-hpc-utils-admin: Recommended for centralised deployment

  • biomodal-hpc-utils-conda: Used if the user lacks the necessary permissions to install software centrally, or when local user software installations need to be kept separate so they do not interfere with other installations; this option is not preferred

Important

Please check and review every step before proceeding as every HPC is different. The following instructions are intended as a guide only.

The software installation scripts described in this section are not intended to be re-run when you upgrade to a newer version of the biomodal CLI or duet pipeline. Please see Updating the biomodal pipeline for more information on how to update the CLI and pipeline software.

Restrictive quotas on local disk resources and software install limitations are likely to cause issues during installation and update operations.

Conda installation (not preferred)#

Caution

This section contains information intended for system administrators

While the CLI may be available via a Conda environment, we do not recommend installing or running the biomodal CLI this way for production use. There are several important reasons for this:

  • File size limitations: Local Conda environments often enforce lower limits on maximum file and path sizes, which can cause problems during large-scale analysis

  • Data duplication: Using a local Conda environment may result in duplicated software resources across environments, increasing disk usage unnecessarily

  • Lack of central management: Installing the CLI via Conda means it is not centrally configured or controlled, which can introduce inconsistencies across users and compute nodes in the cluster

Important

For consistency, reliability, and optimal performance, we strongly recommend running the biomodal CLI using the admin installation script (see Admin installation (recommended)) at the network or shared software module level. This ensures environment variables, shared directories, and version updates are centrally managed across all users and nodes.

When executed, the scripts will ask you to confirm that you have read the documentation and understand the implications of running them on your system.

Please confirm that you have read the documentation and
understand the implications of running this script on your system.

Are you sure you want to run this script? (y/n)

Important

Only use this if you are experienced with Conda and your cluster’s Linux environment. This method does not allow for shared installation and may lead to redundant storage usage.

Do not assume that all steps in these scripts will be an exact match for your local server or HPC nodes. If you choose to use the biomodal-hpc-utils-conda script, we recommend you use an existing working conda environment.

Note

Singularity or Apptainer must be installed outside of the Conda environment, by a root or sudo user.

We recommend using tmux or screen to keep sessions alive for long-running tasks. See Preventing timeout for more information.

To begin, run the following script:

./biomodal-hpc-utils-conda

Follow the prompts and answer the required questions:

  1. CLI installation path

    Enter cli installation path, must be a directory you have
    read/write/execute access to. : /biomodal
    

This is where the CLI software will reside on your system. Ideally choose a central location that users on the system can access if needed.

  • Default: /biomodal

  • Alternative: Choose any directory with enough free space and write access.

    Note

    The HPC users that will run the biomodal pipeline will require read and execute access to this location

  2. Data bucket path

    Enter data bucket path - shared location where ref data and images
    will be stored:
    
    /biomodal/data_bucket
    

This path will store reference data, pipeline images, and shared resources.

  • Default: /biomodal/data_bucket

  • Alternative: Choose a shared network storage location, if working in a group

This location can be re-used for other biomodal runs or installations in the future

  3. Choose the Docker registry location closest to your physical location or data centre. This ensures faster downloads and fewer errors during pipeline execution.

    Choose relevant geographical location of the biomodal docker registry
    (1-3)?
    
    1) asia
    2) europe
    3) us
    
    #?
    
  4. Set max concurrent jobs

    Current queueSize is: 200
    
    Would you like to change it? (y/n)
    

This controls how many jobs Nextflow can run at once. If jobs keep failing or hanging, try lowering this value later.

  • Yes: Enter a lower number if you’re working with limited HPC resources

  • No: Keep the default of 200 if you’re unsure

  5. Select job scheduler (executor)

    Which executor would you like to use? (choose 1-4)
    
    1) slurm
    2) lsf
    3) sge
    4) local
    #?
    

Select the job scheduler used by your HPC system:

  • slurm, lsf, or sge: Choose based on what your HPC uses

  • local: Use this only if you’re running everything on a single server without a scheduler

If you have not activated a Conda environment before starting the installation, miniconda will be installed. You will see the following message and be prompted to provide a location for your new Conda environment.

conda not found, installing miniconda.

Enter dependency installation root directory (default: home directory)

The script will now complete the installation of the remaining third-party software.

Note

Approximately 150,000-160,000 files (depending on the external software versions at the time of installation) will be installed in your Conda environment.

  6. Once the script finishes, you may be prompted to log out and back in again to apply the changes.

Your system is now ready to import and run the biomodal duet pipeline.

HPC set up recommendations#

Caution

This section contains information intended for system administrators

Overview#

When running the duet pipeline on HPC clusters, it’s important to configure your environment correctly to avoid job failures and ensure optimal performance. This section outlines recommended settings for managing temporary directories, scratch space, memory and CPU limits, and scheduler-specific configurations.

These shell environment settings/variables should be added to your cluster environment configuration files, such as .bashrc, .bash_profile, or .profile (or similar), to ensure they are set correctly when jobs are submitted on your cluster.

Note

Defining these variables within your startup scripts sets them for the respective user, and so could impact other pipelines using Apptainer/Singularity. Please consult with your cluster administrator before making these changes.

If not using the duet software for an extended period of time, it is recommended to remove or comment out these lines until resuming analysis with the duet software pipeline.

Important

Always validate changes by running biomodal test before processing full datasets.

Confirm with your system administrator if you’re unsure about your scheduler’s default behaviour.

Allocating sufficient space for temporary files#

The pipeline (especially when using Singularity or Apptainer containers) creates large temporary files in /tmp. If your cluster has limited space in this directory, jobs may fail due to insufficient disk space. We recommend the following:

  • Bind a larger, custom temporary directory in your nextflow.config:

    singularity {
      enabled = true
      autoMounts = true
      runOptions = "--bind /scratch/tmp:/tmp"
    }
    

    This tells Singularity: "Use /scratch/tmp for your temp files instead of the default /tmp."

  • Alternatively, use environment variables for flexibility:

    singularity {
      enabled = true
      autoMounts = true
      envWhitelist = "TMPDIR,SINGULARITY_TMPDIR,SINGULARITY_CACHEDIR,NXF_SINGULARITY_CACHEDIR"
      runOptions = '--bind "$TMPDIR:/tmp"'
    }
    
  • Set the following in your shell environment:

    export TMPDIR=/your/larger/tmp
    export SINGULARITY_TMPDIR=$TMPDIR
    
  • (Optional) to manage where containers are downloaded and stored, define:

    export SINGULARITY_CACHEDIR=/your/cache/dir
    export NXF_SINGULARITY_CACHEDIR=$SINGULARITY_CACHEDIR
    
  • Ensure that this cache directory matches the libraryDir setting in nextflow.config.

Additionally, try adding wider cluster options to the process section in nextflow.config if you are still experiencing issues related to temporary files:

process {
  executor = "slurm"
  containerOptions = "--bind /<your local tmp path>:/tmp"
}

Important

We strongly recommend that both TMPDIR and SINGULARITY_TMPDIR environment variables are set on the head/controller node before any jobs are submitted. This should ensure that the temporary files are written to the correct location.

  • If you don’t do this, and your /tmp directory fills up, you may see an error like this:

    Error in tempfile() using template /tmp/parXXXXX.par:
    Parent directory does not exist
    
    Error in tempfile() using template /<your local tmp path>/parXXXXX.par:
    Parent directory (/<your local tmp path>/) does not exist at /venv/bin/parallel
    

Please note that the variable names may change slightly depending on whether you are using Singularity or Apptainer:

Singularity variable

Apptainer variable

SINGULARITY_CACHEDIR

APPTAINER_CACHEDIR

SINGULARITYENV_TMPDIR

APPTAINERENV_TMPDIR

SINGULARITYENV_NXF_TASK_WORKDIR

APPTAINERENV_NXF_TASK_WORKDIR

SINGULARITYENV_NXF_DEBUG

APPTAINERENV_NXF_DEBUG

Ensure your Apptainer installation includes a Singularity symlink in your $PATH and add these variables to .bashrc or equivalent config files.
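A minimal sketch of the corresponding shell start-up additions for an Apptainer-based cluster, assuming placeholder scratch paths under /scratch/$USER (adjust to your own storage locations):

# Example .bashrc additions for an Apptainer-based cluster (placeholder paths)
export TMPDIR=/scratch/$USER/tmp
export APPTAINERENV_TMPDIR=$TMPDIR
export APPTAINER_CACHEDIR=/scratch/$USER/container-cache
export NXF_SINGULARITY_CACHEDIR=$APPTAINER_CACHEDIR
mkdir -p "$TMPDIR" "$APPTAINER_CACHEDIR"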

Per-task or CPU memory reservation for LSF#

The default LSF cluster configuration works in per-core memory limit mode, so it divides the requested memory by the number of requested CPUs. If your LSF cluster is configured differently, it is recommended that you add the following settings to the LSF executor section of the nextflow.config file in the biomodal script folder.

executor {
  name = "lsf"
  perTaskReserve = false
  perJobMemLimit = true
}

LSF Executor scratch space#

You can dynamically use LSF scratch space per job using the following settings in the nextflow.config file in the biomodal script folder.

process {
  executor = "lsf"
  scratch = "$SCRATCHDIR/$LSB_JOBID"
}

This ensures temporary files are written to isolated, job-specific directories and cleaned up after job completion.

Wall-time and minimum CPU settings#

In some clusters, it is recommended to set a “wall-time”, i.e. the max time a job can run before it is terminated. There may also be a “minimum CPU” and/or queue requirement to start jobs in your cluster. You can set these in your nextflow.config:

process {
  executor = "slurm"
  memory = "16GB"
  cpus = 4
  nodes = 2
  time = "24h"
  params.cpuNum = 1
  queue = "default"
}

Adjust executor, time, queue, and cpuNum to match your local scheduler and policies.

Set queue, CPU, RAM and DISK per module#

The duet pipeline supports fine-grained resource customisation per module using withName: in the process section. This allows you to allocate additional resources for memory- or CPU-intensive steps. For example, you can set the CPU, RAM, or DISK requirements for the BWA_MEM2 and PRELUDE modules as follows:

process {
  // Default settings for all modules
  cpus = 16
  memory = "16GB"
  queue = "default"
  time = "24h"
  disk = "800GB"

  // Specifically overriding settings for a module
  withName: 'BWA_MEM2' {
    memory = "64GB"
    queue = "high-mem"
    time = "48h"
  }

  withName: 'PRELUDE' {
    cpus = 32
    memory = "32GB"
    queue = "large-cpu"
  }
}

Use this to align specific pipeline steps with available HPC queues and avoid resource contention or throttling.

Memory settings for SGE/OGE/UGE clusters#

If you’re on a cluster that doesn’t support memory flags like h_rt, h_rss, or mem_free, you can pass memory requirements via clusterOptions.

Option 1: Apply globally

process {
  cpus = 16
  memory = "16GB"
  queue = "short"
  time = "24h"
  disk = "800GB"
  clusterOptions = { "-l h_vmem=${task.memory.toString().replaceAll(/[\sB]/,'')}" }
}

Option 2: Apply per module

withName: 'BWA_MEM2' {
  cpus = 32
  clusterOptions = "-l h_vmem=64GB"
  memory = null
  queue = "long"
  time = "48h"
}

If your cluster uses per-core rather than per-job memory limits, divide the total memory by the number of CPUs when setting h_vmem. For example, a job requesting 64GB across 32 CPUs would set -l h_vmem=2G.

Manual Installation for HPC users (Advanced)#

Caution

This section contains information intended for system administrators

You can opt to run a manual install of the CLI and Nextflow config files if preferred. This section explains how to manually install and configure the biomodal CLI and Nextflow settings if you’re not using the biomodal-hpc-utils-admin or biomodal-hpc-utils-conda script.

Important

This setup is only intended for system administrators or advanced users who are comfortable managing configuration files and working with HPC environments.

Step 1: Copy the scripts

Follow the instructions in Manual update of the CLI and pipeline to copy the scripts, make them executable, and create the error configuration files in your preferred biomodal CLI installation directory (e.g. /biomodal).

Step 2: Create the config.yaml file

The CLI requires a configuration file named config.yaml to define where to install and run the pipeline. Create this file in your chosen biomodal CLI installation directory, with the following content:

platform: hpc
duet_version: 1.4.2
ref_pipeline_version: 1.0.1
bucket_url: <path to your input/output data storage>
work_dir: <path to a local or shared scratch directory>
init_folder: <path where the duet pipeline software will be installed>
biomodal_registry: <see table below>
biomodal_api_baseurl: https://app.biomodal.com
biomodal_releases_bucket: gs://cegx-releases
biomodal_reference_bucket: gs://release-reference-files
share_metrics: <see table below>
error_strategy: FailFast

Key information:

  • bucket_url: Directory or bucket where input/output data is stored

  • work_dir: A fast, temporary directory used during execution. It should have ~3x the input data size available

  • share_metrics: If set to true, events and the duet pipeline metrics report will be shared with biomodal after a successful analysis run. For more information, please see Sharing event & metric data with biomodal

  • init_folder: The location where the duet pipeline and workflow tools will be stored

  • biomodal_registry: Choose one based on your region:

    • asia-docker.pkg.dev/cegx-releases/asia-prod

    • europe-docker.pkg.dev/cegx-releases/eu-prod

    • us-docker.pkg.dev/cegx-releases/us-prod
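As a hedged illustration, the file could be created as follows for a Slurm-based HPC install; every path below is a placeholder and the registry shown is the European one, so adjust both to your own system and region:

# Write an example config.yaml (placeholder paths; values mirror the template above)
cat > /biomodal/config.yaml <<'EOF'
platform: hpc
duet_version: 1.4.2
ref_pipeline_version: 1.0.1
bucket_url: /biomodal/data_bucket
work_dir: /scratch/biomodal/work
init_folder: /biomodal/software
biomodal_registry: europe-docker.pkg.dev/cegx-releases/eu-prod
biomodal_api_baseurl: https://app.biomodal.com
biomodal_releases_bucket: gs://cegx-releases
biomodal_reference_bucket: gs://release-reference-files
share_metrics: true
error_strategy: FailFast
EOF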

Step 3: Create the nextflow.config file

This file is required to tell Nextflow how to execute jobs on your system and where to find container images. This local configuration file is applied after any settings defined in the standard duet pipeline configuration files.

Nextflow will automatically cascade the settings in the duet pipeline configurations, followed by the nextflow.config in the versioned biomodal script folder within the init_folder directory (see table above). This ensures that your customised settings are always retained in the nextflow.config file in the biomodal script folder when upgrading to any new releases of the duet pipeline.

Create nextflow.config in the biomodal CLI installation directory with the following content:

singularity {
  enabled    = true
  autoMounts = true
  libraryDir = "<starts with same path as 'bucket_url' in config.yaml> + '/singularity-images'"
}
params {
  registry = "<see below>"
}
process {
  executor = "slurm"  // or lsf, sge, or local
}
report {
  overwrite = true
}

Choose one registry based on your region:

  • asia-docker.pkg.dev/cegx-releases/asia-prod

  • europe-docker.pkg.dev/cegx-releases/eu-prod

  • us-docker.pkg.dev/cegx-releases/us-prod

Set the process section according to your HPC job scheduler (select local if no HPC scheduler is present), using one of the following:

process {
  executor = "slurm"
}
process {
  executor = "lsf"
}
process {
  executor       = "sge"
  process.penv   = "smp"
  clusterOptions = "-S /bin/bash"
  //If your system does not support h_rt/h_rss/mem_free settings,
  //you can try to use the following settings:
  //clusterOptions = { "-l h_vmem=${task.memory.toString().replaceAll(/[\sB]/,'')}" }
}
process {
  executor = "local"
}

These settings are layered on top of the duet pipeline’s own configuration and will remain active across pipeline upgrades. For more information, see HPC set up recommendations.

Using the CLI#

This section walks you through how to use the biomodal CLI, from running commands to importing the pipeline. It also includes input file requirements, troubleshooting tips, and details on expected outputs.

Important

Please read through this entire section before starting your analysis and save it for future reference.

Command structure and usage#

biomodal

This will display:

biomodal v1.1.3

usage: biomodal [SUBCOMMAND] [OPTION]...

Key CLI commands:

  • auth: Login to biomodal

  • init: Import the duet pipeline, reference files, Docker images, and test data

  • test: Perform a test run using the provided dataset

  • analyse: Run the duet analysis pipeline on your data

  • validate: Check your parameters before launching a run

  • call_dmr: Run the D(h)MR analysis

  • reference subcommands: Manage reference genome data and pipelines

  • report: Share a metrics report with biomodal

Each command has parameters and options. Use the biomodal command for more details.

Input file requirements#

The following sections explain how to prepare your input data and directory structure.

Inputs

  • An optional metadata file can be used in .csv format detailing samples used in the experiment

  • A folder named nf-input containing the FASTQ files to be run in the pipeline. The FASTQ files require a specific format for:

    • File names

    • Sequence identifier

    • Inferred instrument type

These inputs are used to create a Read Group used during alignment. For more information, please see FASTQ filename format.

File structure#

The --input-path folder provided for the duet pipeline must contain the nf-input folder (and the metadata .csv file). Your input directory should look like this:

my_input_path
├── biomodal_run_meta.csv
└── nf-input
   ├── Sample-1_S1_L001_R1_001.fastq.gz
   └── Sample-2_S1_L001_R1_001.fastq.gz

Note

For the D(h)MR workflow, please note that the --input-path parameter should point to the full path of the relevant Zarr stores.

For more information on the D(h)MR workflow, see D(h)MR calling workflow.

FASTQ filename format#

FASTQ filenames must follow this pattern, which satisfies the naming convention used in BaseSpace:

{sample-id}_{sample-number}_{lane}_{R1|R2}_001.fastq.gz

Rules:

  • No underscores in sample-id

  • Sample IDs should not start with a number

  • Must include the sample-number field (even if ignored)

Example: two samples, each pooled across two lanes

CEG900-123-678_S1_L001_R1_001.fastq.gz

CEG900-123-678_S1_L001_R2_001.fastq.gz

CEG900-123-678_S1_L002_R1_001.fastq.gz

CEG900-123-678_S1_L002_R2_001.fastq.gz

CEG900-123-679_S2_L001_R1_001.fastq.gz

CEG900-123-679_S2_L001_R2_001.fastq.gz

CEG900-123-679_S2_L002_R1_001.fastq.gz

CEG900-123-679_S2_L002_R2_001.fastq.gz

FASTQ sequence identifier requirements#

The format of the sequence identifiers in FASTQ files can differ depending upon the software that was used to generate them. The pipeline requires the read identifiers in FASTQ files to comply with the format of the following example:

@A00536:706:HJNLHDRX3:1:2101:2718:1031 1:N:0:ACGGAACA+ACGAGAAC

The following list describes each of the identifiers and their purpose:

  • A00536: Instrument name

  • 706: Run ID

  • HJNLHDRX3: Flowcell ID

  • 1: Flowcell lane

  • 2101: Tile number

  • 2718: x-coordinate

  • 1031: y-coordinate

  • 1: Member of a pair

  • N: Filtered

  • ACGGAACA+ACGAGAAC: Index sequences

Inferred instrument type#

The pipeline will infer the instrument type from the instrument name extracted from the first read in each FASTQ file based on the following convention:

  • Read IDs beginning with @A: NovaSeq6000

  • Read IDs beginning with @LH: NovaSeqX

  • Read IDs beginning with @VL: NextSeq1000

  • Read IDs beginning with @VH: NextSeq2000

  • Others: Unknown

Use of the extracted flowcell ID and flowcell lane#

The sample ID extracted from the FASTQ filename, and the flowcell ID and flowcell lane extracted from the read ID of the first read in the FASTQ file will be used to construct a Read Group which will get passed into the aligner during the alignment step.

The following example shows the Read Group (@RG) generated for sample ID CEG900-123-678, flowcell ID HJNLHDRX3 and flowcell lane 1:

@RG ID:HJNLHDRX3.1 PL:ILLUMINA PU:HJNLHDRX3.1.CEG900-123-678
LB:CEG900-123-678 SM:CEG900-123-678

Metadata file#

The pipeline can use an optional metadata file in .csv format, with one column per sample. The remaining cells in row 1 should contain sample IDs, and the remaining cells in column A should contain metadata field names. There are no requirements about what these field names are, nor the quantity of them.

The pipeline assumes that this file will be encoded in ASCII and that fields and field names will not contain any commas.

  • Must be named in the --meta-file argument

  • Must be placed in the same folder as the FASTQ files

The filename can be anything, but the --meta-file argument must exactly match the name of the metadata file. Please only use the metadata filename, not the full path.

The metadata file does not control which samples are analysed in the duet pipeline; all samples in the nf-input directory in your --input-path location will be processed.

Optional metadata file in .csv format example:

sample_id,CEG900-123-678,CEG900-123-679,CEG900-123-680,CEG900-123-681
sample-condition,case,control,case,control
sample-sex,male,male,female,female
lab-technician,JM,JM,ST,ST
project-id,X001,X001,X002,X002
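For example, if this metadata file is saved as biomodal_run_meta.csv alongside the FASTQ files, it is referenced by filename only; the paths and run name below are placeholders, and the remaining analyse options are described later in this guide:

biomodal analyse \
  --input-path /data/my_input_path \
  --output-path /data/my_input_path \
  --meta-file biomodal_run_meta.csv \
  --run-name my-first-run \
  --mode 5bp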

Before you start#

Avoid session timeouts in long analysis runs#

To avoid session timeouts during long analysis runs, use a terminal multiplexer. For more information, please see Preventing timeout.

Check processing and storage requirements#

For more information on the minimum requirements for each module in the pipeline, please see Module requirements.

Please check the minimum resource requirements to ensure you allocate sufficient CPU, RAM, and disk space depending on your chosen analysis profile. Your cloud tenancy or HPC cluster should be able to allocate a set of VMs/nodes with these resources.

You can edit resource requirements by referencing specific modules in the nextflow.config file in the biomodal script folder to accommodate available hardware resources. Please see examples in the following sections for more information on how to increase or decrease the resource requirements. Please note that the duet pipeline is less likely to run successfully if the minimum resources are too restrictive.

If you have any questions, please contact support@biomodal.com.

Elevated resource profiles#

Each module file in the duet pipeline contains default resource settings in relation to CPUs, memory, and disk. These settings are intended to be sufficient for shallow sequencing runs where you have up to 50 million reads per sample per lane.

On deeper sequencing runs, you should invoke an elevated resource profile, which will allocate additional resources to jobs. For the deepest sequencing runs in UMI-mode, it is also necessary to invoke a CLI parameter that will distribute the duplicate identification step across intervals/contigs. The parameters to use are described below:

  • Up to 50 million reads per sample per lane: default profile; no additional parameter

  • Up to 500 million reads per sample per lane: deep_seq profile; add --additional-profile deep_seq

  • Over 500 million reads per sample per lane: super_seq profile; add --additional-profile super_seq

If using UMIs and very deep sequencing (>2B reads):

--additional-params dedup_by_contigs=true

Important

Invoking any of these profiles will require additional CPU, RAM, and disk resources to the duet pipeline. Please ensure you have sufficient storage and credit prior to starting any runs.
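As a hedged example, a very deep UMI run might combine the elevated profile with the distributed deduplication parameter as shown below; the paths and run name are placeholders, and the remaining options follow the analyse examples elsewhere in this guide:

biomodal analyse \
  --input-path /data/deep_run \
  --output-path /data/deep_run \
  --run-name deep-umi-run \
  --additional-profile super_seq \
  --additional-params dedup_by_contigs=true \
  --mode 5bp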

Limit local resources#

If you have to limit the duet pipeline CPU and RAM resources for the local executor on a single machine, you can use the cpu_num and memory parameters in conjunction with the local or local_deep_seq profiles.

Note

Please note this only works with the local executor, so this will not have any effect if you are using HPC or cloud-based executors.

Example for constrained local machine (limited to 32 CPUs and 128GB RAM):

--additional-profile local_deep_seq --additional-params cpu_num=32,memory="128GB"

Import the pipeline with biomodal init#

Before you start#

Once your cloud or HPC environment is set up, you can import the duet pipeline and all required resources using the biomodal init command.

To get started, run the following:

biomodal init

This command will:

  • Download the latest version of the duet pipeline and D(h)MR workflow (for more information on the D(h)MR workflow, see here)

  • Import all required reference files

  • Import the biomodal test dataset (RunXYZ)

  • Download and cache the required Docker or Singularity containers

  • Prepare the system for a successful test run and future analysis

  • Prompt you to enable metrics sharing

Important

Ensure you have authentication set up with your cloud provider if using GCP, AWS, or Azure. Terraform must also be configured if you are using the biomodal-cloud-utils setup method.

Important

Please do not attempt to run biomodal test or biomodal analyse if you experienced any problems during the bootstrapping or biomodal init process.

biomodal init must always be successfully completed after a fresh install and after any manual update of the CLI and/or duet pipeline.

Please review Elevated resource profiles to ensure you allocate sufficient hardware resources relative to the number of reads per sample before running the duet pipeline.

Running the biomodal init command#

During biomodal init, you will be prompted to:

  1. Enable event and metrics sharing with biomodal

    Would you like to share the pipeline metrics report at
    the end of all successful analysis runs with biomodal? (1-2)
    
    1) Yes
    2) No
    #?
    

If enabled, performance metrics are shared to support ongoing improvements and help biomodal troubleshoot issues more effectively. No sample data is shared. For more information, please see Sharing event and metric data with biomodal.

  2. Specify the duet pipeline installation location

    The duet software location 'init_folder' property is not
    currently defined in the configuration file
    
    Would you like to set it to /<home_folder>? (1-2)
    
    1) Yes
    2) No
    #?
    

This sets where the CLI and pipeline components will be stored.

  3. Choose an error handling strategy

    • Normal: Retry failed jobs up to 10 times

    • FailFast: Stop early if jobs fail

    Please select which duet pipeline error strategy you would like to use:
    
    1) Normal
    2) FailFast
    #?
    

At this stage you will see the following downloads:

  • Genome reference data (unless already downloaded)

  • Container images (unless already downloaded)

  • Pipeline software

  • Small test dataset

Once complete, the CLI will download all files and verify the setup. If any downloads are interrupted, you can clear the partial data and re-run biomodal init.

You’re now ready to proceed with a test run to confirm everything is working correctly.

Expected output#

Once you start a run (e.g. with biomodal test), you’ll see an output like this:

N E X T F L O W ~ version 24.10.6
Launching /opt/apps/biomodal/latest/biomodal-duet-1.4.2/main.nf [boring_pasteur] DSL2 - revision: 65b941077a
executor > slurm (21)
[- ] resolve_align:FASTQC_RAW -
[1e/69d214] resolve_align:PRELUDE (CEG9330132-19-01, L001) [100%] 1 of 1 ✔
[- ] resolve_align:FASTQC_RESOLVED -
[b8/c8c72b] resolve_align:BWA_MEM2 (CEG9330132-19-01, L001) [100%] 1 of 1 ✔
[- ] resolve_align:SAMTOOLS_MERGE_LANES -
[- ] resolve_align:PRESEQ -
[d5/ec7739] bamlet_v2:ONE_STEP_BAMLET (CEG9330132-19-01) [100%] 1 of 1 ✔
[6e/df7d5c] bamlet_v2:QUALIMAP_BAMQC (CEG9330132-19-01) [100%] 1 of 1 ✔
[80/db515c] bam_stats_genome:TSS_BIAS (CEG9330132-19-01) [100%] 1 of 1 ✔
[- ] bam_stats_genome:DEEPTOOLS_PLOTFINGERPRINT -
[3a/292078] bam_stats_genome:PICARD_GCBIAS (CEG9330132-19-01) [100%] 1 of 1 ✔
[42/ff20ec] bam_stats_long_ctrl:LONG_CTRL_GENETIC_ACCURACY (CEG9330132-19-01) [100%] 1 of 1 ✔
[e2/880eac] variant_calling:HAPLOTYPE_CALLER (CEG9330132-19-01) [100%] 1 of 1 ✔
[- ] variant_calling:ASM -
[20/d45b69] mbias:GENOME_MBIAS (CEG9330132-19-01-CG-5-prime) [100%] 1 of 1 ✔
[08/442714] epiquant_v2_genome:EPIQUANT_BAM2ZARR (CEG9330132-19-01-CG) [100%] 1 of 1 ✔
[9f/594e09] epiquant_v2_genome:MODALITY_EXPORT (CEG9330132-19-01_CG) [100%] 1 of 1 ✔
[77/374a22] epiquant_v2_genome:JOIN_ZARR_STORES (CEGX_RunXYZ-genome-CG) [100%] 1 of 1 ✔
[b8/89e666] epiquant_v2_short_ctrl:EPIQUANT_BAM2ZARR (CEG9330132-19-01-CG) [100%] 1 of 1 ✔
[4e/73c13b] epiquant_v2_short_ctrl:MODALITY_EXPORT (CEG9330132-19-01_CG) [100%] 1 of 1 ✔
[- ] epiquant_v2_short_ctrl:JOIN_ZARR_STORES -
[8f/35d181] epiquant_v2_long_ctrl:EPIQUANT_BAM2ZARR (CEG9330132-19-01-CG) [100%] 1 of 1 ✔
[c4/74201b] epiquant_v2_long_ctrl:MODALITY_EXPORT (CEG9330132-19-01_CG) [100%] 1 of 1 ✔
[1c/575553] epiquant_v2_long_ctrl:JOIN_ZARR_STORES (CEGX_RunXYZ-long_ctrl-CG) [100%] 1 of 1 ✔
[e0/6821c2] summary:DQSPY (CEGX_RunXYZ) [100%] 1 of 1 ✔
[31/f8e5ed] summary:PIPELINE_REPORT (CEGX_RunXYZ) [100%] 1 of 1 ✔
[2b/9cc002] summary:DQSREPORT [100%] 1 of 1 ✔
[ad/d278e7] summary:MULTIQC (CEGX_RunXYZ) [100%] 1 of 1 ✔
Completed at: 15-Apr-2025 15:52:10
Duration : 33m 51s
CPU hours : 5.3
Succeeded : 21

Output directories overview#

Outputs from the Nextflow pipeline get published to your cloud storage bucket or output directory, organised into a directory structure.

Top-level:

gs://my-org-nextflow/test-runxyz

Subdirectories:

  • nf-input: FASTQ input files. FASTQ files should be copied into this directory prior to launching the pipeline. Please note that all FASTQ files undergo analysis regardless of the sample info in the meta file.

  • nf-work: Intermediate files and logs. This contains logs and temp data, organised by Nextflow job ID, and is useful for debugging.

  • nf-results: Final analysis outputs. This is the directory where the outputs of the pipeline are stored, organised by pipeline run, sample, and pipeline module. This directory is described further below.

nf-work folder#

This is the working directory for Nextflow where logs and files staged for downstream processes are stored. There will also be a directory that includes the pipeline version and the tags that you set when launching the pipeline.

The directory structure will comprise hashes which match those displayed in Nextflow and which uniquely identify specific jobs launched by Nextflow. For example:

gs://my-org-nextflow/test-runxyz/nf-work/duet-1-0.1.0_2021-05-15_1656_5bp/01/af2b9a7434a6ca435b96c6b84cb9a2

This directory is useful for debugging the pipeline, examining logs and viewing the STDOUT from jobs. Inside this directory there will be subdirectories associated with each pipeline execution.

If the pipeline is running smoothly on your runs, you will rarely need to look in the nf-work directory. If a pipeline has completed successfully and you have no intention of resuming it with modified settings or examining logs, you can delete the contents of the associated subdirectory inside the nf-work directory.

nf-results folder#

  • reports: Sample-level and multi-sample reports summarising information about the samples and controls

  • sample_outputs: Primary data files generated by the pipeline (described in more detail below)

  • controls: BAM files and quantification files associated with the methylated lambda and unmethylated pUC19 controls. These small files are analogous to the BAM files and quantification files generated for your samples and may be useful for familiarising yourself with the file formats. Note that there is an accompanying FASTA file for the controls in the reference file directory with the following name/location: ss_ctrls/v24/ss-ctrls-long-v24.fa.gz

  • diagnostics: Secondary outputs from the pipeline, including a parameters log recording the parameters that were used to execute the pipeline, more extensive metrics to support more detailed data investigations, and the interim resolved FASTQ files that were passed into the aligner

Key outputs#

  • reports/summary_reports: aggregate QC reports of all samples

  • reports/sample_reports: HTML QC reports per sample

  • sample_outputs/bams: Deduplicated BAM files

  • sample_outputs/modc_quantification: quantification files reporting modified cytosines

  • sample_outputs/variant_call_files: VCF files

For a more detailed overview of the duet pipeline outputs and explanation of the formats, please see the Bioinformatics data interpretation guide.

Removing data from the temporary pipeline folders#

The temporary pipeline folders created by the duet pipeline can be removed upon successful completion of the pipeline run. If a pipeline has completed successfully and you have no intention of resuming it with modified settings or examining logs, you can delete the contents of the associated subdirectory inside the nf-work directory. Please see Output directories overview for more details about the pipeline folder structure.
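As an illustration, assuming a run wrote its output to the placeholder path /data/my_run, completed successfully, and will not be resumed, the intermediate work files could be removed with:

# Delete intermediate Nextflow work files for a completed run (placeholder path)
rm -rf /data/my_run/nf-work/*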

Checking your set up: Performing test runs#

Before analysing your own data, it’s important to confirm that the pipeline is correctly installed and that your environment has been set up properly. The test runs below ensure that:

  • The duet pipeline and CLI are working as expected

  • Reference files and containers are accessible

  • Your compute environment has appropriate resources and permissions

  • Any issues with configuration, dependencies, or file paths are identified early

Run biomodal test#

Important

You should run the following tests immediately after completing biomodal init. If the test fails, do not proceed with analysing your own data.

After biomodal init completes without errors, run the following to verify your setup:

biomodal test

Example output at the end of the test run:

Completed at: 10-Mar-2023 12:51:33
Duration : 45m 15s
CPU hours : 10.0
Succeeded : 21

If the test fails:

  • Check your internet connection, especially if you’re using cloud resources

  • Verify file paths in your config.yaml are correct and accessible

  • Check authentication: rerun biomodal auth if credentials may have expired

  • Inspect logs in the nf-work directory for errors

Reach out to support@biomodal.com with the test output and any relevant log files.

Run an example analysis with a test dataset (RunXYZ)#

The RunXYZ dataset is a small, pre-configured test dataset provided by biomodal as part of the pipeline installation. It includes a minimal set of sample files and metadata to simulate a typical pipeline run. Running this test helps you verify that:

  • Your environment (local or cloud/HPC) is configured correctly

  • Required dependencies (e.g. Docker/Singularity, Nextflow) are working

  • Reference data and container images are accessible

  • Your CLI config (e.g. config.yaml) and nextflow.config are valid

  • File paths and permissions are correctly set up

This test typically takes 30-60 minutes depending on your environment. It requires significantly fewer resources than a full pipeline run and does not generate large intermediate or final files.

To run the test, enter the following command:

biomodal analyse \
  --input-path /biomodal/data_bucket/test-runxyz \
  --meta-file CEGX_Run_meta.csv \
  --output-path /biomodal/data_bucket/test-runxyz \
  --run-name CEGX_RunXYZ \
  --tag CEGX_RunXYZ \
  --additional-params lib_prefix=CEG9330132-19-01_S12_L001 \
  --mode 5bp

Once completed, check the nf-results directory for output files and reports. This confirms that your system is ready to analyse your own data.

Please note that this manual test command allows the test analysis to be resumed if it is interrupted.

Important

Make sure biomodal init has completed before running this command. If the dataset wasn’t downloaded or configured correctly, re-run biomodal init.

Test with larger GIAB dataset (Optional)#

To simulate real-world conditions, biomodal recommends testing the pipeline with a larger publicly available duet dataset from Genome In A Bottle (GIAB).

Important

This test will incur additional runtime and/or storage costs, particularly if run on cloud infrastructure. Ensure that your cloud environment has sufficient resources and budget allocated before starting the GIAB test.

  1. Download the GIAB dataset (either +modC or evoC) which can be found here: https://biomodal.com/kb_articles/demo-data-instructions/

  2. Run the pipeline:

biomodal analyse \
  --input-path  <path_to_giab_data> \
  --meta-file CEGX_Run_meta.csv \
  --output-path <path_to_output> \
  --additional-profile deep_seq \
  --tag giab_demo_data \
  --mode <5bp|6bp>

The mode selected should be 5bp or 6bp, depending on the demo data selected.

If you have any issues, please contact support@biomodal.com.

Troubleshooting and optimisation#

This section outlines common issues and performance optimisation strategies. If you have any questions or need further help, please reach out to support@biomodal.com.

Reduce disk usage#

The output footprint of a full pipeline run is typically ~1.4x the size of the input FASTQ files. You can reduce this by:

Disable publishing of resolved FASTQs#

Resolved FASTQ files are large (>50% of output) and not required for downstream analysis. To avoid publishing them:

--additional-params publish_by_default.prelude=false

Output CRAMs instead of BAMs#

The duet pipeline outputs sequence alignments as BAM files by default but offers the option to output genome sequence alignments as CRAM files. CRAMs will be output together with a reference genome FASTA (which is required to read CRAMs). CRAMs save ~40% disk space compared to BAMs. To publish CRAMs:

--additional-params publish_crams=true
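Both disk-saving options can be combined in a single analyse invocation, for example (placeholder paths; other options as described earlier in this guide):

biomodal analyse \
  --input-path /data/my_run \
  --output-path /data/my_run \
  --mode 5bp \
  --additional-params publish_by_default.prelude=false,publish_crams=true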

Resuming a disrupted pipeline#

If your run was interrupted, rerun the exact same command with:

--resume

This might be useful if your terminal session is accidentally terminated, such as by a break in network connectivity. Please ensure that you use the same parameters as the previous pipeline when resuming.
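For example, if the RunXYZ test analysis shown earlier was interrupted, the same command is repeated with --resume appended:

biomodal analyse \
  --input-path /biomodal/data_bucket/test-runxyz \
  --meta-file CEGX_Run_meta.csv \
  --output-path /biomodal/data_bucket/test-runxyz \
  --run-name CEGX_RunXYZ \
  --tag CEGX_RunXYZ \
  --additional-params lib_prefix=CEG9330132-19-01_S12_L001 \
  --mode 5bp \
  --resume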

Low GCP CPU quotas#

If using GCP with limited CPU quotas (e.g. free trial accounts), please contact support@biomodal.com as a custom resource profile may be needed.

Orchestration VM#

Your orchestration VM only needs to run when launching pipelines. You can stop it from the cloud provider's console when inactive, and start it again when you need to launch another analysis run.

Updating the biomodal pipeline#

This section explains how to check for new versions of the duet pipeline and update both the CLI and D(h)MR workflow. For more information on the D(h)MR calling workflow, please see here.

Important

Updating the pipeline may require super-user admin privileges depending on your system setup. Please check with your system administrator if you are unsure.

Using the CLI to update the pipeline#

To check for available duet pipeline versions, run the following:

biomodal list

This will list all available versions of the duet pipeline software. Please be aware that some may be older than your current local version.

To install a newer version of the duet pipeline software please run:

biomodal download <version>

This updates containers, reference files, and test data for the selected version. For more information, please see Import the pipeline with biomodal init.

Note

This will overwrite the local files for the specified version. Other installed versions will not be affected.

You can also check for available updates with:

biomodal info

To directly update to the latest version of duet, and change your local settings, you can run:

biomodal init

This downloads any updated containers, reference files, and test data for that version. For more information, please see Import the pipeline with biomodal init.

Manual update of the CLI and pipeline#

You can manually update to CLI 1.1.3 and duet 1.4.2, or newer:

  1. Backup your installation folder

  2. Download and unzip the new CLI version (See Download the installation files)

  3. Copy the following files from ./cli into your local installation directory:

    _biomodal_functions.sh
    _biomodal_validate.sh
    biomodal
    
  4. Update failfast.error.config and retry.error.config with new settings.

HPC version

failfast.error.config:

process {
  errorStrategy = { sleep(Math.pow(2, task.attempt) as long); return task.exitStatus in [14] ? 'retry' : 'terminate' }
}

retry.error.config:

process {
  errorStrategy = { task.attempt.toString().isInteger() ? sleep(Math.pow(2, task.attempt) as long) : sleep(1) ; return task.exitStatus in [14] ? 'retry' : 'ignore' }
  maxRetries    = 10
}

Cloud version

failfast.error.config:

process {
  errorStrategy = { task.attempt.toString().isInteger() ? sleep(Math.pow(2, task.attempt) as long) : sleep(1) ; return task.exitStatus in [0,10,14,50001,50002,50003,50004,50005,50006] ? 'retry' : 'terminate' }
  maxRetries    = 1
}

retry.error.config:

process {
  errorStrategy = { task.attempt.toString().isInteger() ? sleep(Math.pow(2, task.attempt) as long) : sleep(1) ; return task.exitStatus in [0,10,14,50001,50002,50003,50004,50005,50006] ? 'retry' : 'ignore' }
  maxRetries    = 10
}

  5. Make the scripts executable:

chmod +x biomodal _biomodal_functions.sh _biomodal_validate.sh

  6. Download the latest duet version:

biomodal download 1.4.2

Alternatively, run biomodal init to update your local preferences and verify that you have all the required container images and reference files.

  7. Edit your config.yaml to add:

ref_pipeline_version: 1.0.1

  8. Add the following to nextflow.config:

report {
  overwrite = true
}

  9. Run a test to validate the update:

biomodal test

You should now be ready to begin using the updated software.

Advanced analysis features#

CHG and CHH modification calling#

By default, the pipeline calls modified cytosines in CpG context only. However, the pipeline includes the capability to call modified cytosines at CHG and CHH sites (where H is the IUPAC disambiguation code representing a DNA nucleotide other than G). To invoke CHG and CHH modification calling, add the following flag when launching the pipeline:

--chg-chh-contexts true

Mapping Quality Filtering#

By default, the duet pipeline discards reads from the BAM file with a MAPQ (Mapping Quality) score of 0. However, you can adjust the stringency of this filter as needed.

To include all reads regardless of mapping quality:

--additional-params bamlet.min_map_qual=0

To retain only reads in the BAM with MAPQ >=20, thereby discarding lower-confidence mappings:

--additional-params bamlet.min_map_qual=20

The pipeline also supports more stringent MAPQ filtering specifically for reads used in epigenetic quantification. For example, to keep reads in the BAM with MAPQ >=1 but restrict epigenetic quantification to reads with MAPQ >=20:

--additional-params bamlet.min_map_qual=1,epiquant.min_map_qual=20

Epigenetic Quantification Options#

Epigenetic Quantification Output Formats#

In addition to the Zarr store, the duet pipeline can produce four types of plain-text epigenetic quantification outputs: bedmethyl, cxreport, bismark, or bedgraph. The default is to produce cxreport, but any combination can be requested. To produce cxreport and bismark:

--quantification-output cxreport,bismark

Limit epigenetic quantification to primary chromosomes#

By default, the duet pipeline will quantify epigenetics at each CG context present on contigs included in the primary assembly BED (if the species has an available primary assembly BED). In the GRCh38 human reference this includes unlocalised scaffolds (e.g. chr1_KI270706v1_random). To exclude these scaffolds from epigenetic quantification (note they will still be present in the BAM):

--additional-params epiquant.primary_chromosomes_only=true

Include SNVs in CG coverage#

By default, the pipeline counts total coverage at a reference CG position as the number of Cs (modified or unmodified) at a given position. However, you can optionally include A, G and T base calls in total coverage:

--additional-params modality.export.only_c_in_coverage=false

Note that the total coverage in methylation quantification outputs is used as the denominator when calculating the percentage methylated/hydroxymethylated.

Output uncovered CG contexts#

By default any reference CG contexts which are uncovered in a given sample will be omitted from the plain-text quantification outputs. To include zero-coverage CG contexts in quantification outputs:

--additional-params modality.export.skip_rows_zero_counts=false

Variant Calling and Variant Associated Methylation#

The duet pipeline contains modes for calling germline variants using GATK HaplotypeCaller and for calling somatic variants using GATK Mutect2.

Additionally, an Allele Specific Methylation (ASM) module can be activated which will evaluate each heterozygous germline variant for the presence of methylation differences between the two alleles.

Germline Variant Calling#

By default, germline variant calling is performed on each sample individually using GATK HaplotypeCaller, generating a separate VCF file for each sample.

Germline variant calling can be disabled by setting the following parameter when launching the pipeline:

--additional-params call_germline_variants=false

Joint Variant Calling#

The pipeline also supports joint variant calling across multiple samples. Joint variant calling can be activated by adding the following parameter when launching the pipeline:

--use-gvcf true

This will cause the GATK HaplotypeCaller to generate per-sample gVCF files (instead of VCF files) which will then be consolidated via GATK GenomicsDBImport followed by joint genotyping using GATK GenotypeGVCFs.

Allele Specific Methylation#

The Allele Specific Methylation (ASM) module can be activated to evaluate each heterozygous germline variant for the presence of methylation differences between the two alleles. The ASM module is disabled by default, but can be activated by adding the following parameter when launching the pipeline:

--compute-asm true

To perform ASM calling, variants will be ‘atomized’ using bcftools norm so that complex variants are decomposed - i.e. MNVs will be split into consecutive SNVs. An ASM file will be generated for each sample.

If the --compute-asm true parameter is combined with the --use-gvcf true parameter described above, an ASM file will still be generated for each sample, but ASM calling will be performed using the atomized outputs from joint variant calling. A subsequent step will combine per-sample ASM calls into a single file.

Somatic Variant Calling#

The pipeline also supports somatic variant calling which is performed using GATK Mutect2 followed by GATK-recommended filtering using FilterMutectCalls. This is performed in the “tumour-only” mode of Mutect2 (i.e. it is not run using pairing of a tumour / matched normal).

Somatic variant calling can be activated by adding the following parameter when launching the pipeline.

--additional-params call_somatic_variants=true

The filtering of somatic variants using FilterMutectCalls happens by default if somatic variant calling is activated, but filtering can be disabled by adding the following parameter when launching the pipeline:

--additional-params filter_somatic_variants=false

Subsampling#

If you wish to subsample the reads in your samples prior to processing (for example, to ensure equal numbers of reads in each sample for inter-library comparison purposes), add the following parameter, indicating the maximum number of reads that you want to retain on a single lane for a single sample:

--additional-params subsample=50000000

Important

This subsampling is performed per sample, per lane, so in the above example, if there were 4 lanes, then this would limit the processing to 50M reads per sample per lane, so 200M reads in total per sample.

Alternative Supported Reference Genomes#

By default, the duet pipeline will align against a reference file containing the human reference genome GRCh38 and the sequences of the spike-in controls supplied with the kit. It is also possible to run the pipeline against other available reference genomes (e.g. mouse), first by downloading the preferred reference data from biomodal:

biomodal reference list
biomodal reference download GRCm38p6

Then the following flag can be used when launching the pipeline:

--reference-genome GRCm38p6

Alternatively you can create your own reference genome by following the instructions in the make reference genome documentation.

Supported Mouse References#

The duet pipeline supports three versions of the mouse reference: GRCm38, GRCm38p6, and GRCm39.

GRCm38 (2012): original GRCm38 with additional patch fix contigs, resulting in a large reference (>12 Gbp) requiring more resources to process data. Only use this if your study requires detailed patch-level alignment. Note that reads aligning to contigs which are not present in the primary assembly BED (including patch fix contigs) will be discarded by the duet pipeline.

GRCm38p6 (2017): final patch version of GRCm38, incorporating cumulative patch updates into the primary assembly. Use this reference for studies requiring compatibility with legacy GRCm38 tools and datasets.

GRCm39 (2020): latest mouse genome assembly without legacy patches.

Processing individual samples#

If compute resources are constrained, you may need to process individual samples in the duet pipeline; analysing each sample separately requires less CPU and RAM. Use the lib_prefix parameter to specify the prefix of the library to process, adding it to the --additional-params parameter of your biomodal analyse command as in this example:

--additional-params lib_prefix=SAMPLE1_L001
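
If you need to work through several libraries one at a time, a simple shell loop can launch one pipeline run per library prefix. This is a sketch only; the prefixes, paths, tag, and mode below are placeholders:

# Launch one duet pipeline run per library prefix (placeholder values)
for prefix in SAMPLE1_L001 SAMPLE2_L001 SAMPLE3_L001; do
  biomodal analyse \
    --input-path <path_to_fastqs> \
    --meta-file CEGX_Run_meta.csv \
    --output-path <path_to_output>/${prefix} \
    --tag ${prefix} \
    --mode 5bp \
    --additional-params lib_prefix=${prefix}
done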

Enable generation of the FASTQC report#

By default, the FastQC report is not generated by the duet pipeline. To enable generation of the report on your raw FASTQ files, please add the following parameter to biomodal analyse when launching the pipeline:

--additional-params run_by_default.fastqc.raw=true

Target enrichment#

Overview#

The biomodal duet pipeline includes a mode for processing data generated using targeted panels. The analysis pathways for processing targeted libraries are slightly different because the pipeline calculates targeted metrics and makes use of additional reference files describing the bait and target regions associated with the target panel being used.

Processing data using a supported target panel#

Relevant reference files for the following target panels are available by default:

  • Twist Pan-cancer Methylation Panel

  • Twist Human Methylome Panel

A targeted analysis for either of these panels can be achieved by passing the following parameters to the biomodal analyse command.

Parameter        Description
targeted         Set to true
targeted-panel   For the Twist Alliance Pan-cancer Methylation Panel set to: twist_cancer. For the Twist Human Methylome Panel set to: twist_methylome

Here is an example biomodal analyse command with these additional parameters passed in:

biomodal analyse \
  --input-path [path_for_input_files] \
  --output-path [path_for_output_files] \
  --meta-file [meta_file_name] \
  --run-name [name_for_run] \
  --tag [date_time_or_other_tag] \
  --additional-profile deep_seq \
  --targeted true \
  --targeted-panel twist_cancer

For any other target panels, it is still possible to perform a targeted analysis, but it is necessary to generate relevant reference files first. The remaining instructions describe how to generate the necessary reference file for alternative target panels and how to launch the pipeline in these circumstances.

Processing data using alternative target panels#

Three additional files are required when running the duet pipeline in targeted mode with alternative target panels. These are the bait and target interval files and a BED file containing the target coordinates. The target and bait intervals are used by gatk CollectHsMetrics to calculate target metrics including target coverage and target enrichment. The BED file is used to filter methylation and variant calls down to the targeted regions of interest.

Generating bait and target interval files#

Bait and target interval files are generated from BED files containing bait and target coordinates with the tool picard BedToIntervalList as shown below:

# Make probes intervals
picard BedToIntervalList I=probes.bed \
O=reference_probe.intervals \
SD=reference.dict

# Make targets intervals
picard BedToIntervalList I=targets.bed \
O=reference_targets.intervals \
SD=reference.dict

Note that this command should be run independently of the biomodal duet multiomics solution CLI in an environment with picard available.

This process requires an additional genome dict file which is created from a FASTA file with your reference genome of interest:

picard CreateSequenceDictionary \
R=reference.fasta \
O=reference.dict

If you are working with the reference files provided with the pipeline, then you can obtain a genome dict file from the following locations in the reference file directory:

Human: duet/duet-ref-0.1.0/reference_genomes/GRCh38Decoy/GRCh38Decoy.dict

Mouse: duet/duet-ref-0.1.0/reference_genomes/GRCm38p6/GRCm38p6.dict

Launching the duet pipeline in targeted mode#

You should use the same profiles and commands when launching the duet pipeline in targeted mode as you would in non-targeted mode, except that additional parameters and file paths need to be supplied as key-value pairs using the --additional-params option when running the pipeline in targeted mode with UMIs.

Parameter           Description
targeted            Set to true
targeted_bait       The path to the bait interval file generated from the steps described above
targeted_target     The path to the target interval file generated from the steps described above
targeted_bait_bed   The path to the BED file with bait coordinates

These additional parameters should be supplied as key-value pairs to the biomodal analyse command when launching the pipeline by using the --additional-params option, as in the following example:

biomodal analyse \
  --input-path [path_for_input_files] \
  --output-path [path_for_output_files] \
  --meta-file [meta_file_name] \
  --run-name [name_for_run] \
  --tag [date_time_or_other_tag] \
  --additional-profile deep_seq \
  --targeted true \
  --additional-params targeted_bait=[path_to_file],targeted_target=[path_to_file],targeted_bait_bed=[path_to_file]

Automation#

You can automate the auth and init stages of the biomodal CLI for CI and test purposes.

“biomodal auth” stage#

When running a biomodal auth command you can optionally pass in username and password credentials using the command line flags below:

auth                                Login to biomodal
   -u | --username <email>           username
   -p | --password <password>        password
   --password-stdin                  Take the password from stdin (recommended)

Usernames and passwords can also be passed in as the environment variables BIOMODAL_USERNAME and BIOMODAL_PASSWORD. Environment variables will always take precedence over values supplied on the command line, so please bear this in mind if using this method.
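
For example, in a CI script the credentials can be supplied on stdin or via environment variables; the email address and the MY_SECRET_PASSWORD variable below are placeholders:

# Option 1: pass the password on stdin (recommended)
echo "$MY_SECRET_PASSWORD" | biomodal auth --username user@example.com --password-stdin

# Option 2: let the CLI read the credentials from environment variables
export BIOMODAL_USERNAME=user@example.com
export BIOMODAL_PASSWORD="$MY_SECRET_PASSWORD"
biomodal auth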

“biomodal init” stage#

When re-running the biomodal init command you can skip the prompts for setting your temp directory and init folder location by setting the environment variable IGNORE_INIT_PROMPTS=true. Please note that you have to either edit the config file or follow the prompts the first time the script is run.

You can also do this by adding ignore_init_prompts: true to your config.yaml file.

You can also skip biomodal CLI prompts for error strategy by setting an error strategy in your config.yaml file.

Normally you’ll be prompted to choose either a normal or failfast error strategy. If you set either of these values in your config.yaml file, the biomodal CLI will skip this prompt:

error_strategy: <normal|failfast>

Alternatively you can set the environment variable ERROR_STRATEGY to normal or failfast.

Finally you can also define whether to share events and metrics reports by either setting the SHARE_METRICS environment variable to true or false, or by setting the value directly in the config.yaml file:

share_metrics: <true|false>

If neither of these is set, the biomodal CLI will require user input to set these values.
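
Putting these together, a config.yaml prepared for non-interactive use might contain entries such as the following (the values shown are illustrative):

# config.yaml entries for non-interactive (CI) use
ignore_init_prompts: true
error_strategy: normal
share_metrics: false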

Note

Settings in the config.yaml file take precedence over environment variables. If both are set but their values differ, you may experience unexpected results.

Reference genome pipeline#

You can download the reference genome pipeline from the biomodal CLI. The reference genome pipeline is used to create new reference genomes for the duet pipeline. It is a separate pipeline from the duet pipeline and is not required to run the duet pipeline using the default reference genomes provided by biomodal.

Checking available reference pipeline versions#

The reference pipeline is not downloaded by default during a new biomodal software install. To view the list of compatible reference pipeline versions that can be downloaded in the next step, run the following command:

biomodal reference pipeline list

Reference pipeline download#

This step will download and set up the necessary files and scripts to run the reference pipeline.

To download a new version of the reference genome pipeline, run the following command:

biomodal reference pipeline download <version>

The version number should be the version you want to download. You can find the available versions by running the biomodal reference pipeline list command.

Please note that if you have already downloaded a version of the reference genome pipeline, downloading an update of the duet pipeline will also download updates to the reference pipeline should this be required.

Downloading this pipeline requires a new parameter entry, ref_pipeline_version, in the config.yaml file. This will be added automatically when you download the reference pipeline. Please note that the reference pipeline version is not the same as the duet pipeline version.

Make new reference genomes#

Running the reference genome pipeline#

To make new reference genomes using the reference genome pipeline, run the following command:

biomodal reference make \
  --input-path <your-input-path> \
  --output-path <your-output-path> \
  --species <reference-genome-species> \
  --reference-genome <reference-genome-official-name>

Parameter            Description
--input-path         Path for the reference genome gzipped FASTA file
--output-path        Custom output disk location or bucket url
--species            Required: reference species, e.g. Homo_sapiens, Mus_musculus
--reference-genome   Genome Reference Consortium official name, e.g. GRCh38Decoy, hs38DH

Please note that all parameters are required to run this pipeline.

The reference pipeline will not write directly to the central duet pipeline reference location, but will write to a staging location specified by the --output-path parameter. This enables manual verification of the output before moving it to the reference location, and it prevents the pipeline from needing write access to the reference location.

Reference genome input and output files requirements#

Reference genome input file requirements#

To run the pipeline on a new species, it is necessary to generate new reference files by running the reference file pipeline. This takes a gzipped FASTA file for the reference genome.

With the exception of the new FASTA reference genome file, all of the other required files are supplied with the duet pipeline and are assumed to have been downloaded and unpackaged according to the instructions supplied. These files will be automatically detected when the Alternative Reference Genome Pipeline is run.

The reference genome for the species required should be obtained from an appropriate source, e.g. the Genome Reference Consortium or by searching the Ensembl or EnsemblPlants genomes database.

When you have downloaded the gzipped FASTA file, please provide the full location and filename in the --input-path parameter. The pipeline will automatically detect the file and use it to generate the reference genome.

Some further notes about preparing the FASTA reference genome:

  • Some sources supply reference genomes in FASTA format but not gzipped, for example inside a zip bundle with some metadata files. In these cases, you will need to gzip your FASTA file before you can use it as an input to the reference genome pipeline by running gzip /path/to/my/reference.fasta

  • The contig names inside the FASTA file (the lines that start with the > character) are important since they are used inside the duet pipeline. Sometimes these contigs have long names; so that the duet pipeline can run effectively, you will need to either rename these contigs to match those you add to the genome config file or supply a mapping from the long names to the names you supply in the config. For example, your contig name might be >NC_044371.1 Cannabis sativa chromosome 1, cs10, whole genome shotgun sequence and you might want to rename it to >chr1 so that you can just add chr1 to the list of contigs in the genome config file (a renaming sketch is shown after this list).

  • Make sure the contigs in the reference FASTA contain what you expect:

    • Not all reference downloads contain the mitochondrial chromosome, which might be important for your analysis.

    • There may be other contigs that are needed for your reference that are downloaded separately, such as the chloroplasts for a plant genome.

    • Many references contain unmapped scaffold contigs; these may be important but should probably be excluded from the list of contigs you supply in the genome config file.

    • You can inspect the contigs by running grep ">" /path/to/my/reference.fasta.
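
As an illustration of the renaming step mentioned above, the following sketch truncates each FASTA header to its first whitespace-separated token and then renames a specific accession; the accession, chromosome name, and file names are placeholders for your own genome:

# Truncate headers to their first token, then rename one accession to a short chromosome name
zcat reference.fasta.gz \
  | sed -E 's/^(>[^[:space:]]+).*/\1/' \
  | sed -e 's/^>NC_044371\.1$/>chr1/' \
  | gzip > reference.renamed.fasta.gz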

Reference genome output file locations#

The output files will be stored in the location specified by the --output-path parameter above. Here you will find the following top level sub-directories:

Location                                Description
reference_genomes/<reference-genome>    Reference genome output
ss_ctrls/<ctrls>                        Controls output

In the reference genome output location, the files are:

Location                                     Description
<reference-genome>-ss-ctrls-<ctrls>.fa.gz    Combined reference genome and controls FASTA
<reference-genome>-ss-ctrls-<ctrls>.fa.fai   Combined reference genome and controls FASTA index
<reference-genome>.bed                       BED file containing reference genome contig definitions
<reference-genome>.chrom.sizes               List of reference genome contig sizes
<reference-genome>.chrom.txt                 List of reference genome contig names
<reference-genome>-ss-ctrls-<ctrls>.dict     List of both reference genome and controls contigs in dictionary format
<reference-genome>_primary_assembly.bed      Primary assembly BED for the reference genome. This will be exactly the same as the reference genome BED, except for human, mouse or zebrafish

After the pipeline has run successfully, the reference genome files, together with a set of existing couplet q-table files and controls, should be copied to the shared reference genomes folder where the existing genomes are held. This is the same location as the default reference genomes provided when you installed the duet pipeline (<default bucket/folder location of your pipeline data>/reference_files/). Example using a GCP storage bucket:

# Set the relevant reference genome version
REF_VERSION=1.0.1

# Copy the reference genome files to the shared reference genomes folder
gcloud storage cp --recursive \
 "gs://bucket-name/reference-output-folder/reference_genomes/GRCz11/*" \
 gs://bucket-name/reference_files/${REF_VERSION}_GRCz11/duet/duet-ref-${REF_VERSION}/reference_genomes/GRCz11/

# Copy existing couplet qtables to the new reference genome folder
gcloud storage cp --recursive \
 gs://bucket-name/reference_files/${REF_VERSION}_GRCh38Decoy/duet/couplet \
 gs://bucket-name/reference_files/${REF_VERSION}_GRCz11/duet/

# Copy existing controls to the new reference genome folder
gcloud storage cp --recursive \
 gs://bucket-name/reference_files/${REF_VERSION}_GRCh38Decoy/duet/duet-ref-${REF_VERSION}/ss_ctrls \
 gs://bucket-name/reference_files/${REF_VERSION}_GRCz11/duet/duet-ref-${REF_VERSION}/

The controls output files are:

Location                         Description
ss-ctrls-long-v24.bed            BED file containing long control contig definitions
ss-ctrls-long-v24.chrom.txt      List of long control contig names
ss-ctrls-long-v24.chrom.sizes    List of long control contig sizes
ss-ctrls-short-v24.bed           BED file containing short control contig definitions
ss-ctrls-short-v24.chrom.sizes   List of short control contig sizes
ss-ctrls-short-v24.chrom.txt     List of short control contig names

Note that the controls output files will be exactly the same as those already provided with the duet pipeline.

The original input files are also copied to the reference genome/controls subdirectory, as applicable.

Reference genome config file#

Once the pipeline has been run successfully and the outputs checked, the reference genome directory (i.e. <output-path>/reference_genomes/GRCz11) should be copied to the reference_genomes folder where the existing genomes are held.

The CLI will generate a config file template. The alternative reference can be used by providing a set of specified fields to the pipeline in the correct format, for example:

params {
  genomes {
    '<<ref_genome>>' {
        contigs          =    ''
        mapping          =    false
        autosomes        =    ''
        mitochondria     =    ''
        primary_assembly =    true
        primary_assembly_bed = "<<ref_genome_path>>/duet/duet-ref-<<ref_pipeline_version>>/reference_genomes/<<ref_genome>>/<<ref_genome>>_primary_assembly.bed"
        target_intervals =    false
        species          =    '<<species>>'
    }
  }
}

This reference genome template file should be edited to include the correct sections for the new reference genome. For any new reference genome, you will need to define these parameters as follows:

Field

Description, Usage and Optionality

contigs

A list of chromosomes. This is used: (1) during methylation quantification as a specification of the primary chromosomes over which summary methylation quantification metrics should be calculated: mean CpG methylation rates will be calculated for each chromosome in this list; (2) during variant calling as a specification of the ‘large’ primary assembly contigs. This information is used to support parallelisation of variant calling processes: variant calling will be performed separately in parallel for chromosomes in this list. Variant calling on any other contigs (such as alt contigs or decoy sequences) will be performed in one further parallel step.

mapping

A regular expression that can be specified to combine methylation quantification summary statistics from contigs that have a common prefix. If the regular expression matches a contig name, then quantification statistics from that contig will be attributed to the first matching group returned when the regular expression is applied to the contig name. For example, the regular expression "([^_]+)_.+" (i.e. multiple characters up to, but not including, the first underscore) when applied to the contig chr1_KI270706v1_random would return chr1 as the first matching group, so would cause quantification data associated with chr1_KI270706v1_random to be attributed to chr1.

autosomes

A list of the autosomes. This is used during methylation quantification to specify the chromosomes that should be used when calculating genome-wide methylation levels. This excludes the allosomes as they would cause a sex-bias in the calculation of genome-wide methylation levels.

mitochondria

The name of the mitochondrial chromosome in the reference genome. This is used during methylation quantification to identify the contig over which the mitochondrial methylation rate will be calculated.

primary_assembly

A Boolean field indicating whether there is a separate subset of contigs that are considered to be the ‘primary assembly’. This should be set to “false” as a distinct primary assembly is not currently generated for alternative reference genomes.

species

Binomial name for the species.

primary_assembly_bed

The path to a primary assembly bed file if relevant. Usage is described above.

<<ref_genome>>

Reference Genome Consortium name supplied when launching the reference pipeline

<<species>>

Species supplied when launching the reference pipeline

<<ref_genome_path>>

Automatically applied when launching the reference pipeline. Location of the final reference genome files. This is the shared location expected by the duet pipeline

<<ref_pipeline_version>>

Automatically applied when launching the reference pipeline, additionally stored in the CLI config.yaml file. This is set during the biomodal reference pipeline download stage

The file will be in the same data-bucket location as other CLI results and will be named after the chosen reference genome. For example, the file will be Ptrichocarpa_v4-1.config if you launched the pipeline with the reference genome Ptrichocarpa_v4-1.

Module hardware requirements#

Please see the complete overview of module requirements for the reference pipeline.

The module MAKE_BWA_MEM2_INDEX may require a large amount of memory but can be run on a single core. The BWA_MEM2 documentation recommends allocating about 28x the genome size. For example, the human genome is around 3 gigabases and therefore requires 3 x 28 = ~84 GB RAM. Note that this figure needs to include any additional decoy contigs that are part of the reference FASTA you are using.
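
If you already have a FASTA index (.fai) for the combined reference (including any decoy contigs), you can estimate this requirement from the total sequence length; a minimal sketch, assuming a standard .fai file where column 2 is the contig length:

# Estimate BWA-MEM2 indexing memory as ~28x the total genome size (in GB)
awk '{ total += $2 } END { printf "Approximate memory needed: %.0f GB\n", total * 28 / 1e9 }' reference.fa.fai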

You may have to add a resource section in the nextflow.config file in the biomodal script folder to specify the hardware requirements for the MAKE_BWA_MEM2_INDEX module in the reference pipeline.

For example:

process {
  withName: 'MAKE_BWA_MEM2_INDEX' {
    cpus   = 1
    memory = "320GB"
  }
}

Using new reference genomes in the duet pipeline#

To use this new reference genome with the biomodal duet +modC / evoC multiomics solution bioinformatics pipeline, please include the following parameters in the biomodal analyse command:

--reference-genome <reference-genome>
--reference-genome-profile <reference-genome-profile-file-path>

The --reference-genome parameter should be set to the name of the reference genome you have created. The --reference-genome-profile parameter should be set to the path of the reference genome profile you have created. The default location of this file is the output directory you specified for the biomodal reference make command.
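
For example, a biomodal analyse command using a newly created zebrafish reference might look like the following; the genome name, profile path, and other values are placeholders:

biomodal analyse \
  --input-path [path_for_input_files] \
  --output-path [path_for_output_files] \
  --meta-file [meta_file_name] \
  --tag [date_time_or_other_tag] \
  --mode 5bp \
  --reference-genome GRCz11 \
  --reference-genome-profile [path_to_reference_output]/GRCz11.config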

D(h)MR calling workflow#

Overview#

  • The D(h)MR calling workflow is used to identify differentially methylated or differentially hydroxymethylated regions between groups of samples analysed using the duet pipeline

  • The D(h)MR calling workflow uses the Zarr store output from the duet pipeline as an input. It can also take multiple Zarr stores from different executions of the duet pipeline (e.g. from processing different sequencing runs) as an input

  • The D(h)MR calling workflow requires a sample sheet listing the samples to be used for D(h)MR calling, the condition associated with each sample, and information about covariates that should be taken into account

  • Samples are grouped by condition and DMR calling is performed between all pairwise combinations of conditions

  • A typical example might be to call DMRs between a set of cancer samples and a set of healthy samples, accounting for sex and age as covariates

  • With duet +modC, DMRs are called based on modC (modC is the union of mC and hmC modifications)

  • With duet evoC, mC is distinguishable from hmC, and DMR calling is performed separately for mC modifications and for hmC modifications. It is also possible to call DMRs based on modC - i.e. on the union of mC and hmC calls

Running the D(h)MR calling workflow#

The D(h)MR calling workflow is run using the biomodal call_dmr command:

call_dmr                               Run D(h)MR analysis
--input-path <input_path>              Required: full custom bucket uri or directory path with Zarr stores to be analysed
--dmr-sample-sheet <sample_sheet_name> Required: full path and name of the DMR sample sheet
--output-path <output_path>            Required: custom output disk location or bucket uri
--tag <tag>                            Required: custom tag to identify this DMR run
--mode <5bp|6bp>                       Required: 5 base vs 6 base mode, default is '5bp'
--condition <condition(s)>             Required: a single DMR sample sheet column that contains the conditions between which to call DMRs
--covariates <covariates>              Optional: one or more DMR sample sheet columns that contain covariates to account for during DMR calling
--dmr-bed-path <dmr_bed_path>          Optional: a path to a bed file defining regions that DMR calling should be restricted to
--evoc-modifications <mc|hmc|modc>     Optional: single or comma separated list of duet evoC modifications to call, 'mc' or 'hmc' in 6bp mode only
--min-depth <min_depth>                Optional: contexts will only be removed if coverage <= min-depth in ALL SAMPLES, default is '0'
--window-size <window_size>            Optional: window size for DMR analysis, default is '1000'
--additional-params <params>           Optional: additional single or comma separated parameters, e.g. 'param1=value1,param2=value2'
--run-name <run_name>                  Optional: custom run name to identify this analysis run
--resume                               Optional: resume previous run using successfully completed tasks, default is 'false'
--work-dir                             Optional: override the default path where the workflow temporary data is stored

Small example:

biomodal call_dmr \
  --input-path gs://my-bucket-name/zarr_files/ \
  --output-path gs://my-bucket-name/zarr_files/output \
  --dmr-sample-sheet gs://my-bucket-name/zarr_files/dmr_sample_sheet.tsv \
  --dmr-bed-path gs://my-bucket-name/zarr_files/my_bedfile.bed \
  --mode 5bp \
  --condition disease_state \
  --tag dmr_run1 \
  --run-name dmr_run1

D(h)MR calling workflow input and output file requirements#

D(h)MR calling workflow input file and parameters#

Path to directory containing your Zarr file(s)

  • This should be a path to a directory containing the multi-sample Zarr store that was generated by running the duet pipeline, i.e. the sample_outputs/zarr_store/ subdirectory of the duet pipeline outputs

  • This can also be a directory where you have brought together several multi-sample Zarr stores generated from running the duet pipeline (for instance if you have separately processed multiple sequencing runs)

  • The Zarr store is a multi-sample, multi-dimensional, compressed, indexed data store.

Single type of multi-sample Zarr store for each workflow run

Please ensure that the input path includes only Zarr stores of a single cytosine context (CG, CHG, or CHH) before running the call_dmr workflow. Separate Zarr stores of different cytosine contexts into distinct folders or paths as needed.

DMR sample sheet

  • The DMR sample sheet should be a tsv file with a header, and with one row per sample that you want included in the DMR calling

  • The columns should include a sample_id column, a column featuring the conditions between which you wish to call DMRs, and (optionally) columns for any covariates associated with your samples. DMR calling will be performed between each pairwise comparison of conditions

  • Please note that group cannot be used as a column name as this is a reserved word

  • Here is an example of a sample sheet that would be suitable for calling DMRs between healthy and cancer samples, accounting for sex and age as covariates:

sample_id    disease_state   sex    age
BIO-A1       healthy         M      62
BIO-A2       healthy         M      58
BIO-A3       healthy         F      66
BIO-A4       cancer_stage1   F      66
BIO-A5       cancer_stage1   M      65
BIO-A6       cancer_stage1   M      55
BIO-A7       cancer_stage4   F      61
BIO-A8       cancer_stage4   F      59
  • Using the above example with the parameters --condition disease_state --covariates sex,age, DMR calling would be performed sequentially for “healthy vs cancer_stage1”, “healthy vs cancer_stage4”, and “cancer_stage1 vs cancer_stage4”, using “sex” and “age” as covariates
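
For example, a call_dmr command using this sample sheet with covariates might look like the following (the bucket paths are placeholders):

biomodal call_dmr \
  --input-path gs://my-bucket-name/zarr_files/ \
  --output-path gs://my-bucket-name/zarr_files/output \
  --dmr-sample-sheet gs://my-bucket-name/zarr_files/dmr_sample_sheet.tsv \
  --mode 6bp \
  --condition disease_state \
  --covariates sex,age \
  --tag dmr_covariates_run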

DMR bed path

  • This should be a path to a BED file containing regions that DMR calling should be performed on. DMR calling will be performed between each condition for each region listed in the BED file. --dmr-bed-path cannot be used in conjunction with --window-size. For instance, you may wish to restrict DMR calling to a set of promoters that you are interested in

  • BED format is zero-based and half-open, meaning the first position on a contig is 0, and the end coordinate is excluded from the interval.

    Note

    Some methylation quantification outputs from the duet pipeline (e.g. cxreport) have 1-based positions.

    chr1        1000171         1001137
    chr1        1019119         1020119
    chr1        1172879         1173879
    chr1        1232030         1232236
    chr1        1279435         1280435
    chr1        10031831        10032831
    chr1        10209569        10210569
    chr1        10397591        10398591
    

If you wish to perform Differentially Methylated Cytosine (DMC) analysis, i.e. single-CpG regions, you will need to pass a BED file in which the regions are single CpGs, for example:

Note

The input DMR BED cannot contain more than 5 million intervals.

chr1  10031845        10031846
chr1  10031846        10031847
chr1  10031915        10031916
chr1  10031916        10031917
chr1  10031917        10031918
chr1  10031918        10031919
chr1  10031925        10031926
chr1  10031926        10031927
chr1  10031947        10031948
chr1  10031948        10031949
chr1  10031949        10031950
chr1  10031950        10031951

Window size

  • If a dmr-bed-path is not supplied, DMRs will be called genome-wide. The genome will be split into window-size bp (default = 1000) regions to perform DMR analysis on.

    Note

    If a DMR BED is supplied, window-size will be ignored

  • Reducing the window size for genome-wide DMR calling increases the runtime and memory requirement. The minimum supported window size is 200bp

evoC modifications

  • In 5bp mode (duet +modC) DMRs can only be called on modC. In 6bp mode (duet evoC) DMRs will be called on mC and hmC separately by default. However, it is possible to call DMRs on any combination of mC, hmC or combined modC

  • To call hmC DMRs only at enhancer regions use --dmr-bed-path full-path/to-your/enhancer-intervals.bed --evoc-modifications hmc

  • To call mC DMRs only at promoter regions use --dmr-bed-path full-path/to-your/promoter-intervals.bed --evoc-modifications mc

  • To call mC, hmC, and combined modC DMRs on 500bp regions genomewide use --window-size 500 --evoc-modifications mc,hmc,modc

Min Depth

If --min-depth is supplied, CG contexts will be removed only if coverage <= min-depth in ALL SAMPLES. Note: this is stranded CG coverage

D(h)MR calling workflow output file formats and locations#

D(h)MR output BED

Field

Description

Chromosome

Intervals over which DMR is called

Start

Start position of the interval

End

End position of the interval

Name

Reserved for future use

Score

Reserved for future use

Strand

Reserved for future use

num_contexts

Number of CpG contexts in the interval

mean_coverage_across_samples

Mean CpG coverage across all samples

mean_mod_group_1

Mean modification fraction in group 1

mean_mod_group_2

Mean modification fraction in group 2

mod_fold_change

Fold change between group 1 and group 2

mod_difference

Difference between mean_mod_group_1 and mean_mod_group_2

test_statistic

Difference between the log likelihood of the full model (with group specified) and the null model (without group specified)

dmr_pvalue

Unadjusted p-value

dmr_qvalue

Adjusted p-value (Benjamini-Hochberg)

If the above DMR sample sheet was analysed in 6bp mode, we would expect the following outputs to be generated in the --output-path directory:

DMR_num_mc_healthy_cancer_stage1.bed
DMR_num_mc_healthy_cancer_stage4.bed
DMR_num_mc_cancer_stage1_cancer_stage4.bed
DMR_num_hmc_healthy_cancer_stage1.bed
DMR_num_hmc_healthy_cancer_stage4.bed
DMR_num_hmc_cancer_stage1_cancer_stage4.bed

D(h)MR output summary

An additional summary.txt file is generated per output BED file which lists metadata about the DMR run, including the version of the biomodal software package used, the input files, and the full commands used. It also reports the number of DMRs that pass a q < 0.05 significance threshold.

Appendix#

Event codes shared with biomodal#

Event code   Optional?   Description
100          No          Downloaded the biomodal CLI package, via the biomodal API
101          Yes         Completed running the biomodal init command
102          Yes         Completed running the biomodal test command
103          Yes         Completed running the biomodal analyse command
104          Yes         Completed running the biomodal dmr_call command
105          Yes         Completed running the biomodal reference make command
200          Yes         Customer preferred to automatically share events and metrics reports. Recorded once during the biomodal init step
201          Yes         Customer preferred to not share events and metrics reports automatically. Recorded once during the biomodal init step
202          No          User from the biomodal CRM system has registered for a software download and authentication account, via biomodal API

Additional available workflows (advanced users only)#

The nf-core community has several publicly available Nextflow pipelines. The nf-core pipelines are designed to be easy to use and are well documented. You can find a list of the nf-core pipelines at https://nf-co.re/pipelines.

nf-core pipelines may have a similar technical setup to your pipeline execution environment, so it is worth checking whether there are relevant Nextflow pipeline settings and parameters you can apply to your Nextflow environment.

Glossary of terms#

Apptainer#

An open-source container platform designed for high-performance computing (HPC) environments, enabling users to package entire scientific workflows including software, libraries, and data, into a portable format.

BAM file (Binary Alignment Map)#

A binary version of a SAM file that stores aligned sequencing reads. Commonly used for efficient storage and analysis of sequence alignment data.

BaseSpace#

A cloud-based platform developed by Illumina for storing, analysing, and sharing genomic data generated by Illumina sequencers.

Command Line Interface (CLI)#

A text-based interface used to interact with software and operating systems by typing commands.

Conda#

An open-source package and environment management system that helps install and manage software dependencies, particularly in Python and R-based projects.

CPU (Central Processing Unit)#

The part of a computer that performs most of the processing. In bioinformatics, CPU power is important for running compute-intensive tasks.

CRAM file#

A compressed version of a BAM file that reduces file size by using reference-based compression. Used for efficient storage of sequencing alignment data.

D(h)MR (Differentially [hydroxy]Methylated Region)#

Genomic regions that show differences in DNA methylation or hydroxymethylation levels between samples or conditions.

Docker#

A platform that uses container technology to package software and its dependencies into isolated environments, enabling reproducible execution across systems.

DISK#

Refers to the amount of disk storage space required or allocated for computational tasks, such as storing large sequencing files.

FASTA file#

A text-based format for representing nucleotide or protein sequences. Each entry starts with a header line followed by the sequence.

FASTQ file#

A text format that stores both biological sequences (usually from next-generation sequencing) and corresponding quality scores.

GCP (Google Cloud Platform)#

A suite of cloud computing services by Google, often used for running scalable bioinformatics analyses and managing data.

High-Performance Computing (HPC)#

A computing environment that uses clusters of powerful computers to perform large-scale or resource-intensive computations.

JAVA#

A widely used programming language and computing platform often required by tools in bioinformatics pipelines, such as GATK.

Linux#

An open-source operating system commonly used in bioinformatics and server environments due to its stability and flexibility.

Nextflow#

A workflow management system designed for reproducible and scalable data analysis pipelines. Supports execution on local, cluster, and cloud environments.

nf-core pipelines#

Community-curated, standardised, and reproducible Nextflow pipelines for a wide range of bioinformatics analyses.

Orchestration VM#

A virtual machine responsible for managing or coordinating the execution of workflows and resources, especially in cloud-based analyses.

RAM (Random Access Memory)#

A type of computer memory that stores working data and code. Larger RAM allows more data to be processed simultaneously, which is critical for many bioinformatics tasks.

Screen#

A terminal multiplexer that allows multiple command-line sessions to be managed within a single terminal window. Useful for long-running processes.

Singularity#

The predecessor to Apptainer, Singularity is a container technology optimised for HPC and scientific computing environments.

Terraform#

An infrastructure-as-code tool that automates the provisioning and management of cloud infrastructure.

tmux (Terminal Multiplexer)#

Like Screen, tmux lets users run multiple terminal sessions in one window, detach and reattach to them, and manage long-running jobs.

TSV file (Tab-Separated Values)#

A plain text format for storing data in a tabular form, where each field is separated by a tab character.

VCF file (Variant Call Format)#

A standardised text file format for storing gene sequence variations, such as SNPs and indels, detected in sequencing data.

Virtual Machine (VM)#

A software-based emulation of a physical computer that runs an operating system and applications in an isolated environment.

Wall-time#

The actual elapsed time a task takes to complete from start to finish, as perceived by the user, regardless of CPU time used.

Zarr file#

A format for storing large N-dimensional arrays in a compressed, chunked manner. Useful in genomics for storing large datasets like 3D imaging or genomics matrices.