duet software installation and running guide#

For analysis of libraries sequenced on an Illumina platform and prepared using:

  • biomodal duet multiomics solution +modC

  • biomodal duet multiomics solution evoC

This documentation is compatible with duet pipeline v1.4

Warning

For Research Use Only. Not for use in diagnostic procedures.

Getting started with the biomodal duet pipeline#

Overview#

The biomodal duet multiomics pipeline is a bioinformatics workflow designed to analyse FASTQ files produced by next generation sequencing of the libraries generated using biomodal duet multiomics solutions. Before running the pipeline, you need to install the biomodal Command Line Interface (CLI) on a cloud platform or High-Performance Computing (HPC) cluster.

About this guide#

This guide walks you through:

  • Downloading the required files

  • Installing the CLI

  • Preparing your compute environment

  • Using the CLI

  • Importing the pipeline

  • Updating the pipeline

Important

The biomodal duet multiomics pipeline runs in a Linux environment and we recommend the use of a recent version of Ubuntu or CentOS/Rocky/RHEL. It is recommended that setup and installation be performed by your system administrator (i.e., the person responsible for managing your organisation’s IT systems.)

Sections which need to be completed by a system administrator will be shown by the following:

Caution

This section contains information intended for system administrators

Once the setup is complete, you’ll be able to run your analyses with minimal technical involvement. If you need support with any of the steps or installation, please contact us at support@biomodal.com.

What’s included in the biomodal duet pipeline#

Pipeline components

  • Standard and bespoke trimming of duet +modC and duet evoC FASTQ files

  • FASTQ resolution to convert raw paired-end duet +modC and duet evoC reads into resolved, error-suppressed 4-base genomic reads with accompanying epigenomic information

  • Reference genome and control alignment using BWA_MEM

  • Lane-merging of aligned BAM files

  • Forking of aligned genome reads and control reads into separate BAM files and analysis pathways

  • Deduplication of the genome and long control BAM files

Pipeline outputs

  • Modified cytosine quantification in CpG context

  • Optional modified cytosine quantification in CHG and CHH contexts

  • Germline variant calling with optional joint variant calling, somatic variant calling, and allele-specific methylation calling

  • Genetic accuracy analysis using controls

  • Analysis and generation of quality metrics associated with the genome-aligned reads and the control-aligned reads

  • Generation of summary reports in Excel and HTML formats

About the pipeline and Command Line Interface (CLI)#

Important

The biomodal CLI and duet pipeline is not supported on Windows platforms. We recommend you use a recent version of Ubuntu or CentOS/Rocky/RHEL.

The biomodal pipeline and CLI utilise Nextflow as the orchestration tool that will leverage the capabilities of your compute platform to process your data through the duet pipeline. Nextflow acts as an orchestrator, performing the following tasks:

  • Launching Virtual Machines or HPC nodes

  • Copying input files to the virtual machines or HPC nodes

  • Downloading and running container images with appropriate software dependencies

  • Executing analyses on the virtual machines or HPC nodes

  • Transferring outputs from analyses to local or cloud storage

  • Organising output files into a convenient directory structure

  • Coordinating the channelling of outputs from one stage of analysis to the next

The following instructions are intended to be executed at the command line with Bash on a Linux platform, or via a Cloud Shell in a browser. Bash commands follow the structure command [OPTIONS] arguments. If you prefer to use the graphical user interface (GUI) of your cloud provider console in the browser, searching for the equivalent command line instructions will lead you to the platform-specific documentation describing the steps to take in the relevant cloud console.
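For example, a typical invocation of the biomodal CLI follows this structure (the paths here are placeholders; the analyse subcommand and its options are described later in this guide):

biomodal analyse --input-path /data/my_run --output-path /data/my_run --mode 5bp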

Sharing event & metric data with biomodal#

To help ensure the best possible user experience and fast technical support, we highly recommend enabling event and metrics sharing. This opt-in feature allows biomodal to better understand how the CLI is being used in real-world environments and to proactively offer support, improvements, and bug fixes. By sharing usage and performance metrics, you’re helping our support and development teams:

  • Identify and troubleshoot issues more quickly

  • Deliver targeted improvements and updates

  • Monitor pipeline health and performance across environments

  • Benchmark and tune future versions of the software

If you opt in, the CLI will send non-sensitive metadata back to biomodal. This includes:

  • Installation and usage events e.g. when the CLI is downloaded, installed, or run

  • Pipeline run status and configuration, such as success/failure, pipeline version, resource usage, and execution times

What we do NOT collect:

  • FASTQ, BAM, VCF, or any input/output files

  • Personally identifiable information related to the pipeline sample data

We do not share any data with third parties and all event information is registered against your user account within biomodal’s internal systems, accessible by our internal support team. For more information, please see Event codes shared with biomodal.

Enabling event and metric sharing helps us to help you — the more we understand about how the pipeline is used in different environments, the faster we can respond to issues and improve future releases. Please let your system administrator know whether to enable event sharing as part of the installation.

How to opt in

During the CLI installation process (biomodal init), you will be asked:

Would you like to share the pipeline metrics report at
the end of all successful analysis runs with biomodal? (1-2)

1) Yes
2) No
#?

Type 1 to opt in. The CLI will automatically handle event registration for future runs.

How to opt out

You can decline metrics sharing at two points:

  1. During biomodal init, select “2” when prompted

  2. Manually, by editing your config.yaml file and updating the following line:

share_metrics: false

If you have any questions or concerns about event and metric sharing, please contact support@biomodal.com.

Support#

Feel free to contact us at support@biomodal.com. If your inquiry is related to the CLI or duet pipeline software, please include the output of the biomodal info command in your inquiry.

Downloading the installation script#

Before installing the CLI, you need to download the biomodal installation package. This section helps you:

  • Choose the correct script for your system (HPC or Cloud)

  • Confirm software requirements

  • Download and unzip the installation files

Important

The following section involves cloud infrastructure setup or requires HPC admin privileges so must be performed by a system administrator. Once the setup is complete, the analyses can be run with minimal technical involvement.

If you need help with any of the steps or installation, please contact us at support@biomodal.com.

Choose your environment#

Caution

This section contains information intended for system administrators

The biomodal duet pipeline supports two environments:

  • HPC clusters (Ubuntu, CentOS/RHEL supported)

  • Cloud platforms (AWS, GCP, Azure)

Use the flowchart below to help you decide which installation method is right for you:

Flowchart to help you choose the right installation method

Check software requirements & permissions#

Caution

This section contains information intended for system administrators

HPC installation requirements#

Make sure your HPC system has the following tools installed:

  • bash: Command execution. Version 4+.

  • Java (more information here): Required by Nextflow. Version 8+; please check that this is supported by your version of Nextflow.

  • Nextflow (more information here): Pipeline orchestrator. Version 21.06 onwards.

  • Singularity or Apptainer: Container runtime. Please check that this is supported by your version of Nextflow. The biomodal CLI expects to find Singularity in $PATH, so please make sure that Apptainer installations include a symlink to the Apptainer binary named singularity.

  • Google Cloud CLI (more information here): Software download and authentication. This should be a recent version that supports the gcloud storage command, including the required gcloud-crc32c libraries.

  • tar, unzip: File compression tools.

  • rsync: File transfer.

  • curl: Software download.

  • jq: JSON parsing.

  • Other OS support packages: gnupg, lsb-release, apt-transport-https, ca-certificates.

Some local Conda installations may require additional Google SDK components to ensure the integrity of data being downloaded. These can be installed using:

gcloud components install gcloud-crc32c
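As a quick, hedged sanity check before installing, the following sketch confirms the key dependencies are on $PATH and, if only Apptainer is present, creates the singularity symlink the CLI expects (the symlink location /usr/local/bin is an assumption; adjust to a directory on $PATH that you control):

# Quick sanity check: confirm the required tools are visible on $PATH
for tool in bash java nextflow singularity gcloud tar unzip rsync curl jq; do
  command -v "$tool" >/dev/null 2>&1 && echo "found: $tool" || echo "MISSING: $tool"
done

# If only Apptainer is installed, expose it to the CLI as 'singularity'
# (the target directory is an example; use any directory on $PATH you control)
command -v singularity >/dev/null 2>&1 || \
  sudo ln -s "$(command -v apptainer)" /usr/local/bin/singularity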

Cloud installation requirements#

Cloud native software will be utilised on each respective platform to set up the complete pipeline environment. You will also need to install the Google Cloud CLI.

  • Google Cloud CLI (more information here): Required for all cloud installations. This should be a recent version that supports the gcloud storage command, including the required gcloud-crc32c libraries.

Cloud permissions (if using Cloud VMs)#

Caution

This section contains information intended for system administrators

We recommend that a least privilege approach is taken when providing users with permissions to create cloud resources.

The cloud specific examples below demonstrate the minimum required permissions to bootstrap resources for AWS, GCP, and Azure environments.

AWS#
  • Run the following script to create a complete AWS pipeline environment:

    ./biomodal-cloud-utils create aws
    
  • Ensure GitHub repository access is allowed, and public IP assignment is enabled for AWS Batch. This feature can be manually enabled in the console, via the Auto-assign public IPv4 address option in the subnet’s configuration.

  • This will use an additional Terraform module we provide for resource setup (located here)

GCP#

We recommend using GCP’s predefined roles. These roles are created and maintained by Google, so you do not need to create custom roles.

Below are the recommended predefined GCP roles required to create the cloud resources for running this CLI.

  • roles/storage.admin: Create a new bucket, if required. Required: No (unless a new bucket is specified).

  • roles/artifactregistry.writer: Create the required artifact registry. Required: Yes.

  • roles/iam.serviceAccountCreator: Create the required service account(s). Required: Yes.

  • roles/compute.admin: Create the required compute resources and instance template. Required: Yes.

  • roles/iap.admin: Allow IAP access to VMs. Required: Yes.

  • roles/resourcemanager.projectIamAdmin: Assign project-wide roles to service accounts. Required: Yes.

  • roles/storage.legacyBucketWriter: Allow read/write permission on an existing bucket. Required: No (required if you wish to use an existing bucket).

Azure#

If you are planning on using Azure as your cloud environment, please get in contact with us at support@biomodal.com for tailored setup instructions.

Download the installation files#

Caution

This section contains information intended for system administrators

We provide a simple tool to download all the required scripts directly to your local computer, HPC node, or cloud VM.

  1. Please run the following command from a Linux terminal or cloud shell:

    bash <(curl -s https://app.biomodal.com/cli/download)
    
  2. You’ll be prompted to authenticate with your existing biomodal username and password.

  3. After the download, unzip the file before you start using the CLI scripts. You can use unzip (as shown below) or another tool, such as jar xvf, to uncompress the file:

    unzip biomodal.1.1.3.zip
    

In the unzipped biomodal.1.1.3 folder, you will see the following files:

  • biomodal-cloud-utils: Script to create an AWS, GCP, or Azure cloud environment

  • biomodal-hpc-utils-admin: System administrator installation script

  • biomodal-hpc-utils-conda: Conda-based installation script

  • cli/_biomodal_functions.sh: Helper functions for the biomodal CLI

  • cli/_biomodal_validate.sh: Validator functions for the biomodal CLI

  • cli/biomodal: Main biomodal CLI script

  • clientConfig.json: Customer authentication configuration settings

  • terraform/: Folder containing Terraform scripts for each supported cloud platform

  • docs/: Folder containing a copy of all biomodal CLI documentation

  • _images/: Images used by the documentation

Authentication & configuration#

Caution

This section contains information intended for system administrators

  • Upon login, a ~/.biomodal-auth.json file will be created to control the tokens required to authorise access to the biomodal resources

  • The default location of this file is in the user's $HOME directory

  • Each user needs their own token file for running the CLI if event and metrics sharing is enabled (for more information, please see Sharing event & metric data with biomodal)

About software updates and sharing usage data#

Caution

This section contains information intended for system administrators

By default, the biomodal pipeline checks for software updates via Nextflow. To disable this feature:

export NXF_DISABLE_CHECK_LATEST=true
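To make this setting persistent across sessions, it can be appended to the relevant shell start-up file (assuming Bash and a per-user ~/.bashrc):

echo 'export NXF_DISABLE_CHECK_LATEST=true' >> ~/.bashrc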

The CLI can send runtime event logs to biomodal to help support teams troubleshoot. For more information, please see Sharing event & metric data with biomodal.

Installing the biomodal CLI#

Important

The following section involves cloud infrastructure setup or requires HPC admin privileges so must be performed by a system administrator. Once the setup is complete, the analyses can be run with minimal technical involvement.

If you need help with any of the steps or installation, please contact us at support@biomodal.com.

Preventing timeout#

Caution

This section contains information intended for system administrators

Because some analyses can run for several hours or even days, we recommend running installation and pipeline commands inside a terminal multiplexer like tmux or screen. This ensures your job won’t be interrupted if your connection drops or you close your terminal window.
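For example, with tmux (the session name is arbitrary):

# Start a named session and run the long-lived command inside it, then detach with Ctrl-b d
tmux new -s biomodal

# Reattach later to check on progress
tmux attach -t biomodal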

Installation on Cloud VMs (AWS / GCP / Azure)#

Caution

This section contains information intended for system administrators

We have created the biomodal-cloud-utils script to assist with the creation of cloud resources required to run the biomodal duet pipeline with Nextflow. The currently supported clouds are GCP, AWS, and Azure.

Please follow these steps from a Linux-compatible computer:

  1. Install Terraform: see this link for OS-specific instructions

  2. Ensure you have permissions to create cloud resources

  3. Authenticate using your cloud provider CLI:

    • GCP: gcloud auth login

    • AWS: aws configure

    • Azure: az login

  4. Run the create command for your cloud:

    ./biomodal-cloud-utils create gcp
    
    ./biomodal-cloud-utils create aws
    
    ./biomodal-cloud-utils create azure
    
  5. Connect to the new environment e.g.:

    ./biomodal-cloud-utils connect gcp
    
  6. When complete, you can choose to remove the newly created resources e.g.:

    ./biomodal-cloud-utils destroy gcp
    

Warning

IMPORTANT INFORMATION

WARNING: Removing your resources is a permanent action; only run this command when you are sure you no longer need them. Buckets to which you have added additional files should not be deleted.

Note

The cloud VM will be created with a public IP to allow for data ingress and egress. Please update this as required after software install, authentication, configuration, and testing.

Installing on HPC clusters#

Caution

This section contains information intended for system administrators

We have created two scripts to assist with installing required software and downloading the pipeline on an HPC cluster, or standalone server:

  • biomodal-hpc-utils-admin: Recommended for centralised deployment

  • biomodal-hpc-utils-conda: Used if the user lacks the necessary permissions to install software centrally, or when local user software installations need to be kept separate so they do not interfere with other installations; this option is not preferred

Important

Please check and review every step before proceeding as every HPC is different. The following instructions are intended as a guide only.

The software installation scripts described in this section are not intended to be re-run when you upgrade to a newer version of the biomodal CLI or duet pipeline. Please see Updating the biomodal pipeline for more information on how to update the CLI and pipeline software.

Restrictive quotas on local disk resources and software install limitations are likely to cause issues during installation and update operations.

Conda installation (not preferred)#

Caution

This section contains information intended for system administrators

While the CLI may be available via a Conda environment, we do not recommend installing or running the biomodal CLI this way for production use. There are several important reasons for this:

  • File size limitations: Local Conda environments often enforce lower limits on maximum file and path sizes, which can cause problems during large-scale analysis

  • Data duplication: Using a local Conda environment may result in duplicated software resources across environments, increasing disk usage unnecessarily

  • Lack of central management: Installing the CLI via Conda means it is not centrally configured or controlled, which can introduce inconsistencies across users and compute nodes in the cluster

Important

For consistency, reliability, and optimal performance, we strongly recommend running the biomodal CLI using the admin installation script (see Admin installation (recommended)) at the network or shared software module level. This ensures environment variables, shared directories, and version updates are centrally managed across all users and nodes.

When executed, the scripts will ask you to confirm that you have read the documentation and understand the implications of running them on your system.

Please confirm that you have read the documentation and
understand the implications of running this script on your system.

Are you sure you want to run this script? (y/n)

Important

Only use this if you are experienced with Conda and your cluster’s Linux environment. This method does not allow for shared installation and may lead to redundant storage usage.

Do not assume that all steps in these scripts will be an exact match for your local server or HPC nodes. If you choose to use the biomodal-hpc-utils-conda script, we recommend you use an existing working conda environment.

Note

Singularity or Apptainer must be installed outside of the Conda environment, by a root or sudo user.

We recommend using tmux or screen to keep sessions alive for long-running tasks. See Preventing timeout for more information.

To begin, run the following script:

./biomodal-hpc-utils-conda

Follow the prompts and answer the required questions:

  1. CLI installation path

    Enter cli installation path, must be a directory you have
    read/write/execute access to. : /biomodal
    

This is where the CLI software will reside on your system. Ideally choose a central location that users on the system can access if needed.

  • Default: /biomodal

  • Alternative: Choose any directory with enough free space and write access.

    Note

    The HPC users that will run the biomodal pipeline will require read and execute access to this location

  2. Data bucket path

    Enter data bucket path - shared location where ref data and images
    will be stored:
    
    /biomodal/data_bucket
    

This path will store reference data, pipeline images, and shared resources.

  • Default: /biomodal/data_bucket

  • Alternative: Choose a shared network storage location, if working in a group

This location can be re-used for other biomodal runs or installations in the future

  3. Choose the Docker registry location closest to your physical location or data centre. This ensures faster downloads and fewer errors during pipeline execution.

    Choose relevant geographical location of the biomodal docker registry
    (1-3)?
    
    1) asia
    2) europe
    3) us
    
    #?
    
  4. Set max concurrent jobs

    Current queueSize is: 200
    
    Would you like to change it? (y/n)
    

This controls how many jobs Nextflow can run at once. If jobs keep failing or hanging, try lowering this value later.

  • Yes: Enter a lower number if you’re working with limited HPC resources

  • No: Keep the default of 200 if you’re unsure

  5. Select job scheduler (executor)

    Which executor would you like to use? (choose 1-4)
    
    1) slurm
    2) lsf
    3) sge
    4) local
    #?
    

Select the job scheduler used by your HPC system:

  • slurm, lsf, or sge: Choose based on what your HPC uses

  • local: Use this only if you’re running everything on a single server without a scheduler

If you have not activated a Conda environment before starting the installation, miniconda will be installed. You will see the following message and be prompted to provide a location for your new Conda environment.

conda not found, installing miniconda.

Enter dependency installation root directory (default: home directory)

The script will now complete the installation of the remaining third-party software.

Note

Approximately 150,000-160,000 files (depending on the external software versions at the time of installation) will be installed in your Conda environment.

  6. Once the script finishes, you may be prompted to log out and back in again to apply the changes.

Your system is now ready to import and run the biomodal duet pipeline.

HPC set up recommendations#

Caution

This section contains information intended for system administrators

Overview#

When running the duet pipeline on HPC clusters, it’s important to configure your environment correctly to avoid job failures and ensure optimal performance. This section outlines recommended settings for managing temporary directories, scratch space, memory and CPU limits, and scheduler-specific configurations.

These shell environment settings/variables should be added to your cluster environment configuration files, such as .bashrc, .bash_profile, or .profile (or similar), to ensure they are set correctly when jobs are submitted on your cluster.

Note

Defining these variables within your startup scripts sets them for the respective user, and so could impact other pipelines using Apptainer/Singularity. Please consult with your cluster administrator before making these changes.

If not using the duet software for an extended period of time, it is recommended to remove or comment out these lines until resuming analysis with the duet software pipeline.

Important

Always validate changes by running biomodal test before processing full datasets.

Confirm with your system administrator if you’re unsure about your scheduler’s default behaviour.

Allocating sufficient space for temporary files#

The pipeline (especially when using Singularity or Apptainer containers) creates large temporary files in /tmp. If your cluster has limited space in this directory, jobs may fail due to insufficient disk space. We recommend the following:

  • Bind a larger, custom temporary directory in your nextflow.config:

    singularity {
      enabled = true
      autoMounts = true
      runOptions = "--bind /scratch/tmp:/tmp"
    }
    

    This tells Singularity: "Use /scratch/tmp for your temp files instead of the default /tmp."

  • Alternatively, use environment variables for flexibility:

    singularity {
      enabled = true
      autoMounts = true
      envWhitelist = "TMPDIR,SINGULARITY_TMPDIR,SINGULARITY_CACHEDIR,NXF_SINGULARITY_CACHEDIR"
      runOptions = '--bind "$TMPDIR:/tmp"'
    }
    
  • Set the following in your shell environment:

    export TMPDIR=/your/larger/tmp
    export SINGULARITY_TMPDIR=$TMPDIR
    
  • (Optional) to manage where containers are downloaded and stored, define:

    export SINGULARITY_CACHEDIR=/your/cache/dir
    export NXF_SINGULARITY_CACHEDIR=$SINGULARITY_CACHEDIR
    
  • Ensure that this cache directory matches the libraryDir setting in nextflow.config.

Additionally, try adding wider cluster options to the process section in nextflow.config if you are still experiencing issues related to temporary files:

process {
  executor = "slurm"
  containerOptions = "--bind /<your local tmp path>:/tmp"
}

Important

We strongly recommend that both TMPDIR and SINGULARITY_TMPDIR environment variables are set on the head/controller node before any jobs are submitted. This should ensure that the temporary files are written to the correct location.

  • If you don’t do this, and your /tmp directory fills up, you may see an error like this:

    Error in tempfile() using template /tmp/parXXXXX.par:
    Parent directory does not exist
    
    Error in tempfile() using template /<your local tmp path>/parXXXXX.par:
    Parent directory (/<your local tmp path>/) does not exist at /venv/bin/parallel
    

Please note that the variable names may change slightly depending on whether you are using Singularity or Apptainer:

Singularity variable

Apptainer variable

SINGULARITY_CACHEDIR

APPTAINER_CACHEDIR

SINGULARITYENV_TMPDIR

APPTAINERENV_TMPDIR

SINGULARITYENV_NXF_TASK_WORKDIR

APPTAINERENV_NXF_TASK_WORKDIR

SINGULARITYENV_NXF_DEBUG

APPTAINERENV_NXF_DEBUG

Ensure your Apptainer installation includes a Singularity symlink in your $PATH and add these variables to .bashrc or equivalent config files.
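A minimal sketch of the corresponding shell start-up additions for an Apptainer-based cluster, assuming placeholder scratch paths under /scratch/$USER (adjust to your own storage locations):

# Example .bashrc additions for an Apptainer-based cluster (placeholder paths)
export TMPDIR=/scratch/$USER/tmp
export APPTAINERENV_TMPDIR=$TMPDIR
export APPTAINER_CACHEDIR=/scratch/$USER/container-cache
export NXF_SINGULARITY_CACHEDIR=$APPTAINER_CACHEDIR
mkdir -p "$TMPDIR" "$APPTAINER_CACHEDIR"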

Per-task or CPU memory reservation for LSF#

The default LSF cluster configuration works in per-core memory limit mode, so it divides the requested memory by the number of requested CPUs. If your LSF cluster is configured differently, it is recommended that you add the following settings to the LSF executor section of the nextflow.config file in the biomodal script folder.

executor {
  name = "lsf"
  perTaskReserve = false
  perJobMemLimit = true
}

LSF Executor scratch space#

You can dynamically use LSF scratch space per job using the following settings in the nextflow.config file in the biomodal script folder.

process {
  executor = "lsf"
  scratch = "$SCRATCHDIR/$LSB_JOBID"
}

This ensures temporary files are written to isolated, job-specific directories and cleaned up after job completion.

Wall-time and minimum CPU settings#

In some clusters, it is recommended to set a “wall-time”, i.e. the max time a job can run before it is terminated. There may also be a “minimum CPU” and/or queue requirement to start jobs in your cluster. You can set these in your nextflow.config:

process {
  executor = "slurm"
  memory = "16GB"
  cpus = 4
  nodes = 2
  time = "24h"
  params.cpuNum = 1
  queue = "default"
}

Adjust executor, time, queue, and cpuNum to match your local scheduler and policies.

Set queue, CPU, RAM and DISK per module#

The duet pipeline supports fine-grained resource customisation per module using withName: in the process section. This allows you to allocate additional resources for memory- or CPU-intensive steps. For example, you can set the CPU, RAM, or DISK requirements for the BWA_MEM2 and PRELUDE modules as follows:

process {
  // Default settings for all modules
  cpus = 16
  memory = "16GB"
  queue = "default"
  time = "24h"
  disk = "800GB"

  // Specifically overriding settings for a module
  withName: 'BWA_MEM2' {
    memory = "64GB"
    queue = "high-mem"
    time = "48h"
  }

  withName: 'PRELUDE' {
    cpus = 32
    memory = "32GB"
    queue = "large-cpu"
  }
}

Use this to align specific pipeline steps with available HPC queues and avoid resource contention or throttling.

Memory settings for SGE/OGE/UGE clusters#

If you’re on a cluster that doesn’t support memory flags like h_rt, h_rss, or mem_free, you can pass memory requirements via clusterOptions.

Option 1: Apply globally

process {
  cpus = 16
  memory = "16GB"
  queue = "short"
  time = "24h"
  disk = "800GB"
  clusterOptions = { "-l h_vmem=${task.memory.toString().replaceAll(/[\sB]/,'')}" }
}

Option 2: Apply per module

withName: 'BWA_MEM2' {
  cpus = 32
  clusterOptions = "-l h_vmem=64GB"
  memory = null
  queue = "long"
  time = "48h"
}

If your cluster uses per-core rather than per-job memory limits, divide the total memory by the number of CPUs when setting h_vmem. For example, a job requesting 64GB across 32 CPUs would set -l h_vmem=2G.

Manual Installation for HPC users (Advanced)#

Caution

This section contains information intended for system administrators

You can opt to run a manual install of the CLI and Nextflow config files if preferred. This section explains how to manually install and configure the biomodal CLI and Nextflow settings if you’re not using the biomodal-hpc-utils-admin or biomodal-hpc-utils-conda script.

Important

This setup is only intended for system administrators or advanced users who are comfortable managing configuration files and working with HPC environments.

Step 1: Copy the scripts

Follow the instructions in Manual update of the CLI and pipeline to copy the scripts, make them executable, and create the error configuration files in your preferred biomodal CLI installation directory (e.g. /biomodal).

Step 2: Create the config.yaml file

The CLI requires a configuration file named config.yaml to define where to install and run the pipeline. Create this file in your chosen biomodal CLI installation directory, with the following content:

platform: hpc
duet_version: 1.4.2
ref_pipeline_version: 1.0.1
bucket_url: <path to your input/output data storage>
work_dir: <path to a local or shared scratch directory>
init_folder: <path where the duet pipeline software will be installed>
biomodal_registry: <see table below>
biomodal_api_baseurl: https://app.biomodal.com
biomodal_releases_bucket: gs://cegx-releases
biomodal_reference_bucket: gs://release-reference-files
share_metrics: <see table below>
error_strategy: FailFast

Key information:

  • bucket_url: Directory or bucket where input/output data is stored

  • work_dir: A fast, temporary directory used during execution. It should have ~3x the input data size available

  • share_metrics: If set to true, events and the duet pipeline metrics report will be shared with biomodal after a successful analysis run. For more information, please see Sharing event & metric data with biomodal

  • init_folder: The location where the duet pipeline and workflow tools will be stored

  • biomodal_registry: Choose one based on your region:

    • asia-docker.pkg.dev/cegx-releases/asia-prod

    • europe-docker.pkg.dev/cegx-releases/eu-prod

    • us-docker.pkg.dev/cegx-releases/us-prod
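As a hedged illustration, the file could be created as follows for a Slurm-based HPC install; every path below is a placeholder and the registry shown is the European one, so adjust both to your own system and region:

# Write an example config.yaml (placeholder paths; values mirror the template above)
cat > /biomodal/config.yaml <<'EOF'
platform: hpc
duet_version: 1.4.2
ref_pipeline_version: 1.0.1
bucket_url: /biomodal/data_bucket
work_dir: /scratch/biomodal/work
init_folder: /biomodal/software
biomodal_registry: europe-docker.pkg.dev/cegx-releases/eu-prod
biomodal_api_baseurl: https://app.biomodal.com
biomodal_releases_bucket: gs://cegx-releases
biomodal_reference_bucket: gs://release-reference-files
share_metrics: true
error_strategy: FailFast
EOF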

Step 3: Create the nextflow.config file

This file is required to tell Nextflow how to execute jobs on your system and where to find container images. This local configuration file is applied after any settings defined in the standard duet pipeline configuration files.

Nextflow will automatically cascade the settings in the duet pipeline configurations, followed by the nextflow.config in the versioned biomodal script folder within the init_folder directory (see table above). This ensures that your customised settings are always retained in the nextflow.config file in the biomodal script folder when upgrading to any new releases of the duet pipeline.

Create nextflow.config in the biomodal CLI installation directory with the following content:

singularity {
  enabled    = true
  autoMounts = true
  libraryDir = "<starts with same path as 'bucket_url' in config.yaml> + '/singularity-images'"
}
params {
  registry = "<see below>"
}
process {
  executor = "slurm"  // or lsf, sge, or local
}
report {
  overwrite = true
}

Choose one registry based on your region:

  • asia-docker.pkg.dev/cegx-releases/asia-prod

  • europe-docker.pkg.dev/cegx-releases/eu-prod

  • us-docker.pkg.dev/cegx-releases/us-prod

Set the process section according to your HPC job scheduler (select local if no HPC scheduler is present), using one of the following:

process {
  executor = "slurm"
}
process {
  executor = "lsf"
}
process {
  executor       = "sge"
  process.penv   = "smp"
  clusterOptions = "-S /bin/bash"
  //If your system does not support h_rt/h_rss/mem_free settings,
  //you can try to use the following settings:
  //clusterOptions = { "-l h_vmem=${task.memory.toString().replaceAll(/[\sB]/,'')}" }
}
process {
  executor = "local"
}

These settings are layered on top of the duet pipeline’s own configuration and will remain active across pipeline upgrades. For more information, see HPC set up recommendations.

Using the CLI#

This section walks you through how to use the biomodal CLI, from running commands to importing the pipeline. It also includes input file requirements, troubleshooting tips, and details on expected outputs.

Important

Please read through this entire section before starting your analysis and save it for future reference.

Command structure and usage#

biomodal

This will display:

biomodal v1.1.3

usage: biomodal [SUBCOMMAND] [OPTION]...

Key CLI commands:

  • auth: Login to biomodal

  • init: Import the duet pipeline, reference files, Docker images, and test data

  • test: Perform a test run using the provided dataset

  • analyse: Run the duet analysis pipeline on your data

  • validate: Check your parameters before launching a run

  • call_dmr: Run the D(h)MR analysis

  • reference subcommands: Manage reference genome data and pipelines

  • report: Share a metrics report with biomodal

Each command has parameters and options. Use the biomodal command for more details.

Input file requirements#

The following sections explain how to prepare your input data and directory structure.

Inputs

  • An optional metadata file can be used in .csv format detailing samples used in the experiment

  • A folder named nf-input containing the FASTQ files to be run in the pipeline. The FASTQ files require a specific format for:

    • File names

    • Sequence identifier

    • Inferred instrument type

These inputs are used to create a Read Group used during alignment. For more information, please see FASTQ filename format.

File structure#

The --input-path folder provided for the duet pipeline must contain the nf-input folder (and the metadata .csv file). Your input directory should look like this:

my_input_path
├── biomodal_run_meta.csv
└── nf-input
   ├── Sample-1_S1_L001_R1_001.fastq.gz
   └── Sample-2_S1_L001_R1_001.fastq.gz

Note

For the D(h)MR workflow, please note that the --input-path parameter should point to the full path of the relevant Zarr stores.

For more information on the D(h)MR workflow, see D(h)MR calling workflow.

FASTQ filename format#

FASTQ filenames must follow this pattern, which satisfies the naming convention used in BaseSpace:

{sample-id}_{sample-number}_{lane}_{R1|R2}_001.fastq.gz

Rules:

  • No underscores in sample-id

  • Sample IDs should not start with a number

  • Must include the sample-number field (even if ignored)

Example: two samples, each pooled across two lanes

CEG900-123-678_S1_L001_R1_001.fastq.gz

CEG900-123-678_S1_L001_R2_001.fastq.gz

CEG900-123-678_S1_L002_R1_001.fastq.gz

CEG900-123-678_S1_L002_R2_001.fastq.gz

CEG900-123-679_S2_L001_R1_001.fastq.gz

CEG900-123-679_S2_L001_R2_001.fastq.gz

CEG900-123-679_S2_L002_R1_001.fastq.gz

CEG900-123-679_S2_L002_R2_001.fastq.gz

FASTQ sequence identifier requirements#

The format of the sequence identifiers in FASTQ files can differ depending upon the software that was used to generate them. The pipeline requires the read identifiers in FASTQ files to comply with the format of the following example:

@A00536:706:HJNLHDRX3:1:2101:2718:1031 1:N:0:ACGGAACA+ACGAGAAC

The following list describes each of the identifiers and their purpose:

  • A00536: Instrument name

  • 706: Run ID

  • HJNLHDRX3: Flowcell ID

  • 1: Flowcell lane

  • 2101: Tile number

  • 2718: x-coordinate

  • 1031: y-coordinate

  • 1: Member of a pair

  • N: Filtered

  • ACGGAACA+ACGAGAAC: Index sequences

Inferred instrument type#

The pipeline will infer the instrument type from the instrument name extracted from the first read in each FASTQ file based on the following convention:

  • Read IDs beginning with @A: NovaSeq6000

  • Read IDs beginning with @LH: NovaSeqX

  • Read IDs beginning with @VL: NextSeq1000

  • Read IDs beginning with @VH: NextSeq2000

  • Others: Unknown

Use of the extracted flowcell ID and flowcell lane#

The sample ID extracted from the FASTQ filename, and the flowcell ID and flowcell lane extracted from the read ID of the first read in the FASTQ file will be used to construct a Read Group which will get passed into the aligner during the alignment step.

The following example shows the Read Group (@RG) generated for sample ID CEG900-123-678, flowcell ID HJNLHDRX3 and flowcell lane 1:

@RG ID:HJNLHDRX3.1 PL:ILLUMINA PU:HJNLHDRX3.1.CEG900-123-678
LB:CEG900-123-678 SM:CEG900-123-678

Metadata file#

The pipeline can use an optional metadata file in .csv format, with one column per sample. The remaining cells in row 1 should contain sample IDs, and the remaining cells in column A should contain metadata field names. There are no requirements about what these field names are, nor the quantity of them.

The pipeline assumes that this file will be encoded in ASCII and that fields and field names will not contain any commas.

  • Must be named in the --meta-file argument

  • Must be placed in the same folder as the FASTQ files

The filename can be anything, but the --meta-file argument must exactly match the name of the metadata file. Please only use the metadata filename, not the full path.

The metadata file does not control which samples are analysed in the duet pipeline; all samples in the nf-input directory in your --input-path location will be processed.

Optional metadata file in .csv format example:

sample_id,CEG900-123-678,CEG900-123-679,CEG900-123-680,CEG900-123-681
sample-condition,case,control,case,control
sample-sex,male,male,female,female
lab-technician,JM,JM,ST,ST
project-id,X001,X001,X002,X002
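For example, if this metadata file is saved as biomodal_run_meta.csv alongside the FASTQ files, it is referenced by filename only; the paths and run name below are placeholders, and the remaining analyse options are described later in this guide:

biomodal analyse \
  --input-path /data/my_input_path \
  --output-path /data/my_input_path \
  --meta-file biomodal_run_meta.csv \
  --run-name my-first-run \
  --mode 5bp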

Before you start#

Avoid session timeouts in long analysis runs#

To avoid session timeouts during long analysis runs, use a terminal multiplexer. For more information, please see Preventing timeout.

Check processing and storage requirements#

For more information on the minimum requirements for each module in the pipeline, please see Module requirements.

Please check the minimum resource requirements to ensure you allocate sufficient CPU, RAM, and disk space depending on your chosen analysis profile. Your cloud tenancy or HPC cluster should be able to allocate a set of VMs/nodes with these resources.

You can edit resource requirements by referencing specific modules in the nextflow.config file in the biomodal script folder to accommodate available hardware resources. Please see examples in the following sections for more information on how to increase or decrease the resource requirements. Please note that the duet pipeline is less likely to run successfully if the minimum resources are too restrictive.

If you have any questions, please contact support@biomodal.com.

Elevated resource profiles#

Each module file in the duet pipeline contains default resource settings in relation to CPUs, memory, and disk. These settings are intended to be sufficient for shallow sequencing runs where you have up to 50 million reads per sample per lane.

On deeper sequencing runs, you should invoke an elevated resource profile, which will allocate additional resources to jobs. For the deepest sequencing runs in UMI-mode, it is also necessary to invoke a CLI parameter that will distribute the duplicate identification step across intervals/contigs. The parameters to use are described below:

  • Up to 50 million reads per sample per lane: default profile; no additional parameter

  • Up to 500 million reads per sample per lane: deep_seq profile; add --additional-profile deep_seq

  • Over 500 million reads per sample per lane: super_seq profile; add --additional-profile super_seq

If using UMIs and very deep sequencing (>2B reads):

--additional-params dedup_by_contigs=true

Important

Invoking any of these profiles will require additional CPU, RAM, and disk resources to the duet pipeline. Please ensure you have sufficient storage and credit prior to starting any runs.
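As a hedged example, a very deep UMI run might combine the elevated profile with the distributed deduplication parameter as shown below; the paths and run name are placeholders, and the remaining options follow the analyse examples elsewhere in this guide:

biomodal analyse \
  --input-path /data/deep_run \
  --output-path /data/deep_run \
  --run-name deep-umi-run \
  --additional-profile super_seq \
  --additional-params dedup_by_contigs=true \
  --mode 5bp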

Limit local resources#

If you have to limit the duet pipeline CPU and RAM resources for the local executor on a single machine, you can use the cpu_num and memory parameters in conjunction with the local or local_deep_seq profiles.

Note

Please note this only works with the local executor, so this will not have any effect if you are using HPC or cloud-based executors.

Example for constrained local machine (limited to 32 CPUs and 128GB RAM):

--additional-profile local_deep_seq --additional-params cpu_num=32,memory="128GB"

Import the pipeline with biomodal init#

Before you start#

Once your cloud or HPC environment is set up, you can import the duet pipeline and all required resources using the biomodal init command.

To get started, run the following:

biomodal init

This command will:

  • Download the latest version of the duet pipeline and D(h)MR workflow (for more information on the D(h)MR workflow, see here)

  • Import all required reference files

  • Import the biomodal test dataset (RunXYZ)

  • Download and cache the required Docker or Singularity containers

  • Prepare the system for a successful test run and future analysis

  • Prompt you to enable metrics sharing

Important

Ensure you have authentication set up with your cloud provider if using GCP, AWS, or Azure. Terraform must also be configured if you are using the biomodal-cloud-utils setup method.

Important

Please do not attempt to run biomodal test or biomodal analyse if you experienced any problems during the bootstrapping or biomodal init process.

biomodal init must always be successfully completed after a fresh install and after any manual update of the CLI and/or duet pipeline.

Please review Elevated resource profiles to ensure you allocate sufficient hardware resources relative to the number of reads per sample before running the duet pipeline.

Running the biomodal init command#

During biomodal init, you will be prompted to:

  1. Enable event and metrics sharing with biomodal

    Would you like to share the pipeline metrics report at
    the end of all successful analysis runs with biomodal? (1-2)
    
    1) Yes
    2) No
    #?
    

If enabled, performance metrics are shared to support ongoing improvements and help biomodal troubleshoot issues more effectively. No sample data is shared. For more information, please see Sharing event and metric data with biomodal.

  2. Specify the duet pipeline installation location

    The duet software location 'init_folder' property is not
    currently defined in the configuration file
    
    Would you like to set it to /<home_folder>? (1-2)
    
    1) Yes
    2) No
    #?
    

This sets where the CLI and pipeline components will be stored.

  3. Choose an error handling strategy

    • Normal: Retry failed jobs up to 10 times

    • FailFast: Stop early if jobs fail

    Please select which duet pipeline error strategy you would like to use:
    
    1) Normal
    2) FailFast
    #?
    

At this stage you will see the following downloads:

  • Genome reference data (unless already downloaded)

  • Container images (unless already downloaded)

  • Pipeline software

  • Small test dataset

Once complete, the CLI will download all files and verify the setup. If any downloads are interrupted, you can clear the partial data and re-run biomodal init.

You’re now ready to proceed with a test run to confirm everything is working correctly.

Expected output#

Once you start a run (e.g. with biomodal test), you’ll see an output like this:

N E X T F L O W ~ version 24.10.6
Launching /opt/apps/biomodal/latest/biomodal-duet-1.4.2/main.nf [boring_pasteur] DSL2 - revision: 65b941077a
executor > slurm (21)
[- ] resolve_align:FASTQC_RAW -
[1e/69d214] resolve_align:PRELUDE (CEG9330132-19-01, L001) [100%] 1 of 1 ✔
[- ] resolve_align:FASTQC_RESOLVED -
[b8/c8c72b] resolve_align:BWA_MEM2 (CEG9330132-19-01, L001) [100%] 1 of 1 ✔
[- ] resolve_align:SAMTOOLS_MERGE_LANES -
[- ] resolve_align:PRESEQ -
[d5/ec7739] bamlet_v2:ONE_STEP_BAMLET (CEG9330132-19-01) [100%] 1 of 1 ✔
[6e/df7d5c] bamlet_v2:QUALIMAP_BAMQC (CEG9330132-19-01) [100%] 1 of 1 ✔
[80/db515c] bam_stats_genome:TSS_BIAS (CEG9330132-19-01) [100%] 1 of 1 ✔
[- ] bam_stats_genome:DEEPTOOLS_PLOTFINGERPRINT -
[3a/292078] bam_stats_genome:PICARD_GCBIAS (CEG9330132-19-01) [100%] 1 of 1 ✔
[42/ff20ec] bam_stats_long_ctrl:LONG_CTRL_GENETIC_ACCURACY (CEG9330132-19-01) [100%] 1 of 1 ✔
[e2/880eac] variant_calling:HAPLOTYPE_CALLER (CEG9330132-19-01) [100%] 1 of 1 ✔
[- ] variant_calling:ASM -
[20/d45b69] mbias:GENOME_MBIAS (CEG9330132-19-01-CG-5-prime) [100%] 1 of 1 ✔
[08/442714] epiquant_v2_genome:EPIQUANT_BAM2ZARR (CEG9330132-19-01-CG) [100%] 1 of 1 ✔
[9f/594e09] epiquant_v2_genome:MODALITY_EXPORT (CEG9330132-19-01_CG) [100%] 1 of 1 ✔
[77/374a22] epiquant_v2_genome:JOIN_ZARR_STORES (CEGX_RunXYZ-genome-CG) [100%] 1 of 1 ✔
[b8/89e666] epiquant_v2_short_ctrl:EPIQUANT_BAM2ZARR (CEG9330132-19-01-CG) [100%] 1 of 1 ✔
[4e/73c13b] epiquant_v2_short_ctrl:MODALITY_EXPORT (CEG9330132-19-01_CG) [100%] 1 of 1 ✔
[- ] epiquant_v2_short_ctrl:JOIN_ZARR_STORES -
[8f/35d181] epiquant_v2_long_ctrl:EPIQUANT_BAM2ZARR (CEG9330132-19-01-CG) [100%] 1 of 1 ✔
[c4/74201b] epiquant_v2_long_ctrl:MODALITY_EXPORT (CEG9330132-19-01_CG) [100%] 1 of 1 ✔
[1c/575553] epiquant_v2_long_ctrl:JOIN_ZARR_STORES (CEGX_RunXYZ-long_ctrl-CG) [100%] 1 of 1 ✔
[e0/6821c2] summary:DQSPY (CEGX_RunXYZ) [100%] 1 of 1 ✔
[31/f8e5ed] summary:PIPELINE_REPORT (CEGX_RunXYZ) [100%] 1 of 1 ✔
[2b/9cc002] summary:DQSREPORT [100%] 1 of 1 ✔
[ad/d278e7] summary:MULTIQC (CEGX_RunXYZ) [100%] 1 of 1 ✔
Completed at: 15-Apr-2025 15:52:10
Duration : 33m 51s
CPU hours : 5.3
Succeeded : 21

Output directories overview#

Outputs from the Nextflow pipeline get published to your cloud storage bucket or output directory, organised into a directory structure.

Top-level:

gs://my-org-nextflow/test-runxyz

Subdirectories:

  • nf-input: FASTQ input files. FASTQ files should be copied into this directory prior to launching the pipeline. Please note that all FASTQ files undergo analysis regardless of the sample info in the meta file.

  • nf-work: Intermediate files and logs. This contains logs and temp data, organised by Nextflow job ID, and is useful for debugging.

  • nf-results: Final analysis outputs. This is the directory where the outputs of the pipeline are stored, organised by pipeline run, sample, and pipeline module. This directory is described further below.

nf-work folder#

This is the working directory for Nextflow where logs and files staged for downstream processes are stored. There will also be a directory that includes the pipeline version and the tags that you set when launching the pipeline.

The directory structure will comprise hashes which match those displayed in Nextflow and which uniquely identify specific jobs launched by Nextflow. For example:

gs://my-org-nextflow/test-runxyz/nf-work/duet-1-0.1.0_2021-05-15_1656_5bp/01/af2b9a7434a6ca435b96c6b84cb9a2

This directory is useful for debugging the pipeline, examining logs and viewing the STDOUT from jobs. Inside this directory there will be subdirectories associated with each pipeline execution.

If the pipeline is running smoothly on your runs, you will rarely need to look in the nf-work directory. If a pipeline has completed successfully and you have no intention of resuming it with modified settings or examining logs, you can delete the contents of the associated subdirectory inside the nf-work directory.

nf-results folder#

  • reports: Sample-level and multi-sample reports summarising information about the samples and controls

  • sample_outputs: Primary data files generated by the pipeline (described in more detail below)

  • controls: BAM files and quantification files associated with the methylated lambda and unmethylated pUC19 controls. These small files are analogous to the BAM files and quantification files generated for your samples and may be useful for familiarising yourself with the file formats. Note that there is an accompanying FASTA file for the controls in the reference file directory with the following name/location: ss_ctrls/v24/ss-ctrls-long-v24.fa.gz

  • diagnostics: Secondary outputs from the pipeline, including a parameters log recording the parameters that were used to execute the pipeline, more extensive metrics to support more detailed data investigations, and the interim resolved FASTQ files that were passed into the aligner

Key outputs#

  • reports/summary_reports: aggregate QC reports of all samples

  • reports/sample_reports: HTML QC reports per sample

  • sample_outputs/bams: Deduplicated BAM files

  • sample_outputs/modc_quantification: quantification files reporting modified cytosines

  • sample_outputs/variant_call_files: VCF files

For a more detailed overview of the duet pipeline outputs and explanation of the formats, please see the Bioinformatics data interpretation guide.

Removing data from the temporary pipeline folders#

The temporary pipeline folders created by the duet pipeline can be removed upon successful completion of the pipeline run. If a pipeline has completed successfully and you have no intention of resuming it with modified settings or examining logs, you can delete the contents of the associated subdirectory inside the nf-work directory. Please see Output directories overview for more details about the pipeline folder structure.
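As an illustration, assuming a run wrote its output to the placeholder path /data/my_run, completed successfully, and will not be resumed, the intermediate work files could be removed with:

# Delete intermediate Nextflow work files for a completed run (placeholder path)
rm -rf /data/my_run/nf-work/*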

Checking your set up: Performing test runs#

Before analysing your own data, it’s important to confirm that the pipeline is correctly installed and that your environment has been set up properly. The test runs below ensure that:

  • The duet pipeline and CLI are working as expected

  • Reference files and containers are accessible

  • Your compute environment has appropriate resources and permissions

  • Any issues with configuration, dependencies, or file paths are identified early

Run biomodal test#

Important

You should run the following tests immediately after completing biomodal init. If the test fails, do not proceed with analysing your own data.

After biomodal init completes without errors, run the following to verify your setup:

biomodal test

Example output at the end of the test run:

Completed at: 10-Mar-2023 12:51:33
Duration : 45m 15s
CPU hours : 10.0
Succeeded : 21

If the test fails:

  • Check your internet connection, especially if you’re using cloud resources

  • Verify file paths in your config.yaml are correct and accessible

  • Check authentication: rerun biomodal auth if credentials may have expired

  • Inspect logs in the nf-work directory for errors

Reach out to support@biomodal.com with the test output and any relevant log files.

Run an example analysis with a test dataset (RunXYZ)#

The RunXYZ dataset is a small, pre-configured test dataset provided by biomodal as part of the pipeline installation. It includes a minimal set of sample files and metadata to simulate a typical pipeline run. Running this test helps you verify that:

  • Your environment (local or cloud/HPC) is configured correctly

  • Required dependencies (e.g. Docker/Singularity, Nextflow) are working

  • Reference data and container images are accessible

  • Your CLI config (e.g. config.yaml) and nextflow.config are valid

  • File paths and permissions are correctly set up

This test typically takes 30-60 minutes depending on your environment. It requires significantly fewer resources than a full pipeline run and does not generate large intermediate or final files.

To run the test, enter the following command:

biomodal analyse \
  --input-path /biomodal/data_bucket/test-runxyz \
  --meta-file CEGX_Run_meta.csv \
  --output-path /biomodal/data_bucket/test-runxyz \
  --run-name CEGX_RunXYZ \
  --tag CEGX_RunXYZ \
  --additional-params lib_prefix=CEG9330132-19-01_S12_L001 \
  --mode 5bp

Once completed, check the nf-results directory for output files and reports. This confirms that your system is ready to analyse your own data.

Please note that this manual test command allows the test analysis to be resumed if it is interrupted.

Important

Make sure biomodal init has completed before running this command. If the dataset wasn’t downloaded or configured correctly, re-run biomodal init.

Test with larger GIAB dataset (Optional)#

To simulate real-world conditions, biomodal recommends testing the pipeline with a larger publicly available duet dataset from Genome In A Bottle (GIAB).

Important

This test will incur additional runtime and/or storage costs, particularly if run on cloud infrastructure. Ensure that your cloud environment has sufficient resources and budget allocated before starting the GIAB test.

  1. Download the GIAB dataset (either +modC or evoC) which can be found here: https://biomodal.com/kb_articles/demo-data-instructions/

  2. Run the pipeline:

biomodal analyse \
  --input-path  <path_to_giab_data> \
  --meta-file CEGX_Run_meta.csv \
  --output-path <path_to_output> \
  --additional-profile deep_seq \
  --tag giab_demo_data \
  --mode <5bp|6bp>

The mode selected should be 5bp or 6bp, depending on the demo data selected.

If you have any issues, please contact support@biomodal.com.

Troubleshooting and optimisation#

This section outlines common issues and performance optimisation strategies. If you have any questions or need further help, please reach out to support@biomodal.com.

Reduce disk usage#

The output footprint of a full pipeline run is typically ~1.4x the size of the input FASTQ files. You can reduce this by:

Disable publishing of resolved FASTQs#

Resolved FASTQ files are large (>50% of output) and not required for downstream analysis. To avoid publishing them:

--additional-params publish_by_default.prelude=false

Output CRAMs instead of BAMs#

The duet pipeline outputs sequence alignments as BAM files by default but offers the option to output genome sequence alignments as CRAM files. CRAMs will be output together with a reference genome FASTA (which is required to read CRAMs). CRAMs save ~40% disk space compared to BAMs. To publish CRAMs:

--additional-params publish_crams=true
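Both disk-saving options can be combined in a single analyse invocation, for example (placeholder paths; other options as described earlier in this guide):

biomodal analyse \
  --input-path /data/my_run \
  --output-path /data/my_run \
  --mode 5bp \
  --additional-params publish_by_default.prelude=false,publish_crams=true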

Resuming a disrupted pipeline#

If your run was interrupted, rerun the exact same command with:

--resume

This might be useful if your terminal session is accidentally terminated, such as by a break in network connectivity. Please ensure that you use the same parameters as the previous pipeline when resuming.
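For example, if the RunXYZ test analysis shown earlier was interrupted, the same command is repeated with --resume appended:

biomodal analyse \
  --input-path /biomodal/data_bucket/test-runxyz \
  --meta-file CEGX_Run_meta.csv \
  --output-path /biomodal/data_bucket/test-runxyz \
  --run-name CEGX_RunXYZ \
  --tag CEGX_RunXYZ \
  --additional-params lib_prefix=CEG9330132-19-01_S12_L001 \
  --mode 5bp \
  --resume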

Low GCP CPU quotas#

If using GCP with limited CPU quotas (e.g. free trial accounts), please contact support@biomodal.com as a custom resource profile may be needed.

Orchestration VM#

Your orchestration VM only needs to run when launching pipelines. You can stop it from the cloud provider's console when inactive, and start it again when you need to launch another analysis run.

Updating the biomodal pipeline#

This section explains how to check for new versions of the duet pipeline and update both the CLI and D(h)MR workflow. For more information on the D(h)MR calling workflow, please see here.

Important

Updating the pipeline may require super-user admin privileges depending on your system setup. Please check with your system administrator if you are unsure.

Using the CLI to update the pipeline#

To check for available duet pipeline versions, run the following:

biomodal list

This will list all available versions of the duet pipeline software. Please be aware that some may be older than your current local version.

To install a newer version of the duet pipeline software please run:

biomodal download <version>

This updates containers, reference files, and test data for the selected version. For more information, please see Import the pipeline with biomodal init.

Note

This will overwrite the local files for the specified version. Other installed versions will not be affected.

You can also check for available updates with:

biomodal info

To directly update to the latest version of duet, and change your local settings, you can run:

biomodal init

This downloads any updated containers, reference files, and test data for that version. For more information, please see Import the pipeline with biomodal init.

Manual update of the CLI and pipeline#

You can manually update to CLI 1.1.3 and duet 1.4.2, or newer:

  1. Backup your installation folder

  2. Download and unzip the new CLI version (See Download the installation files)

  3. Copy the following files from ./cli into your local installation directory:

    _biomodal_functions.sh
    _biomodal_validate.sh
    biomodal
    
  4. Update failfast.error.config and retry.error.config with new settings.

HPC version

failfast.error.config:

process {
  errorStrategy = { sleep(Math.pow(2, task.attempt) as long); return task.exitStatus in [14] ? 'retry' : 'terminate' }
}

retry.error.config:

process {
  errorStrategy = { task.attempt.toString().isInteger() ? sleep(Math.pow(2, task.attempt) as long) : sleep(1) ; return task.exitStatus in [14] ? 'retry' : 'ignore' }
  maxRetries    = 10
}

Cloud version

failfast.error.config:

process {
  errorStrategy = { task.attempt.toString().isInteger() ? sleep(Math.pow(2, task.attempt) as long) : sleep(1) ; return task.exitStatus in [0,10,14,50001,50002,50003,50004,50005,50006] ? 'retry' : 'terminate' }
  maxRetries    = 1
}

retry.error.config:

process {
  errorStrategy = { task.attempt.toString().isInteger() ? sleep(Math.pow(2, task.attempt) as long) : sleep(1) ; return task.exitStatus in [0,10,14,50001,50002,50003,50004,50005,50006] ? 'retry' : 'ignore' }
  maxRetries    = 10
}

  5. Make the scripts executable:

chmod +x biomodal _biomodal_functions.sh _biomodal_validate.sh

  6. Download the latest duet version:

biomodal download 1.4.2

Alternatively, run biomodal init to update your local preferences and verify that you have all the required container images and reference files.

  7. Edit your config.yaml to add:

ref_pipeline_version: 1.0.1

  8. Add the following to nextflow.config:

report {
  overwrite = true
}

  9. Run a test to validate the update:

biomodal test

You should now be ready to begin using the updated software.

Advanced analysis features#

CHG and CHH modification calling#

By default, the pipeline calls modified cytosines in CpG context only. However, the pipeline includes the capability to call modified cytosines at CHG and CHH sites (where H is the IUPAC disambiguation code representing a DNA nucleotide other than G). To invoke CHG and CHH modification calling, add the following flag when launching the pipeline:

--chg-chh-contexts true

Mapping Quality Filtering#

By default, the duet pipeline discards reads from the BAM file with a MAPQ (Mapping Quality) score of 0. However, you can adjust the stringency of this filter as needed.

To include all reads regardless of mapping quality:

--additional-params bamlet.min_map_qual=0

To retain only reads in the BAM with MAPQ >=20, thereby discarding lower-confidence mappings:

--additional-params bamlet.min_map_qual=20

The pipeline also supports more stringent MAPQ filtering specifically for reads used in epigenetic quantification. For example, to keep reads in the BAM with MAPQ >=1 but restrict epigenetic quantification to reads with MAPQ >=20:

--additional-params bamlet.min_map_qual=1,epiquant.min_map_qual=20

Epigenetic Quantification Options#

Epigenetic Quantification Output Formats#

In addition to the Zarr store, the duet pipeline can produce four types of plain-text epigenetic quantification outputs: bedmethyl, cxreport, bismark, or bedgraph. The default is to produce cxreport, but any combination can be requested. To produce cxreport and bismark:

--quantification-output cxreport,bismark

Limit epigenetic quantification to primary chromosomes#

By default, the duet pipeline will quantify epigenetics at each CG context present on contigs included in the primary assembly BED (if the species has an available primary assembly BED). In the GRCh38 human reference this includes unlocalised scaffolds (e.g. chr1_KI270706v1_random). To exclude these scaffolds from epigenetic quantification (note they will still be present in the BAM):

--additional-params epiquant.primary_chromosomes_only=true

Include SNVs in CG coverage#

By default, the pipeline counts total coverage at a reference CG position as the number of Cs (modified or unmodified) at a given position. However, you can optionally include A, G and T base calls in total coverage:

--additional-params modality.export.only_c_in_coverage=false

Note that the total coverage in methylation quantification outputs is used as the denominator when calculating the percentage methylated/hydroxymethylated.

Output uncovered CG contexts#

By default any reference CG contexts which are uncovered in a given sample will be omitted from the plain-text quantification outputs. To include zero-coverage CG contexts in quantification outputs:

--additional-params modality.export.skip_rows_zero_counts=false

Variant Calling and Variant Associated Methylation#

The duet pipeline contains modes for calling germline variants using GATK HaplotypeCaller and for calling somatic variants using GATK Mutect2.

Additionally, an Allele Specific Methylation (ASM) module can be activated which will evaluate each heterozygous germline variant for the presence of methylation differences between the two alleles.

Germline Variant Calling#

By default, germline variant calling is performed on each sample individually using GATK HaplotypeCaller, generating a separate VCF file for each sample.

Germline variant calling can be disabled by setting the following parameter when launching the pipeline:

--additional-params call_germline_variants=false

Joint Variant Calling#

The pipeline also supports joint variant calling across multiple samples. Joint variant calling can be activated by adding the following parameter when launching the pipeline:

--use-gvcf true

This will cause the GATK HaplotypeCaller to generate per-sample gVCF files (instead of VCF files) which will then be consolidated via GATK GenomicsDBImport followed by joint genotyping using GATK GenotypeGVCFs.

Allele Specific Methylation#

The Allele Specific Methylation (ASM) module can be activated to evaluate each heterozygous germline variant for the presence of methylation differences between the two alleles. The ASM module is disabled by default, but can be activated by adding the following parameter when launching the pipeline:

--compute-asm true

To perform ASM calling, variants will be ‘atomized’ using bcftools norm so that complex variants are decomposed - i.e. MNVs will be split into consecutive SNVs. An ASM file will be generated for each sample.

If the --compute-asm true parameter is combined with the --use-gvcf true parameter described above, an ASM file will still be generated for each sample, but ASM calling will be performed using the atomized outputs from joint variant calling. A subsequent step will combine per-sample ASM calls into a single file.

Somatic Variant Calling#

The pipeline also supports somatic variant calling which is performed using GATK Mutect2 followed by GATK-recommended filtering using FilterMutectCalls. This is performed in the “tumour-only” mode of Mutect2 (i.e. it is not run using pairing of a tumour / matched normal).

Somatic variant calling can be activated by adding the following parameter when launching the pipeline.

--additional-params call_somatic_variants=true

The filtering of somatic variants using FilterMutectCalls happens by default if somatic variant calling is activated, but filtering can be disabled by adding the following parameter when launching the pipeline:

--additional-params filter_somatic_variants=false

Subsampling#

If you wish to subsample the reads in your samples prior to processing (for example, to ensure equal numbers of reads in each sample for inter-library comparison purposes), add the following parameter, indicating the maximum number of reads that you want to retain on a single lane for a single sample:

--additional-params subsample=50000000

Important

This subsampling is performed per sample, per lane, so in the above example, if there were 4 lanes, then this would limit the processing to 50M reads per sample per lane, so 200M reads in total per sample.

Alternative Supported Reference Genomes#

By default, the duet pipeline will align against a reference file containing the human reference genome GRCh38 and the sequences of the spike-in controls supplied with the kit. It is also possible to run the pipeline against other available reference genomes (e.g. mouse), first by downloading the preferred reference data from biomodal:

biomodal reference list
biomodal reference download GRCm38p6

Then the following flag can be used when launching the pipeline:

--reference-genome GRCm38p6

Alternatively you can create your own reference genome by following the instructions in the make reference genome documentation.

Supported Mouse References#

The duet pipeline supports three versions of the mouse reference: GRCm38, GRCm38p6, and GRCm39.

GRCm38 (2012): original GRCm38 with additional patch fix contigs, resulting in a large reference (>12 Gbp) requiring more resources to process data. Only use this if your study requires detailed patch-level alignment. Note that reads aligning to contigs which are not present in the primary assembly BED (including patch fix contigs) will be discarded by the duet pipeline.

GRCm38p6 (2017): final patch version of GRCm38, incorporating cumulative patch updates into the primary assembly. Use this reference for studies requiring compatibility with legacy GRCm38 tools and datasets.

GRCm39 (2020): latest mouse genome assembly without legacy patches.

Processing individual samples#

If compute resources are constrained, you may need to process individual samples in the duet pipeline; analysing each sample separately requires less CPU and RAM. Use the lib_prefix parameter to specify the prefix of the library to process, adding it to the --additional-params parameter of your biomodal analyse command as in this example:

--additional-params lib_prefix=SAMPLE1_L001
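
If you need to work through several libraries one at a time, a simple shell loop can launch one pipeline run per library prefix. This is a sketch only; the prefixes, paths, tag, and mode below are placeholders:

# Launch one duet pipeline run per library prefix (placeholder values)
for prefix in SAMPLE1_L001 SAMPLE2_L001 SAMPLE3_L001; do
  biomodal analyse \
    --input-path <path_to_fastqs> \
    --meta-file CEGX_Run_meta.csv \
    --output-path <path_to_output>/${prefix} \
    --tag ${prefix} \
    --mode 5bp \
    --additional-params lib_prefix=${prefix}
done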

Enable generation of the FASTQC report#

By default, the FastQC report is not generated by the duet pipeline. To enable generation of the report on your raw FASTQ files, please add the following parameter to biomodal analyse when launching the pipeline:

--additional-params run_by_default.fastqc.raw=true

Target enrichment#

Overview#

The biomodal duet pipeline includes a mode for processing data generated using targeted panels. The analysis pathways for processing targeted libraries are slightly different because the pipeline calculates targeted metrics and makes use of additional reference files describing the bait and target regions associated with the target panel being used.

Processing data using a supported target panel#

Relevant reference files for the following target panels are available by default:

  • Twist Pan-cancer Methylation Panel

  • Twist Human Methylome Panel

A targeted analysis for either of these panels can be achieved by passing the following parameters to the biomodal analyse command.

Parameter        Description
targeted         Set to true
targeted-panel   For the Twist Alliance Pan-cancer Methylation Panel set to: twist_cancer. For the Twist Human Methylome Panel set to: twist_methylome

Here is an example biomodal analyse command with these additional parameters passed in:

biomodal analyse \
  --input-path [path_for_input_files] \
  --output-path [path_for_output_files] \
  --meta-file [meta_file_name] \
  --run-name [name_for_run] \
  --tag [date_time_or_other_tag] \
  --additional-profile deep_seq \
  --targeted true \
  --targeted-panel twist_cancer

For any other target panels, it is still possible to perform a targeted analysis, but it is necessary to generate relevant reference files first. The remaining instructions describe how to generate the necessary reference file for alternative target panels and how to launch the pipeline in these circumstances.

Processing data using alternative target panels#

Three additional files are required when running the duet pipeline in targeted mode with alternative target panels. These are the bait and target interval files and a BED file containing the target coordinates. The target and bait intervals are used by gatk CollectHsMetrics to calculate target metrics including target coverage and target enrichment. The BED file is used to filter methylation and variant calls down to the targeted regions of interest.

Generating bait and target interval files#

Bait and target interval files are generated from BED files containing bait and target coordinates with the tool picard BedToIntervalList as shown below:

# Make probes intervals
picard BedToIntervalList I=probes.bed \
O=reference_probe.intervals \
SD=reference.dict

# Make targets intervals
picard BedToIntervalList I=targets.bed \
O=reference_targets.intervals \
SD=reference.dict

Note that this command should be run independently of the biomodal duet multiomics solution CLI in an environment with picard available.

This process requires an additional genome dict file which is created from a FASTA file with your reference genome of interest:

picard CreateSequenceDictionary \
R=reference.fasta \
O=reference.dict

If you are working with the reference files provided with the pipeline, then you can obtain a genome dict file from the following locations in the reference file directory:

Human: duet/duet-ref-0.1.0/reference_genomes/GRCh38Decoy/GRCh38Decoy.dict

Mouse: duet/duet-ref-0.1.0/reference_genomes/GRCm38p6/GRCm38p6.dict

Launching the duet pipeline in targeted mode#

You should use the same profiles and commands when launching the duet pipeline in targeted mode as you would in non-targeted mode, except that additional parameters and file paths need to be supplied as key-value pairs using the --additional-params option when running the pipeline in targeted mode with UMIs.

Parameter           Description
targeted            Set to true
targeted_bait       The path to the bait interval file generated from the steps described above
targeted_target     The path to the target interval file generated from the steps described above
targeted_bait_bed   The path to the BED file with bait coordinates

These additional parameters should be supplied as key-value pairs to the biomodal analyse command when launching the pipeline by using the --additional-params option, as in the following example:

biomodal analyse \
  --input-path [path_for_input_files] \
  --output-path [path_for_output_files] \
  --meta-file [meta_file_name] \
  --run-name [name_for_run] \
  --tag [date_time_or_other_tag] \
  --additional-profile deep_seq \
  --targeted true \
  --additional-params targeted_bait=[path_to_file],targeted_target=[path_to_file],targeted_bait_bed=[path_to_file]

Automation#

You can automate the auth and init stages of the biomodal CLI for CI and test purposes.

“biomodal auth” stage#

When running a biomodal auth command you can optionally pass in username and password credentials using the command line flags below:

auth                                Login to biomodal
   -u | --username <email>           username
   -p | --password <password>        password
   --password-stdin                  Take the password from stdin (recommended)

Usernames and passwords can also be passed in as the environment variables BIOMODAL_USERNAME and BIOMODAL_PASSWORD. Environment variables will always take precedence over values supplied on the command line, so please bear this in mind if using this method.
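
For example, in a CI script the credentials can be supplied on stdin or via environment variables; the email address and the MY_SECRET_PASSWORD variable below are placeholders:

# Option 1: pass the password on stdin (recommended)
echo "$MY_SECRET_PASSWORD" | biomodal auth --username user@example.com --password-stdin

# Option 2: let the CLI read the credentials from environment variables
export BIOMODAL_USERNAME=user@example.com
export BIOMODAL_PASSWORD="$MY_SECRET_PASSWORD"
biomodal auth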

“biomodal init” stage#

When re-running the biomodal init command you can skip the prompts for setting your temp directory and init folder location by setting the environment variable IGNORE_INIT_PROMPTS=true. Please note that you have to either edit the config file or follow the prompts the first time the script is run.

You can also do this by adding ignore_init_prompts: true to your config.yaml file.

You can also skip biomodal CLI prompts for error strategy by setting an error strategy in your config.yaml file.

Normally you’ll be prompted to choose either a normal or failfast error strategy. If you set either of these values in your config.yaml file, the biomodal CLI will skip this prompt:

error_strategy: <normal|failfast>

Alternatively you can set the environment variable ERROR_STRATEGY to normal or failfast.

Finally you can also define whether to share events and metrics reports by either setting the SHARE_METRICS environment variable to true or false, or by setting the value directly in the config.yaml file:

share_metrics: <true|false>

If neither of these is set, the biomodal CLI will require user input to set these values.
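
Putting these together, a config.yaml prepared for non-interactive use might contain entries such as the following (the values shown are illustrative):

# config.yaml entries for non-interactive (CI) use
ignore_init_prompts: true
error_strategy: normal
share_metrics: false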

Note

Settings in the config.yaml file take precedence over environment variables. If both are set but their values differ, you may experience unexpected results.

Reference genome pipeline#

You can download the reference genome pipeline from the biomodal CLI. The reference genome pipeline is used to create new reference genomes for the duet pipeline. It is a separate pipeline from the duet pipeline and is not required to run the duet pipeline using the default reference genomes provided by biomodal.

Checking available reference pipeline versions#

The reference pipeline is not downloaded by default during a new biomodal software install. To view the list of compatible reference pipeline versions that can be downloaded in the next step, run the following command:

biomodal reference pipeline list

Reference pipeline download#

This step will download and set up the necessary files and scripts to run the reference pipeline.

To download a new version of the reference genome pipeline, run the following command:

biomodal reference pipeline download <version>

The version number should be the version you want to download. You can find the available versions by running the biomodal reference pipeline list command.

Please note that if you have already downloaded a version of the reference genome pipeline, downloading an update of the duet pipeline will also download updates to the reference pipeline should this be required.

Downloading this pipeline requires a new parameter entry, ref_pipeline_version, in the config.yaml file. This will be added automatically when you download the reference pipeline. Please note that the reference pipeline version is not the same as the duet pipeline version.

Make new reference genomes#

Running the reference genome pipeline#

To make new reference genomes using the reference genome pipeline, run the following command:

biomodal reference make \
  --input-path <your-input-path> \
  --output-path <your-output-path> \
  --species <reference-genome-species> \
  --reference-genome <reference-genome-official-name>

Parameter            Description
--input-path         Path for the reference genome gzipped FASTA file
--output-path        Custom output disk location or bucket url
--species            Required: reference species, e.g. Homo_sapiens, Mus_musculus
--reference-genome   Genome Reference Consortium official name, e.g. GRCh38Decoy, hs38DH

Please note that all parameters are required to run this pipeline.

The reference pipeline will not write directly to the central duet pipeline reference location, but will write to a staging location specified by the --output-path parameter. This enables manual verification of the output before moving it to the reference location, and it prevents the pipeline from needing write access to the reference location.

Reference genome input and output files requirements#

Reference genome input file requirements#

To run the pipeline on a new species, it is necessary to generate new reference files by running the reference file pipeline. This takes a gzipped FASTA file for the reference genome.

With the exception of the new FASTA reference genome file, all of the other required files are supplied with the duet pipeline and are assumed to have been downloaded and unpackaged according to the instructions supplied. These files will be automatically detected when the Alternative Reference Genome Pipeline is run.

The reference genome for the species required should be obtained from an appropriate source, e.g. the Genome Reference Consortium or by searching the Ensembl or EnsemblPlants genomes database.

When you have downloaded the gzipped FASTA file, please provide the full location and filename in the --input-path parameter. The pipeline will automatically detect the file and use it to generate the reference genome.

Some further notes about preparing the FASTA reference genome:

  • Some sources supply reference genomes in FASTA format but not gzipped, for example inside a zip bundle with some metadata files. In these cases, you will need to gzip your FASTA file before you can use it as an input to the reference genome pipeline by running gzip /path/to/my/reference.fasta

  • The contig names inside the FASTA file (the lines that start with the > character) are important since they are used inside the duet pipeline. Sometimes these contigs have long names; so that the duet pipeline can run effectively, you will need to either rename these contigs to match those you add to the genome config file or supply a mapping from the long names to the names you supply in the config. For example, your contig name might be >NC_044371.1 Cannabis sativa chromosome 1, cs10, whole genome shotgun sequence and you might want to rename it to >chr1 so that you can just add chr1 to the list of contigs in the genome config file (a renaming sketch is shown after this list).

  • Make sure the contigs in the reference FASTA contain what you expect:

    • Not all reference downloads contain the mitochondrial chromosome, which might be important for your analysis.

    • There may be other contigs that are needed for your reference that are downloaded separately, such as the chloroplasts for a plant genome.

    • Many references contain unmapped scaffold contigs; these may be important but should probably be excluded from the list of contigs you supply in the genome config file.

    • You can inspect the contigs by running grep ">" /path/to/my/reference.fasta.
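
As an illustration of the renaming step mentioned above, the following sketch truncates each FASTA header to its first whitespace-separated token and then renames a specific accession; the accession, chromosome name, and file names are placeholders for your own genome:

# Truncate headers to their first token, then rename one accession to a short chromosome name
zcat reference.fasta.gz \
  | sed -E 's/^(>[^[:space:]]+).*/\1/' \
  | sed -e 's/^>NC_044371\.1$/>chr1/' \
  | gzip > reference.renamed.fasta.gz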

Reference genome output file locations#

The output files will be stored in the location specified by the --output-path parameter above. Here you will find the following top level sub-directories:

Location                                Description
reference_genomes/<reference-genome>    Reference genome output
ss_ctrls/<ctrls>                        Controls output

In the reference genome output location, the files are:

Location                                     Description
<reference-genome>-ss-ctrls-<ctrls>.fa.gz    Combined reference genome and controls FASTA
<reference-genome>-ss-ctrls-<ctrls>.fa.fai   Combined reference genome and controls FASTA index
<reference-genome>.bed                       BED file containing reference genome contig definitions
<reference-genome>.chrom.sizes               List of reference genome contig sizes
<reference-genome>.chrom.txt                 List of reference genome contig names
<reference-genome>-ss-ctrls-<ctrls>.dict     List of both reference genome and controls contigs in dictionary format
<reference-genome>_primary_assembly.bed      Primary assembly BED for the reference genome. This will be exactly the same as the reference genome BED, except for human, mouse or zebrafish

After the pipeline has run successfully, the reference genome files, together with a set of existing couplet q-table files and controls, should be copied to the shared reference genomes folder where the existing genomes are held. This is the same location as the default reference genomes provided when you installed the duet pipeline (<default bucket/folder location of your pipeline data>/reference_files/). Example using a GCP storage bucket:

# Set the relevant reference genome version
REF_VERSION=1.0.1

# Copy the reference genome files to the shared reference genomes folder
gcloud storage cp --recursive \
 "gs://bucket-name/reference-output-folder/reference_genomes/GRCz11/*" \
 gs://bucket-name/reference_files/${REF_VERSION}_GRCz11/duet/duet-ref-${REF_VERSION}/reference_genomes/GRCz11/

# Copy existing couplet qtables to the new reference genome folder
gcloud storage cp --recursive \
 gs://bucket-name/reference_files/${REF_VERSION}_GRCh38Decoy/duet/couplet \
 gs://bucket-name/reference_files/${REF_VERSION}_GRCz11/duet/

# Copy existing controls to the new reference genome folder
gcloud storage cp --recursive \
 gs://bucket-name/reference_files/${REF_VERSION}_GRCh38Decoy/duet/duet-ref-${REF_VERSION}/ss_ctrls \
 gs://bucket-name/reference_files/${REF_VERSION}_GRCz11/duet/duet-ref-${REF_VERSION}/

The controls output files are:

Location                         Description
ss-ctrls-long-v24.bed            BED file containing long control contig definitions
ss-ctrls-long-v24.chrom.txt      List of long control contig names
ss-ctrls-long-v24.chrom.sizes    List of long control contig sizes
ss-ctrls-short-v24.bed           BED file containing short control contig definitions
ss-ctrls-short-v24.chrom.sizes   List of short control contig sizes
ss-ctrls-short-v24.chrom.txt     List of short control contig names

Note that the controls output files will be exactly the same as those already provided with the duet pipeline.

The original input files are also copied to the reference genome/controls subdirectory, as applicable.

Reference genome config file#

Once the pipeline has been run successfully and the outputs checked, the reference genome directory (i.e. <output-path>/reference_genomes/GRCz11) should be copied to the reference_genomes folder where the existing genomes are held.

The CLI will generate a config file template. The alternative reference can be used by providing a set of specified fields to the pipeline in the correct format, for example:

params {
  genomes {
    '<<ref_genome>>' {
        contigs          =    ''
        mapping          =    false
        autosomes        =    ''
        mitochondria     =    ''
        primary_assembly =    true
        primary_assembly_bed = "<<ref_genome_path>>/duet/duet-ref-<<ref_pipeline_version>>/reference_genomes/<<ref_genome>>/<<ref_genome>>_primary_assembly.bed"
        target_intervals =    false
        species          =    '<<species>>'
    }
  }
}

This reference genome template file should be edited to include the correct sections for the new reference genome. For any new reference genome, you will need to define these parameters as follows:

Field

Description, Usage and Optionality

contigs

A list of chromosomes. This is used: (1) during methylation quantification as a specification of the primary chromosomes over which summary methylation quantification metrics should be calculated: mean CpG methylation rates will be calculated for each chromosome in this list; (2) during variant calling as a specification of the ‘large’ primary assembly contigs. This information is used to support parallelisation of variant calling processes: variant calling will be performed separately in parallel for chromosomes in this list. Variant calling on any other contigs (such as alt contigs or decoy sequences) will be performed in one further parallel step.

mapping

A regular expression that can be specified to combine methylation quantification summary statistics from contigs that have a common prefix. If the regular expression matches a contig name, then quantification statistics from that contig will be attributed to the first matching group returned when the regular expression is applied to the contig name. For example, the regular expression "([^_]+)_.+" (i.e. multiple characters up to, but not including, the first underscore) when applied to the contig chr1_KI270706v1_random would return chr1 as the first matching group, so would cause quantification data associated with chr1_KI270706v1_random to be attributed to chr1.

autosomes

A list of the autosomes. This is used during methylation quantification to specify the chromosomes that should be used when calculating genome-wide methylation levels. This excludes the allosomes as they would cause a sex-bias in the calculation of genome-wide methylation levels.

mitochondria

The name of the mitochondrial chromosome in the reference genome. This is used during methylation quantification to identify the contig over which the mitochondrial methylation rate will be calculated.

primary_assembly

A Boolean field indicating whether there is a separate subset of contigs that are considered to be the ‘primary assembly’. This should be set to “false” as a distinct primary assembly is not currently generated for alternative reference genomes.

species

Binomial name for the species.

primary_assembly_bed

The path to a primary assembly bed file if relevant. Usage is described above.

<<ref_genome>>

Reference Genome Consortium name supplied when launching the reference pipeline

<<species>>

Species supplied when launching the reference pipeline

<<ref_genome_path>>

Automatically applied when launching the reference pipeline. Location of the final reference genome files. This is the shared location expected by the duet pipeline

<<ref_pipeline_version>>

Automatically applied when launching the reference pipeline, additionally stored in the CLI config.yaml file. This is set during the biomodal reference pipeline download stage

The file will be in the same data-bucket location as other CLI results and will be named after the chosen reference genome. For example, the file will be Ptrichocarpa_v4-1.config if you launched the pipeline with the reference genome Ptrichocarpa_v4-1.

Module hardware requirements#

Please see the complete overview of module requirements for the reference pipeline.

The module MAKE_BWA_MEM2_INDEX may require a large amount of memory but can be run on a single core. The BWA_MEM2 documentation recommends allocating about 28x the genome size. For example, the human genome is around 3 gigabases and therefore requires 3 x 28 = ~84 GB RAM. Note that this figure needs to include any additional decoy contigs that are part of the reference FASTA you are using.
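
If you already have a FASTA index (.fai) for the combined reference (including any decoy contigs), you can estimate this requirement from the total sequence length; a minimal sketch, assuming a standard .fai file where column 2 is the contig length:

# Estimate BWA-MEM2 indexing memory as ~28x the total genome size (in GB)
awk '{ total += $2 } END { printf "Approximate memory needed: %.0f GB\n", total * 28 / 1e9 }' reference.fa.fai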

You may have to add a resource section in the nextflow.config file in the biomodal script folder to specify the hardware requirements for the MAKE_BWA_MEM2_INDEX module in the reference pipeline.

For example:

process {
  withName: 'MAKE_BWA_MEM2_INDEX' {
    cpus   = 1
    memory = "320GB"
  }
}

Using new reference genomes in the duet pipeline#

To use this new reference genome with the biomodal duet +modC / evoC multiomics solution bioinformatics pipeline, please include the following parameters in the biomodal analyse command:

--reference-genome <reference-genome>
--reference-genome-profile <reference-genome-profile-file-path>

The --reference-genome parameter should be set to the name of the reference genome you have created. The --reference-genome-profile parameter should be set to the path of the reference genome profile you have created. The default location of this file is the output directory you specified for the biomodal reference make command.
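
For example, a biomodal analyse command using a newly created zebrafish reference might look like the following; the genome name, profile path, and other values are placeholders:

biomodal analyse \
  --input-path [path_for_input_files] \
  --output-path [path_for_output_files] \
  --meta-file [meta_file_name] \
  --tag [date_time_or_other_tag] \
  --mode 5bp \
  --reference-genome GRCz11 \
  --reference-genome-profile [path_to_reference_output]/GRCz11.config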

D(h)MR calling workflow#

Overview#

  • The D(h)MR calling workflow is used to identify differentially methylated or differentially hydroxymethylated regions between groups of samples analysed using the duet pipeline

  • The D(h)MR calling workflow uses the Zarr store output from the duet pipeline as an input. It can also take multiple Zarr stores from different executions of the duet pipeline (e.g. from processing different sequencing runs) as an input

  • The D(h)MR calling workflow requires a sample sheet listing the samples to be used for D(h)MR calling, the condition associated with each sample, and information about covariates that should be taken into account

  • Samples are grouped by condition and DMR calling is performed between all pairwise combinations of conditions

  • A typical example might be to call DMRs between a set of cancer samples and a set of healthy samples, accounting for sex and age as covariates

  • With duet +modC, DMRs are called based on modC (modC is the union of mC and hmC modifications)

  • With duet evoC, mC is distinguishable from hmC, and DMR calling is performed separately for mC modifications and for hmC modifications. It is also possible to call DMRs based on modC - i.e. on the union of mC and hmC calls

Running the D(h)MR calling workflow#

The D(h)MR calling workflow is run using the biomodal call_dmr command:

call_dmr                               Run D(h)MR analysis
--input-path <input_path>              Required: full custom bucket uri or directory path with Zarr stores to be analysed
--dmr-sample-sheet <sample_sheet_name> Required: full path and name of the DMR sample sheet
--output-path <output_path>            Required: custom output disk location or bucket uri
--tag <tag>                            Required: custom tag to identify this DMR run
--mode <5bp|6bp>                       Required: 5 base vs 6 base mode, default is '5bp'
--condition <condition(s)>             Required: a single DMR sample sheet column that contains the conditions between which to call DMRs
--covariates <covariates>              Optional: one or more DMR sample sheet columns that contain covariates to account for during DMR calling
--dmr-bed-path <dmr_bed_path>          Optional: a path to a bed file defining regions that DMR calling should be restricted to
--evoc-modifications <mc|hmc|modc>     Optional: single or comma separated list of duet evoC modifications to call, 'mc' or 'hmc' in 6bp mode only
--min-depth <min_depth>                Optional: contexts will only be removed if coverage <= min-depth in ALL SAMPLES, default is '0'
--window-size <window_size>            Optional: window size for DMR analysis, default is '1000'
--additional-params <params>           Optional: additional single or comma separated parameters, e.g. 'param1=value1,param2=value2'
--run-name <run_name>                  Optional: custom run name to identify this analysis run
--resume                               Optional: resume previous run using successfully completed tasks, default is 'false'
--work-dir                             Optional: override the default path where the workflow temporary data is stored

Small example:

biomodal call_dmr \
  --input-path gs://my-bucket-name/zarr_files/ \
  --output-path gs://my-bucket-name/zarr_files/output \
  --dmr-sample-sheet gs://my-bucket-name/zarr_files/dmr_sample_sheet.tsv \
  --dmr-bed-path gs://my-bucket-name/zarr_files/my_bedfile.bed \
  --mode 5bp \
  --condition disease_state \
  --tag dmr_run1 \
  --run-name dmr_run1

D(h)MR calling workflow input and output file requirements#

D(h)MR calling workflow input file and parameters#

Path to directory containing your Zarr file(s)

  • This should be a path to a directory containing the multi-sample Zarr store that was generated by running the duet pipeline, i.e. the sample_outputs/zarr_store/ subdirectory of the duet pipeline outputs

  • This can also be a directory where you have brought together several multi-sample Zarr stores generated from running the duet pipeline (for instance if you have separately processed multiple sequencing runs)

  • The Zarr store is a multi-sample, multi-dimensional, compressed, indexed data store.

Single type of multi-sample Zarr store for each workflow run

Please ensure that the input path includes only Zarr stores of a single cytosine context (CG, CHG, or CHH) before running the call_dmr workflow. Separate Zarr stores of different cytosine contexts into distinct folders or paths as needed.

DMR sample sheet

  • The DMR sample sheet should be a tsv file with a header, and with one row per sample that you want included in the DMR calling

  • The columns should include a sample_id column, a column featuring the conditions between which you wish to call DMRs, and (optionally) columns for any covariates associated with your samples. DMR calling will be performed between each pairwise comparison of conditions

  • Please note that group cannot be used as a column name as this is a reserved word

  • Here is an example of a sample sheet that would be suitable for calling DMRs between healthy and cancer samples, accounting for sex and age as covariates:

sample_id    disease_state   sex    age
BIO-A1       healthy         M      62
BIO-A2       healthy         M      58
BIO-A3       healthy         F      66
BIO-A4       cancer_stage1   F      66
BIO-A5       cancer_stage1   M      65
BIO-A6       cancer_stage1   M      55
BIO-A7       cancer_stage4   F      61
BIO-A8       cancer_stage4   F      59
  • Using the above example with the parameters --condition disease_state --covariates sex,age, DMR calling would be performed sequentially for “healthy vs cancer_stage1”, “healthy vs cancer_stage4”, and “cancer_stage1 vs cancer_stage4”, using “sex” and “age” as covariates
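
For example, a call_dmr command using this sample sheet with covariates might look like the following (the bucket paths are placeholders):

biomodal call_dmr \
  --input-path gs://my-bucket-name/zarr_files/ \
  --output-path gs://my-bucket-name/zarr_files/output \
  --dmr-sample-sheet gs://my-bucket-name/zarr_files/dmr_sample_sheet.tsv \
  --mode 6bp \
  --condition disease_state \
  --covariates sex,age \
  --tag dmr_covariates_run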

DMR bed path

  • This should be a path to a BED file containing regions that DMR calling should be performed on. DMR calling will be performed between each condition for each region listed in the BED file. --dmr-bed-path cannot be used in conjunction with --window-size. For instance, you may wish to restrict DMR calling to a set of promoters that you are interested in

  • BED format is zero-based and half-open, meaning the first position on a contig is 0, and the end coordinate is excluded from the interval.

    Note

    Some methylation quantification outputs from the duet pipeline (e.g. cxreport) have 1-based positions.

    chr1        1000171         1001137
    chr1        1019119         1020119
    chr1        1172879         1173879
    chr1        1232030         1232236
    chr1        1279435         1280435
    chr1        10031831        10032831
    chr1        10209569        10210569
    chr1        10397591        10398591
    

If you wish to perform Differentially Methylated Cytosine (DMC) analysis, i.e. single-CpG regions, you will need to pass a BED file in which the regions are single CpGs, for example:

Note

The input DMR BED cannot contain more than 5 million intervals.

chr1  10031845        10031846
chr1  10031846        10031847
chr1  10031915        10031916
chr1  10031916        10031917
chr1  10031917        10031918
chr1  10031918        10031919
chr1  10031925        10031926
chr1  10031926        10031927
chr1  10031947        10031948
chr1  10031948        10031949
chr1  10031949        10031950
chr1  10031950        10031951

Window size

  • If a dmr-bed-path is not supplied, DMRs will be called genome-wide. The genome will be split into window-size bp (default = 1000) regions to perform DMR analysis on.

    Note

    If a DMR BED is supplied, window-size will be ignored

  • Reducing the window size for genome-wide DMR calling increases the runtime and memory requirement. The minimum supported window size is 200bp

evoC modifications

  • In 5bp mode (duet +modC) DMRs can only be called on modC. In 6bp mode (duet evoC) DMRs will be called on mC and hmC separately by default. However, it is possible to call DMRs on any combination of mC, hmC or combined modC

  • To call hmC DMRs only at enhancer regions use --dmr-bed-path full-path/to-your/enhancer-intervals.bed --evoc-modifications hmc

  • To call mC DMRs only at promoter regions use --dmr-bed-path full-path/to-your/promoter-intervals.bed --evoc-modifications mc

  • To call mC, hmC, and combined modC DMRs on 500bp regions genomewide use --window-size 500 --evoc-modifications mc,hmc,modc

Min Depth

If --min-depth is supplied, CG contexts will be removed only if coverage <= min-depth in ALL SAMPLES. Note: this is stranded CG coverage

D(h)MR calling workflow output file formats and locations#

D(h)MR output BED

Field

Description

Chromosome

Intervals over which DMR is called

Start

Start position of the interval

End

End position of the interval

Name

Reserved for future use

Score

Reserved for future use

Strand

Reserved for future use

num_contexts

Number of CpG contexts in the interval

mean_coverage_across_samples

Mean CpG coverage across all samples

mean_mod_group_1

Mean modification fraction in group 1

mean_mod_group_2

Mean modification fraction in group 2

mod_fold_change

Fold change between group 1 and group 2

mod_difference

Difference between mean_mod_group_1 and mean_mod_group_2

test_statistic

Difference between the log likelihood of the full model (with group specified) and the null model (without group specified)

dmr_pvalue

Unadjusted p-value

dmr_qvalue

Adjusted p-value (Benjamini-Hochberg)

If the above DMR sample sheet was analysed in 6bp mode, we would expect the following outputs to be generated in the --output-path directory:

DMR_num_mc_healthy_cancer_stage1.bed
DMR_num_mc_healthy_cancer_stage4.bed
DMR_num_mc_cancer_stage1_cancer_stage4.bed
DMR_num_hmc_healthy_cancer_stage1.bed
DMR_num_hmc_healthy_cancer_stage4.bed
DMR_num_hmc_cancer_stage1_cancer_stage4.bed

D(h)MR output summary

An additional summary.txt file is generated per output BED file which lists metadata about the DMR run, including the version of the biomodal software package used, the input files, and the full commands used. It also reports the number of DMRs that pass a q < 0.05 significance threshold.

Appendix#

Event codes shared with biomodal#

Event code   Optional?   Description
100          No          Downloaded the biomodal CLI package, via the biomodal API
101          Yes         Completed running the biomodal init command
102          Yes         Completed running the biomodal test command
103          Yes         Completed running the biomodal analyse command
104          Yes         Completed running the biomodal dmr_call command
105          Yes         Completed running the biomodal reference make command
200          Yes         Customer preferred to automatically share events and metrics reports. Recorded once during the biomodal init step
201          Yes         Customer preferred to not share events and metrics reports automatically. Recorded once during the biomodal init step
202          No          User from the biomodal CRM system has registered for a software download and authentication account, via biomodal API

Additional available workflows (advanced users only)#

The nf-core community has several publicly available Nextflow pipelines. The nf-core pipelines are designed to be easy to use and are well documented. You can find a list of the nf-core pipelines at https://nf-co.re/pipelines.

nf-core pipelines may have a similar technical setup to your pipeline execution environment, so it is worth checking whether there are relevant Nextflow pipeline settings and parameters you can apply to your Nextflow environment.

Glossary of terms#

Apptainer#

An open-source container platform designed for high-performance computing (HPC) environments, enabling users to package entire scientific workflows including software, libraries, and data, into a portable format.

BAM file (Binary Alignment Map)#

A binary version of a SAM file that stores aligned sequencing reads. Commonly used for efficient storage and analysis of sequence alignment data.

BaseSpace#

A cloud-based platform developed by Illumina for storing, analysing, and sharing genomic data generated by Illumina sequencers.

Command Line Interface (CLI)#

A text-based interface used to interact with software and operating systems by typing commands.

Conda#

An open-source package and environment management system that helps install and manage software dependencies, particularly in Python and R-based projects.

CPU (Central Processing Unit)#

The part of a computer that performs most of the processing. In bioinformatics, CPU power is important for running compute-intensive tasks.

CRAM file#

A compressed version of a BAM file that reduces file size by using reference-based compression. Used for efficient storage of sequencing alignment data.

D(h)MR (Differentially [hydroxy]Methylated Region)#

Genomic regions that show differences in DNA methylation or hydroxymethylation levels between samples or conditions.

Docker#

A platform that uses container technology to package software and its dependencies into isolated environments, enabling reproducible execution across systems.

DISK#

Refers to the amount of disk storage space required or allocated for computational tasks, such as storing large sequencing files.

FASTA file#

A text-based format for representing nucleotide or protein sequences. Each entry starts with a header line followed by the sequence.

FASTQ file#

A text format that stores both biological sequences (usually from next-generation sequencing) and corresponding quality scores.

GCP (Google Cloud Platform)#

A suite of cloud computing services by Google, often used for running scalable bioinformatics analyses and managing data.

High-Performance Computing (HPC)#

A computing environment that uses clusters of powerful computers to perform large-scale or resource-intensive computations.

JAVA#

A widely used programming language and computing platform often required by tools in bioinformatics pipelines, such as GATK.

Linux#

An open-source operating system commonly used in bioinformatics and server environments due to its stability and flexibility.

Nextflow#

A workflow management system designed for reproducible and scalable data analysis pipelines. Supports execution on local, cluster, and cloud environments.

nf-core pipelines#

Community-curated, standardised, and reproducible Nextflow pipelines for a wide range of bioinformatics analyses.

Orchestration VM#

A virtual machine responsible for managing or coordinating the execution of workflows and resources, especially in cloud-based analyses.

RAM (Random Access Memory)#

A type of computer memory that stores working data and code. Larger RAM allows more data to be processed simultaneously, which is critical for many bioinformatics tasks.

Screen#

A terminal multiplexer that allows multiple command-line sessions to be managed within a single terminal window. Useful for long-running processes.

Singularity#

The predecessor to Apptainer, Singularity is a container technology optimised for HPC and scientific computing environments.

Terraform#

An infrastructure-as-code tool that automates the provisioning and management of cloud infrastructure.

tmux (Terminal Multiplexer)#

Like Screen, tmux lets users run multiple terminal sessions in one window, detach and reattach to them, and manage long-running jobs.

TSV file (Tab-Separated Values)#

A plain text format for storing data in a tabular form, where each field is separated by a tab character.

VCF file (Variant Call Format)#

A standardised text file format for storing gene sequence variations, such as SNPs and indels, detected in sequencing data.

Virtual Machine (VM)#

A software-based emulation of a physical computer that runs an operating system and applications in an isolated environment.

Wall-time#

The actual elapsed time a task takes to complete from start to finish, as perceived by the user, regardless of CPU time used.

Zarr file#

A format for storing large N-dimensional arrays in a compressed, chunked manner. Useful in genomics for storing large datasets like 3D imaging or genomics matrices.