duet software installation and running guide#
For analysis of libraries sequenced on an Illumina platform and prepared using:
biomodal duet multiomics solution +modC
biomodal duet multiomics solution evoC
This documentation is compatible with duet pipeline v1.4
Warning
For Research Use Only. Not for use in diagnostic procedures.
Getting started with the biomodal duet pipeline#
Overview#
The biomodal duet multiomics pipeline is a bioinformatics workflow designed to analyse FASTQ files produced by next generation sequencing of the libraries generated using biomodal duet multiomics solutions. Before running the pipeline, you need to install the biomodal Command Line Interface (CLI) on a cloud platform or High-Performance Computing (HPC) cluster.
About this guide#
This guide walks you through:
Downloading the required files
Installing the CLI
Preparing your compute environment
Using the CLI
Importing the pipeline
Updating the pipeline
Important
The biomodal duet multiomics pipeline runs in a Linux environment and we recommend the use of a recent version of Ubuntu or CentOS/Rocky/RHEL. It is recommended that setup and installation be performed by your system administrator (i.e., the person responsible for managing your organisation’s IT systems.)
Sections that need to be completed by a system administrator are marked with the following callout:
Caution
This section contains information intended for system administrators
Once the setup is complete, you’ll be able to run your analyses with minimal technical involvement. If you need support with any of the steps or installation, please contact us at support@biomodal.com.
What’s included in the biomodal duet pipeline#
Pipeline components
Standard and bespoke trimming of duet +modC and duet evoC FASTQ files
FASTQ resolution to convert raw paired-end duet +modC and duet evoC reads into resolved, error-suppressed 4-base genomic reads with accompanying epigenomic information
Reference genome and control alignment using BWA_MEM
Lane-merging of aligned BAM files
Forking of aligned genome reads and control reads into separate BAM files and analysis pathways
Deduplication of the genome and long control BAM files
Pipeline outputs
Modified cytosine quantification in CpG context
Optional modified cytosine quantification in CHG and CHH contexts
Germline variant calling with optional joint variant calling, somatic variant calling, and allele-specific methylation calling
Genetic accuracy analysis using controls
Analysis and generation of quality metrics associated with the genome-aligned reads and the control-aligned reads
Generation of summary reports in Excel and HTML formats
About the pipeline and Command Line Interface (CLI)#
Important
The biomodal CLI and duet pipeline are not supported on Windows platforms. We recommend you use a recent version of Ubuntu or CentOS/Rocky/RHEL.
The biomodal pipeline and CLI use Nextflow as the orchestration tool, which leverages the capabilities of your compute platform to process your data through the duet pipeline. Nextflow acts as an orchestrator, performing the following tasks:
Launching Virtual Machines or HPC nodes
Copying input files to the virtual machines or HPC nodes
Downloading and running container images with appropriate software dependencies
Executing analyses on the virtual machines or HPC nodes
Transferring outputs from analyses to local or cloud storage
Organising output files into a convenient directory structure
Coordinating the channelling of outputs from one stage of analysis to the next
The following instructions are intended to be executed at the command line with Bash on a Linux platform, or via a Cloud Shell in a browser. Bash commands follow the structure command [OPTIONS] arguments. If you prefer to use the graphical user interface (GUI) of your cloud provider's console in a browser, searching for the command-line instructions will lead you to the platform-specific documentation describing the equivalent steps in the relevant cloud console.
Support#
Feel free to contact us at support@biomodal.com. If your inquiry is
related to the CLI or duet pipeline software, please include the output
of the biomodal info
command in your inquiry.
Downloading the installation script#
Before installing the CLI, you need to download the biomodal installation package. This section helps you:
Choose the correct script for your system (HPC or Cloud)
Confirm software requirements
Download and unzip the installation files
Important
The following section involves cloud infrastructure setup or requires HPC admin privileges so must be performed by a system administrator. Once the setup is complete, the analyses can be run with minimal technical involvement.
If you need help with any of the steps or installation, please contact us at support@biomodal.com.
Choose your environment#
Caution
This section contains information intended for system administrators
The biomodal duet pipeline supports two environments:
HPC clusters (Ubuntu, CentOS/RHEL supported)
Cloud platforms (AWS, GCP, Azure)
Use the flowchart below to help you decide which installation method is right for you:
[Flowchart: choosing between an HPC installation and a cloud installation]
Check software requirements & permissions#
Caution
This section contains information intended for system administrators
HPC installation requirements#
Make sure your HPC system has the following tools installed:
| Dependency | Purpose | Detail |
|---|---|---|
| bash | Command execution | Version 4+ |
| Java (more information here) | Required by Nextflow | Version 8+. Please check that this is supported by your version of Nextflow. |
| Nextflow (more information here) | Pipeline orchestrator | Version 21.06 onwards |
| Singularity or Apptainer | Container runtime | Please check that this is supported by your version of Nextflow. The biomodal CLI expects to find Singularity in $PATH, so please make sure that Apptainer installations include a symlink to the Apptainer binary named Singularity. |
| Google Cloud CLI (more information here) | Software download and authentication | This should be a recent version that supports the "gcloud storage" command, including the required gcloud-crc32c libraries. |
| tar, unzip | File compression tools | |
| rsync | File transfer | |
| curl | Software download | |
| jq | For JSON parsing | |
| Other OS support packages | gnupg, lsb-release, apt-transport-https, ca-certificates | |
Some local Conda installations may require Google SDK components to ensure the integrity of downloaded data. These components can be installed using:
gcloud components install gcloud-crc32c
Cloud installation requirements#
Cloud native software will be utilised on each respective platform to set up the complete pipeline environment. You will also need to install the Google Cloud CLI.
| Dependency | Purpose | Detail |
|---|---|---|
| Google Cloud CLI (more information here) | Required for all cloud installations | This should be a recent version that supports the "gcloud storage" command, including the required gcloud-crc32c libraries. |
Cloud permissions (if using Cloud VMs)#
Caution
This section contains information intended for system administrators
We recommend taking a least-privilege approach when granting users permissions to create cloud resources.
The cloud specific examples below demonstrate the minimum required permissions to bootstrap resources for AWS, GCP, and Azure environments.
AWS#
Run the following script to create a complete AWS pipeline environment:
./biomodal-cloud-utils create aws
Ensure GitHub repository access is allowed, and public IP assignment is enabled for AWS Batch. This feature can be manually enabled in the console, via the Auto-assign public IPv4 address option in the subnet's configuration. This will use an additional Terraform module we provide for resource setup (located here).
GCP#
We recommend using GCP’s predefined roles. These roles are created and maintained by Google, so you do not need to create custom roles.
Below are the recommended predefined GCP roles required to create the cloud resources for running this CLI.
| Role name | Purpose | Required |
|---|---|---|
| roles/storage.admin | If a new bucket is required | No (unless a new bucket is specified) |
| roles/artifactregistry.writer | Create required artifact registry | Yes |
| roles/iam.serviceAccountCreator | Creating required service account(s) | Yes |
| roles/compute.admin | Creating required compute resources and instance template | Yes |
| roles/iap.admin | Allowing IAP access to VMs | Yes |
| roles/resourcemanager.projectIamAdmin | To assign project-wide roles to service accounts | Yes |
| roles/storage.legacyBucketWriter | Allow read/write permission on existing bucket | No (required if wishing to use an existing bucket) |
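If your administrator needs to grant one of these roles, a minimal sketch using the gcloud CLI is shown below; the project ID and user e-mail are placeholders, not values from this guide:
# Illustrative only: grant a predefined role to the user who will create the cloud resources
gcloud projects add-iam-policy-binding my-gcp-project \
  --member="user:analyst@example.com" \
  --role="roles/compute.admin"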
Azure#
If you are planning on using Azure as your cloud environment, please get in contact with us at support@biomodal.com for tailored setup instructions.
Download the installation files#
Caution
This section contains information intended for system administrators
We provide a simple tool to download all the required scripts directly to your local computer, HPC node, or cloud VM.
Please run the following command from a Linux terminal or cloud shell:
bash <(curl -s https://app.biomodal.com/cli/download)
You’ll be prompted to authenticate with your existing biomodal username and password:
If you do not have a biomodal account, please contact us at support@biomodal.com
To reset your password, click here: https://app.biomodal.com/auth/pages/reset-password
After the download completes, unzip the file before you can start using the CLI scripts. You can use other tools, such as jar xvf, to uncompress the file, but the simplest option is unzip:
unzip biomodal.1.1.3.zip
In the unzipped biomodal.1.1.3 folder, you will see the following files:
| File or folder | Description |
|---|---|
| biomodal-cloud-utils | Script to create the AWS, GCP or Azure cloud environment |
| biomodal-hpc-utils-admin | System administrator installation script |
| biomodal-hpc-utils-conda | Conda-based installation script |
| _biomodal_functions.sh | Helper functions for the biomodal CLI |
| _biomodal_validate.sh | Validator functions for the biomodal CLI |
| biomodal | Main biomodal CLI script |
| | Customer authentication configuration settings |
| | Folder containing Terraform script software for each supported cloud platform |
| | Folder with a copy of all biomodal CLI documentation |
| | Images used by the documentation |
Authentication & configuration#
Caution
This section contains information intended for system administrators
Upon login, a ~/.biomodal-auth.json file will be created to hold the tokens required to authorise access to the biomodal resources.
The default location of this file is the user's $HOME directory.
Each user needs their own token file for running the CLI if event and metrics sharing is enabled (for more information, please see Sharing event & metric data with biomodal).
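If the token file is missing or has expired, logging in again recreates it, using the auth subcommand described later in this guide:
biomodal auth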
About software updates and sharing usage data#
Caution
This section contains information intended for system administrators
By default, the biomodal pipeline checks for software updates via Nextflow. To disable this feature:
export NXF_DISABLE_CHECK_LATEST=true
The CLI can send runtime event logs to biomodal to help support teams troubleshoot. For more information, please see Sharing event & metric data with biomodal.
Installing the biomodal CLI#
Important
The following section involves cloud infrastructure setup or requires HPC admin privileges so must be performed by a system administrator. Once the setup is complete, the analyses can be run with minimal technical involvement.
If you need help with any of the steps or installation, please contact us at support@biomodal.com.
Preventing timeout#
Caution
This section contains information intended for system administrators
Because some analyses can run for several hours or even days, we recommend running installation and pipeline commands inside a terminal multiplexer like tmux or screen. This ensures your job won't be interrupted if your connection drops or you close your terminal window.
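For example, a minimal tmux workflow (the session name is illustrative) looks like this:
tmux new -s biomodal        # start a named session
# ...run installation or pipeline commands inside the session...
# press Ctrl-b then d to detach and leave the job running
tmux attach -t biomodal     # re-attach later to check progress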
Installation on Cloud VMs (AWS / GCP / Azure)#
Caution
This section contains information intended for system administrators
We have created the biomodal-cloud-utils
script to assist
with the creation of cloud resources required to run the biomodal duet
pipeline with Nextflow. The currently supported clouds are GCP, AWS, and
Azure.
Please follow these steps from a Linux-compatible computer:
Install Terraform: see this link for OS-specific instructions
Ensure you have permissions to create cloud resources
Authenticate using your cloud provider CLI:
GCP:
gcloud auth login
AWS:
aws configure
Azure:
az login
Run the create command for your cloud:
./biomodal-cloud-utils create gcp
./biomodal-cloud-utils create aws
./biomodal-cloud-utils create azure
Connect to the new environment e.g.:
./biomodal-cloud-utils connect gcp
When complete, you can choose to remove the newly created resources e.g.:
./biomodal-cloud-utils destroy gcp
Warning
Removing your resources is a permanent action; only run this command when you are sure you no longer need them. Buckets to which you have added additional files should not be deleted.
Note
The cloud VM will be created with a public IP to allow for data ingress and egress. Please update this as required after software install, authentication, configuration, and testing.
Installing on HPC clusters#
Caution
This section contains information intended for system administrators
We have created two scripts to assist with installing required software and downloading the pipeline on an HPC cluster, or standalone server:
biomodal-hpc-utils-admin: Recommended for centralised deployment
biomodal-hpc-utils-conda: Used if the user lacks the necessary permissions to install software centrally, or when local user software installations need to be kept separate so they do not interfere with other installations. This option is not preferred.
Important
Please check and review every step before proceeding as every HPC is different. The following instructions are intended as a guide only.
The software installation scripts described in this section are not intended to be re-run when you upgrade to a newer version of the biomodal CLI or duet pipeline. Please see Updating the biomodal pipeline for more information on how to update the CLI and pipeline software.
Restrictive quotas on local disk resources and software install limitations are likely to cause issues during installation and update operations.
Admin installation (recommended)#
The admin installation script sets up everything your system needs to run the biomodal pipeline on a HPC cluster or standalone Linux server.
This script will:
Install necessary OS and system software (e.g. Singularity)
Create directories for input/output data
Configure access to biomodal resources
Generate the required configuration files
Executing the script will ask you to confirm that you have read the documentation and understand the implications of running it on your system:
Please confirm that you have read the documentation and
understand the implications of running this script on your system.
Important
While the script is interactive and user-friendly, it requires root (super-user) privileges to install software packages. Please ensure this is done by your system administrator. If you have any questions, please contact support@biomodal.com.
Note
The suggested HPC setup will use Singularity as an alternative to Docker, but please note that Singularity does not enable, nor propagate, root privileges to the Singularity images.
We recommend using tmux
or screen
to keep sessions alive for
long-running tasks. See Preventing timeout for
more information.
To begin, run the following script:
./biomodal-hpc-utils-admin
Follow the prompts and answer the required questions
CLI installation path
Enter cli installation path. : /biomodal
This is where the CLI software will reside on your system. Ideally choose a central location that users on the system can access if needed:
Default: /biomodal
Alternative: Choose any directory with enough free space and write access.
Note
The HPC users that will run the biomodal pipeline will require read and execute access to this location
Data bucket path
Enter data bucket path - shared location where ref data and images will be stored: /biomodal/data_bucket
This path will store reference data, pipeline images, and shared resources.
Default: /biomodal/data_bucket
Alternative: Choose a shared network storage location, if working in a group
This location can be re-used for other biomodal runs or installations in the future
Choose the Docker registry location closest to your physical location or data centre. This ensures faster downloads and fewer errors during pipeline execution.
Choose relevant geographical location of the biomodal docker registry (1-3)? 1) asia 2) europe 3) us #?
Set max concurrent jobs
Current queueSize is: 200 Would you like to change it? (y/n)
This controls how many jobs Nextflow can run at once. If jobs keep failing or hanging, try lowering this value later.
Yes: Enter a lower number if you’re working with limited HPC resources
No: Keep the default of 200 if you’re unsure
Select job scheduler (executor)
Which executor would you like to use? (choose 1-4) 1) slurm 2) lsf 3) sge 4) local #?
Select the job scheduler used by your HPC system:
slurm, lsf, or sge: Choose based on what your HPC uses
local: Use this only if you're running everything on a single server without a scheduler
Once the script finishes, you may be prompted to log out and back in again to apply the changes. In some cases, you might need to restart Linux system kernel services.
Your system is now ready to import and run the biomodal duet pipeline.
If you have any questions, please contact support@biomodal.com.
Conda installation (not preferred)#
Caution
This section contains information intended for system administrators
While the CLI may be available via a Conda environment, we do not recommend installing or running the biomodal CLI this way for production use. There are several important reasons for this:
File size limitations: Local Conda environments often enforce lower limits on maximum file and path sizes, which can cause problems during large-scale analysis
Data duplication: Using a local Conda environment may result in duplicated software resources across environments, increasing disk usage unnecessarily
Lack of central management: Installing the CLI via Conda means it is not centrally configured or controlled, which can introduce inconsistencies across users and compute nodes in the cluster
Important
For consistency, reliability, and optimal performance, we strongly recommend running the biomodal CLI using the admin installation script (see Admin installation (recommended)) at the network or shared software module level. This ensures environment variables, shared directories, and version updates are centrally managed across all users and nodes.
Executing the script will ask you to confirm that you have read the documentation and understand the implications of running it on your system:
Please confirm that you have read the documentation and
understand the implications of running this script on your system.
Are you sure you want to run this script? (y/n)
Important
Only use this if you are experienced with Conda and your cluster’s Linux environment. This method does not allow for shared installation and may lead to redundant storage usage.
Do not assume that all steps in these scripts will be an exact match
for your local server or HPC nodes. If you choose to use the
biomodal-hpc-utils-conda
script, we recommend you use an existing
working conda environment.
Note
Singularity or Apptainer must be installed outside of the Conda environment, by a root or sudo user.
We recommend using tmux
or screen
to keep sessions alive for
long-running tasks. See Preventing timeout for
more information.
To begin, run the following script:
./biomodal-hpc-utils-conda
Follow the prompts and answer the required questions
CLI installation path
Enter cli installation path, must be a directory you have read/write/execute access to. : /biomodal
This is where the CLI software will reside on your system. Ideally choose a central location that users on the system can access if needed.
Default: /biomodal
Alternative: Choose any directory with enough free space and write access.
Note
The HPC users that will run the biomodal pipeline will require read and execute access to this location
Data bucket path
Enter data bucket path - shared location where ref data and images will be stored: /biomodal/data_bucket
This path will store reference data, pipeline images, and shared resources.
Default: /biomodal/data_bucket
Alternative: Choose a shared network storage location, if working in a group
This location can be re-used for other biomodal runs or installations in the future
Choose the Docker registry location closest to your physical location or data centre. This ensures faster downloads and fewer errors during pipeline execution.
Choose relevant geographical location of the biomodal docker registry (1-3)? 1) asia 2) europe 3) us #?
Set max concurrent jobs
Current queueSize is: 200 Would you like to change it? (y/n)
This controls how many jobs Nextflow can run at once. If jobs keep failing or hanging, try lowering this value later.
Yes: Enter a lower number if you’re working with limited HPC resources
No: Keep the default of 200 if you’re unsure
Select job scheduler (executor)
Which executor would you like to use? (choose 1-4) 1) slurm 2) lsf 3) sge 4) local #?
Select the job scheduler used by your HPC system:
slurm, lsf, or sge: Choose based on what your HPC uses
local: Use this only if you're running everything on a single server without a scheduler
If you have not activated a Conda environment before starting the installation, miniconda will be installed. You will see the following message and be prompted to provide a location for your new Conda environment.
conda not found, installing miniconda.
Enter dependency installation root directory (default: home directory)
The script will now complete the installation of the remaining third-party software.
Note
Approximately 150,000-160,000 files (depending on external software versions at the time of installation) will be installed in your Conda environment.
Once the script finishes you may be prompted to log out and back in again to apply the changes.
Your system is now ready to import and run the biomodal duet pipeline.
HPC set up recommendations#
Caution
This section contains information intended for system administrators
Overview#
When running the duet pipeline on HPC clusters, it’s important to configure your environment correctly to avoid job failures and ensure optimal performance. This section outlines recommended settings for managing temporary directories, scratch space, memory and CPU limits, and scheduler-specific configurations.
These shell environment settings/variables should be added to your cluster environment configuration files, such as .bashrc, .bash_profile, or .profile (or similar), to ensure they are set correctly when jobs are submitted on your cluster.
Note
Defining these variables within your startup scripts will define these variables for the respective user, and so could impact other pipelines using Apptainer/Singularity. Please consult with your cluster administrator before making these changes.
If not using the duet software for an extended period of time, it is recommended to remove or comment out these lines until resuming analysis with the duet software pipeline.
Important
Always validate changes by running biomodal test before processing full datasets.
Confirm with your system administrator if you’re unsure about your scheduler’s default behaviour.
Allocating sufficient space for temporary files#
The pipeline (especially when using Singularity or Apptainer containers)
creates large temporary files in /tmp
. If your cluster has limited space
in this directory, jobs may fail due to insufficient disk space. We
recommend the following:
Bind a larger, custom temporary directory in your nextflow.config:
singularity {
enabled = true
autoMounts = true
runOptions = "--bind /scratch/tmp:/tmp"
}
This tells Singularity to use /scratch/tmp for its temporary files instead of the default /tmp.
Alternatively, use environment variables for flexibility:
singularity {
enabled = true
autoMounts = true
envWhitelist = "TMPDIR,SINGULARITY_TMPDIR,SINGULARITY_CACHEDIR,NXF_SINGULARITY_CACHEDIR"
runOptions = '--bind "$TMPDIR:/tmp"'
}
Set the following in your shell environment:
export TMPDIR=/your/larger/tmp
export SINGULARITY_TMPDIR=$TMPDIR
(Optional) to manage where containers are downloaded and stored, define:
export SINGULARITY_CACHEDIR=/your/cache/dir
export NXF_SINGULARITY_CACHEDIR=$SINGULARITY_CACHEDIR
Ensure that this cache directory matches the
libraryDir
setting innextflow.config
.
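As a sketch, assuming the default data bucket path suggested by the installation scripts (/biomodal/data_bucket), the cache variables and the libraryDir setting would all point at the same directory:
export SINGULARITY_CACHEDIR=/biomodal/data_bucket/singularity-images
export NXF_SINGULARITY_CACHEDIR=$SINGULARITY_CACHEDIR
# nextflow.config should then contain the matching setting:
#   singularity.libraryDir = "/biomodal/data_bucket/singularity-images"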
Additionally, try adding wider cluster options to the process section in
nextflow.config
if you are still experiencing issues related
to temporary files:
process {
executor = "slurm"
containerOptions = "--bind /<your local tmp path>:/tmp"
}
Important
We strongly recommend that both TMPDIR
and SINGULARITY_TMPDIR
environment variables are set on the head/controller node before any
jobs are submitted. This should ensure that the temporary files are
written to the correct location.
If you don’t do this, and your /tmp directory fills up, you may see an error like this:
Error in tempfile() using template /tmp/parXXXXX.par: Parent directory does not exist
Error in tempfile() using template /<your local tmp path>/parXXXXX.par: Parent directory (/<your local tmp path>/) does not exist at /venv/bin/parallel
Please note that the variable names may change slightly depending on whether you are using Singularity or Apptainer:
| Singularity variable | Apptainer variable |
|---|---|
| TMPDIR | TMPDIR |
| SINGULARITY_TMPDIR | APPTAINER_TMPDIR |
| SINGULARITY_CACHEDIR | APPTAINER_CACHEDIR |
| NXF_SINGULARITY_CACHEDIR | NXF_APPTAINER_CACHEDIR |
Ensure your Apptainer installation includes a Singularity symlink in your
$PATH
and add these variables to .bashrc
or equivalent config files.
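A minimal sketch for an Apptainer-only system (paths are illustrative and should be adapted to your cluster):
# Create a "singularity" symlink on $PATH that points at the Apptainer binary
sudo ln -s "$(command -v apptainer)" /usr/local/bin/singularity

# Add the temporary and cache directory variables to ~/.bashrc (or equivalent)
export TMPDIR=/scratch/tmp
export APPTAINER_TMPDIR=$TMPDIR
export APPTAINER_CACHEDIR=/scratch/apptainer-cache
export NXF_SINGULARITY_CACHEDIR=$APPTAINER_CACHEDIR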
Per-task or CPU memory reservation for LSF#
Default LSF cluster settings works with a per-core memory limits mode,
so it divides the requested memory by the number of requested cpus. If
your LSF cluster is configured differently, it is recommended that you
try to add the following settings to the LSF executor section in the
nextflow.config
file in the biomodal script folder.
executor {
name = "lsf"
perTaskReserve = false
perJobMemLimit = true
}
LSF Executor scratch space#
You can dynamically use LSF scratch space per job using the following
settings in the nextflow.config
file in the biomodal script
folder.
process {
executor = "lsf"
scratch = "$SCRATCHDIR/$LSB_JOBID"
}
This ensures temporary files are written to isolated, job-specific directories and cleaned up after job completion.
Wall-time and minimum CPU settings#
In some clusters, it is recommended to set a “wall-time”, i.e. the max
time a job can run before it is terminated. There may also be a “minimum
CPU” and/or queue requirement to start jobs in your cluster. You can set
these in your nextflow.config
:
process {
executor = "slurm"
memory = "16GB"
cpus = 4
nodes = 2
time = "24h"
params.cpuNum = 1
queue = "default"
}
Adjust executor
, time
, queue
, and
cpuNum
to match your local scheduler and policies.
Set queue, CPU, RAM and DISK per module#
The duet pipeline supports fine-grained resource customisation per
module using withName:
in the process
section.
This allows you to allocate additional resources for memory- or
CPU-intensive steps. For example, you can set the CPU, RAM, or DISK
requirements for the BWA_MEM2
and PRELUDE
modules as
follows:
process {
// Default settings for all modules
cpus = 16
memory = "16GB"
queue = "default"
time = "24h"
disk = "800GB"
// Specifically overriding settings for a module
withName: 'BWA_MEM2' {
memory = "64GB"
queue = "high-mem"
time = "48h"
}
withName: 'PRELUDE' {
cpus = 32
memory = "32GB"
queue = "large-cpu"
}
}
Use this to align specific pipeline steps with available HPC queues and avoid resource contention or throttling.
Memory settings for SGE/OGE/UGE clusters#
If you’re on a cluster that doesn’t support memory flags like
h_rt
, h_rss
, or mem_free
, you can
pass memory requirements via clusterOptions
.
Option 1: Apply globally
process {
cpus = 16
memory = "16GB"
queue = "short"
time = "24h"
disk = "800GB"
clusterOptions = { "-l h_vmem=${task.memory.toString().replaceAll(/[\sB]/,'')}" }
}
Option 2: Apply per module
withName: 'BWA_MEM2' {
cpus = 32
clusterOptions = "-l h_vmem=64GB"
memory = null
queue = "long"
time = "48h"
}
If your cluster uses per-core rather than per-job memory limits, divide the total memory by the number of CPUs when setting h_vmem (for example, a 64 GB job running on 32 CPUs would request -l h_vmem=2G).
Manual Installation for HPC users (Advanced)#
Caution
This section contains information intended for system administrators
You can opt to run a manual install of the CLI and Nextflow config files
if preferred. This section explains how to manually install and
configure the biomodal CLI and Nextflow settings if you’re not using the
biomodal-hpc-utils-admin
or biomodal-hpc-utils-conda
script.
Important
This setup is only intended for system administrators or advanced users who are comfortable managing configuration files and working with HPC environments.
Step 1: Copy the scripts
Follow the instructions in Manual update of the CLI and pipeline to copy the scripts, make them executable, and create the error configuration files in your preferred biomodal CLI installation directory (e.g. /biomodal).
Step 2: Create the config.yaml
file
The CLI requires a configuration file named config.yaml
to
define where to install and run the pipeline. Create this file in your
chosen biomodal CLI installation directory, with the following content:
platform: hpc
duet_version: 1.4.2
ref_pipeline_version: 1.0.1
bucket_url: <path to your input/output data storage>
work_dir: <path to a local or shared scratch directory>
init_folder: <path where the duet pipeline software will be installed>
biomodal_registry: <see table below>
biomodal_api_baseurl: https://app.biomodal.com
biomodal_releases_bucket: gs://cegx-releases
biomodal_reference_bucket: gs://release-reference-files
share_metrics: <see table below>
error_strategy: FailFast
Key information:
| Content | Detail |
|---|---|
| bucket_url | Directory or bucket where input/output data is stored |
| work_dir | A fast, temporary directory used during execution. It should have ~3x the input data size available |
| | If set to … |
| init_folder | The location where the duet pipeline and workflow tools will be stored |
| biomodal_registry | Choose one based on your region (the installation scripts offer asia, europe, and us) |
Step 3: Create the nextflow.config
file
This file is required to tell Nextflow how to execute jobs on your system and where to find container images. This local configuration file is applied after any settings defined in the standard duet pipeline configuration files.
Nextflow will automatically cascade the settings in the duet pipeline
configurations, followed by the nextflow.config
in the
versioned biomodal script folder within the init_folder directory (see
table above). This ensures that your customised settings are always
retained in the nextflow.config
file in the biomodal script
folder when upgrading to any new releases of the duet pipeline.
Create nextflow.config in the biomodal CLI installation directory with the following content:
singularity {
enabled = true
autoMounts = true
libraryDir = "<starts with same path as 'bucket_url' in config.yaml> + '/singularity-images'"
}
params {
registry = "<see below>"
}
process {
executor = "slurm" // or lsf, sge, or local
}
report {
overwrite = true
}
Choose one registry value based on your region; the installation scripts offer asia, europe, and us options.
Set the process section according to your HPC job scheduler (select local if no HPC scheduler is present), using one of the following:
process {
executor = "slurm"
}
process {
executor = "lsf"
}
process {
executor = "sge"
penv = "smp"
clusterOptions = "-S /bin/bash"
//If your system does not support h_rt/h_rss/mem_free settings,
//you can try to use the following settings:
//clusterOptions = { "-l h_vmem=${task.memory.toString().replaceAll(/[\sB]/,'')}" }
}
process {
executor = "local"
}
These settings are layered on top of the duet pipeline’s own configuration and will remain active across pipeline upgrades. For more information, see HPC set up recommendations.
Using the CLI#
This section walks you through how to use the biomodal CLI, from running commands to importing the pipeline. It also includes input file requirements, troubleshooting tips, and details on expected outputs.
Important
Please read through this entire section before starting your analysis and save it for future reference.
Command structure and usage#
To see the available subcommands, run the CLI without arguments:
biomodal
This will display:
biomodal v1.1.3
usage: biomodal [SUBCOMMAND] [OPTION]...
| Key CLI commands | Detail |
|---|---|
| auth | Login to biomodal |
| init | Import the duet pipeline, reference files, Docker images, and test data |
| test | Perform a test run using the provided dataset |
| analyse | Run the duet analysis pipeline on your data |
| | Check your parameters before launching a run |
| | Run the D(h)MR analysis |
| | Manage reference genome data and pipelines |
| | Share a metrics report with biomodal |
Each command has parameters and options. Use the biomodal
command for more details.
Input file requirements#
The following sections explain how to prepare your input data and directory structure.
Inputs
An optional metadata file can be used in .csv format detailing samples used in the experiment
A folder named nf-input containing the FASTQ files to be run in the pipeline. The FASTQ files require a specific format for:
File names
Sequence identifier
Inferred instrument type
These inputs are used to create a Read Group used during alignment. For more information, please see FASTQ filename format.
File structure#
The --input-path
folder provided for the duet pipeline must
contain the nf-input
folder (and the metadata .csv file).
Your input directory should look like this:
my_input_path
├── biomodal_run_meta.csv
└── nf-input
├── Sample-1_S1_L001_R1_001.fastq.gz
└── Sample-2_S1_L001_R1_001.fastq.gz
Note
For the D(h)MR workflow, please note that the --input-path
parameter should point to the full path of the relevant Zarr stores.
For more information on the D(h)MR workflow, see D(h)MR calling workflow.
FASTQ filename format#
FASTQ filenames must follow this pattern, which satisfies the naming convention used in BaseSpace:
{sample-id}_{sample-number}_{lane}_{R1|R2}_001.fastq.gz
Rules:
No underscores in
sample-id
Sample IDs should not start with a number
Must include the sample-number field (even if ignored)
Example: two samples, each pooled across two lanes:

| Read 1 (R1) | Read 2 (R2) |
|---|---|
| CEG900-123-678_S1_L001_R1_001.fastq.gz | CEG900-123-678_S1_L001_R2_001.fastq.gz |
| CEG900-123-678_S1_L002_R1_001.fastq.gz | CEG900-123-678_S1_L002_R2_001.fastq.gz |
| CEG900-123-679_S2_L001_R1_001.fastq.gz | CEG900-123-679_S2_L001_R2_001.fastq.gz |
| CEG900-123-679_S2_L002_R1_001.fastq.gz | CEG900-123-679_S2_L002_R2_001.fastq.gz |
FASTQ sequence identifier requirements#
The format of the sequence identifiers in FASTQ files can differ depending upon the software that was used to generate them. The pipeline requires the read identifiers in FASTQ files to comply with the format of the following example:
@A00536:706:HJNLHDRX3:1:2101:2718:1031 1:N:0:ACGGAACA+ACGAGAAC
The table below describes each of the identifiers and their purpose:
| Example data | Description |
|---|---|
| A00536 | Instrument name |
| 706 | Run ID |
| HJNLHDRX3 | Flowcell ID |
| 1 | Flowcell lane |
| 2101 | Tile number |
| 2718 | x-coordinate |
| 1031 | y-coordinate |
| 1 | Member of a pair |
| N | Filtered |
| ACGGAACA+ACGAGAAC | Index sequences |
Inferred instrument type#
The pipeline will infer the instrument type from the instrument name extracted from the first read in each FASTQ file based on the following convention:
| Read ID begins with | Instrument inferred as |
|---|---|
| @A | NovaSeq6000 |
| @LH | NovaSeqX |
| @VL | NextSeq1000 |
| @VH | NextSeq2000 |
| Others | Unknown |
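If you are unsure what will be inferred, you can inspect the first read identifier of a FASTQ file directly, for example:
zcat Sample-1_S1_L001_R1_001.fastq.gz | head -n 1
# e.g. @A00536:706:HJNLHDRX3:1:2101:2718:1031 1:N:0:ACGGAACA+ACGAGAAC  (inferred as NovaSeq6000)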
Use of the extracted flowcell ID and flowcell lane#
The sample ID extracted from the FASTQ filename, and the flowcell ID and flowcell lane extracted from the read ID of the first read in the FASTQ file will be used to construct a Read Group which will get passed into the aligner during the alignment step.
The following example shows the Read Group (@RG
) generated
for sample ID CEG900-123-678
, flowcell ID
HJNLHDRX3
and flowcell lane 1
:
@RG ID:HJNLHDRX3.1 PL:ILLUMINA PU:HJNLHDRX3.1.CEG900-123-678
LB:CEG900-123-678 SM:CEG900-123-678
Metadata file#
The pipeline can use an optional metadata file in .csv format, with one column per sample. The remaining cells in row 1 should contain sample IDs, and the remaining cells in column A should contain metadata field names. There are no restrictions on what these field names are, nor on how many there are.
The pipeline assumes that this file will be encoded in ASCII and that fields and field names will not contain any commas.
Must be named in the --meta-file argument
Must be placed in the same folder as the FASTQ files
The filename can be anything, but the --meta-file argument must exactly match the name of the metadata file. Please only use the metadata filename, not the full path.
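For example, with the directory layout shown in File structure above, the argument should name the file only (the first form below), not its full path (the second form):
--meta-file biomodal_run_meta.csv                          # correct: filename only
--meta-file /path/to/my_input_path/biomodal_run_meta.csv   # incorrect: full path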
The metadata file does not control which samples are analysed in the duet pipeline; all samples in the nf-input directory in your --input-path location will be processed.
Optional metadata file in .csv format example:
sample_id,CEG900-123-678,CEG900-123-679,CEG900-123-680,CEG900-123-681
sample-condition,case,control,case,control
sample-sex,male,male,female,female
lab-technician,JM,JM,ST,ST
project-id,X001,X001,X002,X002
Before you start#
Avoid session timeouts in long analysis runs#
To avoid session timeouts during long analysis runs, use a terminal multiplexer. For more information, please see Preventing timeout.
Check processing and storage requirements#
For more information on the minimum requirements for each module in the pipeline, please see Module requirements.
Please check the minimum resource requirements to ensure you allocate sufficient CPU, RAM, and disk space depending on your chosen analysis profile. Your cloud tenancy or HPC cluster should be able to allocate a set of VMs/nodes with these resources.
You can edit resource requirements by referencing specific modules in
the nextflow.config
file in the biomodal script folder to
accommodate available hardware resources. Please see examples in the
following sections for more information on how to increase or decrease
the resource requirements. Please note that the duet pipeline is less
likely to run successfully if the minimum resources are too restrictive.
If you have any questions, please contact support@biomodal.com.
Elevated resource profiles#
Each module file in the duet pipeline contains default resource settings in relation to CPUs, memory, and disk. These settings are intended to be sufficient for shallow sequencing runs where you have up to 50 million reads per sample per lane.
On deeper sequencing runs, you should invoke an elevated resource profile, which will allocate additional resources to jobs. For the deepest sequencing runs in UMI mode, it is also necessary to invoke a CLI parameter that distributes the duplicate identification step across intervals/contigs. The parameters to use are described in the table below:
| Reads per sample per lane | Profile | Additional parameter |
|---|---|---|
| Up to 50 million | Default | None |
| Up to 500 million | Deep_seq | --additional-profile deep_seq |
| Over 500 million | Super_seq | --additional-profile super_seq |
If using UMIs and very deep sequencing (>2B reads):
--additional-params dedup_by_contigs=true
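As an illustration only (paths, run name, and tag are placeholders), a very deep UMI run combining these options might be launched as follows:
biomodal analyse \
  --input-path /biomodal/data_bucket/my-deep-run \
  --meta-file biomodal_run_meta.csv \
  --output-path /biomodal/data_bucket/my-deep-run \
  --run-name my_deep_run \
  --tag my_deep_run \
  --mode 5bp \
  --additional-profile super_seq \
  --additional-params dedup_by_contigs=true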
Important
Invoking any of these profiles will require additional CPU, RAM, and disk resources to the duet pipeline. Please ensure you have sufficient storage and credit prior to starting any runs.
Limit local resources#
If you have to limit the duet pipeline CPU and RAM resources for the
local executor on a single machine, you can use the
cpu_num
and memory
parameters in conjunction with
the local or local_deep_seq profiles.
Note
Please note this only works with the local
executor, so
this will not have any effect if you are using HPC or cloud-based
executors.
Example for constrained local machine (limited to 32 CPUs and 128GB RAM):
--additional-profile local_deep_seq --additional-params cpu_num=32,memory="128GB"
Import the pipeline with biomodal init#
Before you start#
Once your cloud or HPC environment is set up, you can import the duet
pipeline and all required resources using the biomodal init
command.
To get started, run the following:
biomodal init
This command will:
Download the latest version of the duet pipeline and D(h)MR workflow (for more information on the D(h)MR workflow, see here)
Import all required reference files
Import the biomodal test dataset (RunXYZ)
Download and cache the required Docker or Singularity containers
Prepare the system for a successful test run and future analysis
Prompt you to enable metrics sharing
Important
Ensure you have authentication set up with your cloud provider
if using GCP, AWS, or Azure. Terraform must also be configured if you
are using the biomodal-cloud-utils
setup method.
Important
Please do not attempt to run biomodal test
or biomodal
analyse
if you experienced any problems during the bootstrapping or
biomodal init
process.
biomodal init
must always be successfully completed after a fresh
install and after any manual update of the CLI and/or duet pipeline.
Please review Elevated resource profiles to ensure you allocate sufficient hardware resources relative to the number of reads per sample before running the duet pipeline.
Running the biomodal init command#
During biomodal init
, you will be prompted to:
Enable event and metrics sharing with biomodal
Would you like to share the pipeline metrics report at the end of all successful analysis runs with biomodal? (1-2) 1) Yes 2) No #?
If enabled, performance metrics are shared to support ongoing improvements and help biomodal troubleshoot issues more effectively. No sample data is shared. For more information, please see Sharing event and metric data with biomodal.
Specify the duet pipeline installation location
The duet software location 'init_folder' property is not currently defined in the configuration file Would you like to set it to /<home_folder>? (1-2) 1) Yes 2) No #?
This sets where the CLI and pipeline components will be stored.
Choose an error handling strategy
Normal: Retry failed jobs up to 10 times
FailFast: Stop early if jobs fail
Please select which duet pipeline error strategy you would like to use: 1) Normal 2) FailFast #?
At this stage you will see the following downloads:
Genome reference data (unless already downloaded)
Container images (unless already downloaded)
Pipeline software
Small test dataset
Once complete, the CLI will download all files and verify the setup. If
any downloads are interrupted, you can clear the partial data and re-run
biomodal init
.
You’re now ready to proceed with a test run to confirm everything is working correctly.
Expected output#
Once you start a run (e.g. with biomodal test
), you’ll see an output like this:
N E X T F L O W ~ version 24.10.6
Launching /opt/apps/biomodal/latest/biomodal-duet-1.4.2/main.nf [boring_pasteur] DSL2 - revision: 65b941077a
executor > slurm (21)
[- ] resolve_align:FASTQC_RAW -
[1e/69d214] resolve_align:PRELUDE (CEG9330132-19-01, L001) [100%] 1 of 1 ✔
[- ] resolve_align:FASTQC_RESOLVED -
[b8/c8c72b] resolve_align:BWA_MEM2 (CEG9330132-19-01, L001) [100%] 1 of 1 ✔
[- ] resolve_align:SAMTOOLS_MERGE_LANES -
[- ] resolve_align:PRESEQ -
[d5/ec7739] bamlet_v2:ONE_STEP_BAMLET (CEG9330132-19-01) [100%] 1 of 1 ✔
[6e/df7d5c] bamlet_v2:QUALIMAP_BAMQC (CEG9330132-19-01) [100%] 1 of 1 ✔
[80/db515c] bam_stats_genome:TSS_BIAS (CEG9330132-19-01) [100%] 1 of 1 ✔
[- ] bam_stats_genome:DEEPTOOLS_PLOTFINGERPRINT -
[3a/292078] bam_stats_genome:PICARD_GCBIAS (CEG9330132-19-01) [100%] 1 of 1 ✔
[42/ff20ec] bam_stats_long_ctrl:LONG_CTRL_GENETIC_ACCURACY (CEG9330132-19-01) [100%] 1 of 1 ✔
[e2/880eac] variant_calling:HAPLOTYPE_CALLER (CEG9330132-19-01) [100%] 1 of 1 ✔
[- ] variant_calling:ASM -
[20/d45b69] mbias:GENOME_MBIAS (CEG9330132-19-01-CG-5-prime) [100%] 1 of 1 ✔
[08/442714] epiquant_v2_genome:EPIQUANT_BAM2ZARR (CEG9330132-19-01-CG) [100%] 1 of 1 ✔
[9f/594e09] epiquant_v2_genome:MODALITY_EXPORT (CEG9330132-19-01_CG) [100%] 1 of 1 ✔
[77/374a22] epiquant_v2_genome:JOIN_ZARR_STORES (CEGX_RunXYZ-genome-CG) [100%] 1 of 1 ✔
[b8/89e666] epiquant_v2_short_ctrl:EPIQUANT_BAM2ZARR (CEG9330132-19-01-CG) [100%] 1 of 1 ✔
[4e/73c13b] epiquant_v2_short_ctrl:MODALITY_EXPORT (CEG9330132-19-01_CG) [100%] 1 of 1 ✔
[- ] epiquant_v2_short_ctrl:JOIN_ZARR_STORES -
[8f/35d181] epiquant_v2_long_ctrl:EPIQUANT_BAM2ZARR (CEG9330132-19-01-CG) [100%] 1 of 1 ✔
[c4/74201b] epiquant_v2_long_ctrl:MODALITY_EXPORT (CEG9330132-19-01_CG) [100%] 1 of 1 ✔
[1c/575553] epiquant_v2_long_ctrl:JOIN_ZARR_STORES (CEGX_RunXYZ-long_ctrl-CG) [100%] 1 of 1 ✔
[e0/6821c2] summary:DQSPY (CEGX_RunXYZ) [100%] 1 of 1 ✔
[31/f8e5ed] summary:PIPELINE_REPORT (CEGX_RunXYZ) [100%] 1 of 1 ✔
[2b/9cc002] summary:DQSREPORT [100%] 1 of 1 ✔
[ad/d278e7] summary:MULTIQC (CEGX_RunXYZ) [100%] 1 of 1 ✔
Completed at: 15-Apr-2025 15:52:10
Duration : 33m 51s
CPU hours : 5.3
Succeeded : 21
Output directories overview#
Outputs from the Nextflow pipeline get published to your cloud storage bucket or output directory, organised into a directory structure.
Top-level:
gs://my-org-nextflow/test-runxyz
Subdirectories:
| Folder | Description | Detail |
|---|---|---|
| nf-input | FASTQ input files | FASTQ files should be copied into this directory prior to launching the pipeline. Please note that all FASTQ files undergo analysis regardless of sample info in the meta file. |
| nf-work | Intermediate files, logs | This contains logs and temp data, organised by Nextflow job ID. This is useful for debugging. |
| nf-results | Final analysis outputs | This is the directory where the outputs of the pipeline are stored, organised by pipeline run, sample, and pipeline module. This directory is described further below. |
nf-work folder#
This is the working directory for Nextflow where logs and files staged for downstream processes are stored. There will also be a directory that includes the pipeline version and the tags that you set when launching the pipeline.
The directory structure will comprise of hashes which match those displayed in Nextflow and which uniquely identify specific jobs launched by Nextflow. For example:
gs://my-org-nextflow/test-runxyz/nf-work/duet-1-0.1.0_2021-05-15_1656_5bp/01/af2b9a7434a6ca435b96c6b84cb9a2
This directory is useful for debugging the pipeline, examining logs and viewing the STDOUT from jobs. Inside this directory there will be subdirectories associated with each pipeline execution.
If the pipeline is running smoothly on your runs, you will rarely need
to look in the nf-work
directory. If a pipeline has completed
successfully and you have no intention of resuming it with modified
settings or examining logs, you can delete the contents of the
associated subdirectory inside the nf-work
directory.
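For example, using the cloud paths shown above (adjust to your own bucket and run subdirectory), the clean-up could look like:
# Only remove a run's work directory once you are sure you will not resume it
gcloud storage rm -r gs://my-org-nextflow/test-runxyz/nf-work/duet-1-0.1.0_2021-05-15_1656_5bp/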
nf-results folder#
| Subdirectory | Contents |
|---|---|
| reports | Sample-level and multi-sample reports summarising information about the samples and controls |
| sample_outputs | Primary data files generated by the pipeline (described in more detail below) |
| | BAM files and quantification files associated with the methylated lambda and unmethylated pUC19 controls. These small files are analogous to the BAM files and quantification files generated for your samples and may be useful for familiarising yourself with the file formats. Note that there is an accompanying FASTA file for the controls in the reference file directory. |
| | Secondary outputs from the pipeline, including a parameters log recording the parameters that were used to execute the pipeline, more extensive metrics to support more detailed data investigations, and the interim resolved FASTQ files that were passed into the aligner |
Key outputs#
reports/summary_reports: aggregate QC reports of all samples
reports/sample_reports: HTML QC reports per sample
sample_outputs/bams: Deduplicated BAM files
sample_outputs/modc_quantification: quantification files reporting modified cytosines
sample_outputs/variant_call_files: VCF files
For a more detailed overview of the duet pipeline outputs and explanation of the formats, please see the Bioinformatics data interpretation guide.
Removing data from the temporary pipeline folders#
The temporary pipeline folders created by the duet pipeline can be
removed upon successful completion of the pipeline run. If a pipeline
has completed successfully and you have no intention of resuming it with
modified settings or examining logs, you can delete the contents of the
associated subdirectory inside the nf-work
directory. Please see Output directories overview for more details about the pipeline folder structure.
Checking your set up: Performing test runs#
Before analysing your own data, it’s important to confirm that the pipeline is correctly installed and that your environment has been set up properly. The test runs below ensure that:
The duet pipeline and CLI are working as expected
Reference files and containers are accessible
Your compute environment has appropriate resources and permissions
Any issues with configuration, dependencies, or file paths are identified early
Run biomodal test#
Important
You should run the following tests immediately after completing
biomodal init
. If the test fails, do not proceed with analysing
your own data.
After biomodal init
completes without errors, run the
following to verify your setup:
biomodal test
Example output at the end of the test run:
Completed at: 10-Mar-2023 12:51:33
Duration : 45m 15s
CPU hours : 10.0
Succeeded : 21
If the test fails:
Check your internet connection, especially if you’re using cloud resources
Verify file paths in your
config.yaml
are correct and accessibleCheck authentication: rerun
biomodal auth
if credentials may have expiredInspect logs in the
nf-work
directory for errors
Reach out to support@biomodal.com with the test output and any relevant log files.
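Each Nextflow task directory inside nf-work contains .command.log, .command.out, and .command.err files. As a sketch (the path and task hash are illustrative), you can locate the error log for a failing task reported as, say, [1e/69d214] like this:
find /biomodal/data_bucket/test-runxyz/nf-work -path "*/1e/69d214*" -name ".command.err" -exec cat {} +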
Run an example analysis with a test dataset (RunXYZ)#
The RunXYZ dataset is a small, pre-configured test dataset provided by biomodal as part of the pipeline installation. It includes a minimal set of sample files and metadata to simulate a typical pipeline run. Running this test helps you verify that:
Your environment (local or cloud/HPC) is configured correctly
Required dependencies (e.g. Docker/Singularity, Nextflow) are working
Reference data and container images are accessible
Your CLI config (e.g.
config.yaml
) andnextflow.config
are validFile paths and permissions are correctly set up
This test typically takes 30-60 minutes depending on your environment. It requires significantly fewer resources than a full pipeline run and does not generate large intermediate or final files.
To run the test, enter the following command:
biomodal analyse \
--input-path /biomodal/data_bucket/test-runxyz \
--meta-file CEGX_Run_meta.csv \
--output-path /biomodal/data_bucket/test-runxyz \
--run-name CEGX_RunXYZ \
--tag CEGX_RunXYZ \
--additional-params lib_prefix=CEG9330132-19-01_S12_L001 \
--mode 5bp
Once completed, check the nf-results
directory for output
files and reports. This confirms that your system is ready to analyse
your own data.
Please note that this manual test command allows resuming the analysis if it is interrupted.
Important
Make sure biomodal init
has completed before running this
command. If the dataset wasn’t downloaded or configured correctly,
re-run biomodal init
.
Test with larger GIAB dataset (Optional)#
To simulate real-world conditions, biomodal recommends testing the pipeline with a larger, publicly available duet dataset from Genome In A Bottle (GIAB).
Important
This test will incur additional runtime and/or storage costs, particularly if run on cloud infrastructure. Ensure that your cloud environment has sufficient resources and budget allocated before starting the GIAB test.
Download the GIAB dataset (either +modC or evoC) which can be found here: https://biomodal.com/kb_articles/demo-data-instructions/
Run the pipeline:
biomodal analyse \
--input-path <path_to_giab_data> \
--meta-file CEGX_Run_meta.csv \
--output-path <path_to_output> \
--additional-profile deep_seq \
--tag giab_demo_data \
--mode <5bp|6bp>
The mode selected should be 5bp or 6bp depending on the demo data selected.
If you have any issues, please contact support@biomodal.com.
Troubleshooting and optimisation#
This section outlines common issues and performance optimisation strategies. If you have any questions or need further help, please reach out to support@biomodal.com.
Reduce disk usage#
The output footprint of a full pipeline run is typically ~1.4x the size of the input FASTQ files. You can reduce this by:
Disable publishing of resolved FASTQs#
Resolved FASTQ files are large (>50% of output) and not required for downstream analysis. To avoid publishing them:
--additional-params publish_by_default.prelude=false
Output CRAMs instead of BAMs#
The duet pipeline outputs sequence alignments as BAM files by default but offers the option to output genome sequence alignments as CRAM files. CRAMs will be outputted with a reference genome FASTA (which is required to read CRAMs). CRAMs save ~40% disk space compared to BAMs. To publish CRAMs:
--additional-params publish_crams=true
Resuming a disrupted pipeline#
If your run was interrupted, rerun the exact same command with:
--resume
This might be useful if your terminal session is accidentally terminated, such as by a break in network connectivity. Please ensure that you use the same parameters as the previous pipeline when resuming.
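For example, to resume the interrupted test-dataset run shown earlier, repeat the same command and append the flag:
biomodal analyse \
  --input-path /biomodal/data_bucket/test-runxyz \
  --meta-file CEGX_Run_meta.csv \
  --output-path /biomodal/data_bucket/test-runxyz \
  --run-name CEGX_RunXYZ \
  --tag CEGX_RunXYZ \
  --additional-params lib_prefix=CEG9330132-19-01_S12_L001 \
  --mode 5bp \
  --resume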
Low GCP CPU quotas#
If using GCP with limited CPU quotas (e.g. free trial accounts), please contact support@biomodal.com as a custom resource profile may be needed.
Orchestration VM#
Your orchestration VM only needs to run when launching pipelines. You can stop it from the cloud provider's console when inactive, and start it again when you need to launch another analysis run.
Updating the biomodal pipeline#
This section explains how to check for new versions of the duet pipeline and update both the CLI and D(h)MR workflow. For more information on the D(h)MR calling workflow, please see here.
Important
Updating the pipeline may require super-user admin privileges depending on your system setup. Please check with your system administrator if you are unsure.
Using the CLI to update the pipeline#
To check for available duet pipeline versions, run the following:
biomodal list
This will list all available versions of the duet pipeline software. Please be aware that some may be older than your current local version.
To install a newer version of the duet pipeline software please run:
biomodal download <version>
This updates containers, reference files, and test data for the selected version. For more information, please see Import the pipeline with biomodal init.
Note
This will overwrite the local files for the specified version. Other installed versions will not be affected.
You can also check for available updates with:
biomodal info
To directly update to the latest version of duet, and change your local settings, you can run:
biomodal init
This downloads any updated containers, reference files, and test data for the latest version. For more information, please see Import the pipeline with biomodal init.
Manual update of the CLI and pipeline#
You can manually update to CLI 1.1.3 and duet 1.4.2, or newer:
Backup your installation folder
Download and unzip the new CLI version (See Download the installation files)
Copy the following files from ./cli into your local installation directory:
_biomodal_functions.sh
_biomodal_validate.sh
biomodal
Update failfast.error.config and retry.error.config with the new settings below.
HPC version
failfast.error.config:
process {
errorStrategy = { sleep(Math.pow(2, task.attempt) as long); return task.exitStatus in [14] ? 'retry' : 'terminate' }
}
retry.error.config:
process {
errorStrategy = { task.attempt.toString().isInteger() ? sleep(Math.pow(2, task.attempt) as long) : sleep(1) ; return task.exitStatus in [14] ? 'retry' : 'ignore' }
maxRetries = 10
}
Cloud version
failfast.error.config:
process {
errorStrategy = { task.attempt.toString().isInteger() ? sleep(Math.pow(2, task.attempt) as long) : sleep(1) ; return task.exitStatus in [0,10,14,50001,50002,50003,50004,50005,50006] ? 'retry' : 'terminate' }
maxRetries = 1
}
retry.error.config:
process {
errorStrategy = { task.attempt.toString().isInteger() ? sleep(Math.pow(2, task.attempt) as long) : sleep(1) ; return task.exitStatus in [0,10,14,50001,50002,50003,50004,50005,50006] ? 'retry' : 'ignore' }
maxRetries = 10
}
Make the scripts executable:
chmod +x biomodal _biomodal_functions.sh _biomodal_validate.sh
Download the latest duet version:
biomodal download 1.4.2
Alternatively, run biomodal init to update your local preferences and verify that you have all the required container images and reference files.
Edit your
config.yaml
to add:
ref_pipeline_version: 1.0.1
Add the following to nextflow.config:
report {
overwrite = true
}
Run a test to validate the update:
biomodal test
You should now be ready to begin using the updated software.
Advanced analysis features#
CHG and CHH modification calling#
By default, the pipeline calls modified cytosines in CpG context only. However, the pipeline includes the capability to call modified cytosines at CHG and CHH sites (where H is the IUPAC disambiguation code representing a DNA nucleotide other than G). To invoke CHG and CHH modification calling, add the following flag when launching the pipeline:
--chg-chh-contexts true
Mapping Quality Filtering#
By default, the duet pipeline discards reads from the BAM file with a MAPQ (Mapping Quality) score of 0. However, you can adjust the stringency of this filter as needed.
To include all reads regardless of mapping quality:
--additional-params bamlet.min_map_qual=0
To retain only reads in the BAM with MAPQ >=20, thereby discarding lower-confidence mappings:
--additional-params bamlet.min_map_qual=20
The pipeline also supports more stringent MAPQ filtering specifically for reads used in epigenetic quantification. For example, to keep reads in the BAM with MAPQ >=1 but restrict epigenetic quantification to reads with MAPQ >=20:
--additional-params bamlet.min_map_qual=1,epiquant.min_map_qual=20
Epigenetic Quantification Options#
Epigenetic Quantification Output Formats#
In addition to the Zarr store, the duet pipeline can produce four types of plain-text epigenetic quantification output: bedmethyl, cxreport, bismark, or bedgraph. The default is to produce cxreport, but any combination can be requested. To produce cxreport and bismark:
--quantification-output cxreport,bismark
Limit epigenetic quantification to primary chromosomes#
By default, the duet pipeline will quantify epigenetics at each CG context present on contigs included in the primary assembly BED (if the species has an available primary assembly BED). In the GRCh38 human reference this includes unlocalised scaffolds (e.g. chr1_KI270706v1_random). To exclude these scaffolds from epigenetic quantification (note they will still be present in the BAM):
--additional-params epiquant.primary_chromosomes_only=true
Include SNVs in CG coverage#
By default, the pipeline counts total coverage at a reference CG position as the number of Cs (modified or unmodified) at a given position. However, you can optionally include A, G and T base calls in total coverage:
--additional-params modality.export.only_c_in_coverage=false
Note that the total coverage in methylation quantification outputs is used as the denominator when calculating the percentage methylated/hydroxymethylated.
Output uncovered CG contexts#
By default any reference CG contexts which are uncovered in a given sample will be omitted from the plain-text quantification outputs. To include zero-coverage CG contexts in quantification outputs:
--additional-params modality.export.skip_rows_zero_counts=false
Variant Calling and Variant Associated Methylation#
The duet pipeline contains modes for calling germline variants using GATK HaplotypeCaller and for calling somatic variants using GATK Mutect2.
Additionally, an Allele Specific Methylation (ASM) module can be activated, which will evaluate each heterozygous germline variant for the presence of methylation differences between the two alleles.
Germline Variant Calling#
By default, germline variant calling is performed on each sample individually using GATK HaplotypeCaller generating a separate VCF file for each sample.
Germline variant calling can be disabled by setting the following parameter when launching the pipeline:
--additional-params call_germline_variants=false
Joint Variant Calling#
The pipeline also supports joint variant calling across multiple samples. Joint variant calling can be activated by adding the following parameter when launching the pipeline:
--use-gvcf true
This will cause the GATK HaplotypeCaller to generate per-sample gVCF files (instead of VCF files) which will then be consolidated via GATK GenomicsDBImport followed by joint genotyping using GATK GenotypeGVCFs.
Allele Specific Methylation#
The Allele Specific Methylation (ASM) module can be activated to evaluate each heterozygous germline variant for the presence of methylation differences between the two alleles. The ASM module is disabled by default, but can be activated by adding the following parameter when launching the pipeline:
--compute-asm true
To perform ASM calling, variants will be ‘atomized’ using bcftools norm so that complex variants are decomposed - i.e. MNVs will be split into consecutive SNVs. An ASM file will be generated for each sample.
If the --compute-asm true
parameter is combined with the
--use-gvcf true
parameter described above, an ASM file will still be
generated for each sample, but ASM calling will be performed using the
atomized outputs from joint variant calling. A subsequent step will
combine per-sample ASM calls into a single file.
Somatic Variant Calling#
The pipeline also supports somatic variant calling which is performed using GATK Mutect2 followed by GATK-recommended filtering using FilterMutectCalls. This is performed in the “tumour-only” mode of Mutect2 (i.e. it is not run using pairing of a tumour / matched normal).
Somatic variant calling can be activated by adding the following parameter when launching the pipeline:
--additional-params call_somatic_variants=true
The filtering of somatic variants using FilterMutectCalls happens by default if somatic variant calling is activated, but filtering can be disabled by adding the following parameter when launching the pipeline:
--additional-params filter_somatic_variants=false
Subsampling#
If you wish to subsample the reads in your sample prior to processing (such as to ensure equal numbers of reads in each sample for inter-library comparison purposes), then this is possible by adding the following parameter and indicating the maximum number of reads that you want to retain on a single lane for a single sample:
--additional-params subsample=50000000
Important
This subsampling is performed per sample, per lane, so in the above example, if there were 4 lanes, then this would limit the processing to 50M reads per sample per lane, so 200M reads in total per sample.
Alternative Supported Reference Genomes#
By default, the duet pipeline will align against a reference file containing the human reference genome GRCh38 and the sequences of the spike-in controls supplied with the kit. It is also possible to run the pipeline against other available reference genomes (e.g. mouse). First, download the preferred reference data from biomodal:
biomodal reference list
biomodal reference download GRCm38p6
Then the following flag can be used when launching the pipeline:
--reference-genome GRCm38p6
Alternatively you can create your own reference genome by following the instructions in the make reference genome documentation.
Supported Mouse References#
The duet pipeline supports three versions of the mouse reference: GRCm38, GRCm38p6, and GRCm39.
GRCm38 (2012): the original GRCm38 with additional patch fix contigs, resulting in a large reference (>12 Gbp) requiring more resources to process data. Only use this if your study requires detailed patch-level alignment. Note that reads aligning to contigs which are not present in the primary assembly BED (including patch fix contigs) will be discarded by the duet pipeline.
GRCm38p6 (2017): final patch version of GRCm38, incorporating cumulative patch updates into the primary assembly. Use this reference for studies requiring compatibility with legacy GRCm38 tools and datasets.
GRCm39 (2020): latest mouse genome assembly without legacy patches.
Processing individual samples#
Due to resource constraints, there may be a need to process individual
samples in the duet pipeline. This will require a smaller amount of CPU
and RAM resources as you will analyse each sample separately. You can
use the lib_prefix
parameter when executing the duet pipeline to
specify the prefix of the library to process. You can add the prefix to
the --additional-params
parameter to your biomodal analyse
command as per this example:
--additional-params lib_prefix=SAMPLE1_L001
Enable generation of the FASTQC report#
By default, the FastQC report is not generated in the duet pipeline. To
enable generation of the report on your raw FASTQ files, please add the
following parameter to biomodal analyse
when launching the pipeline:
--additional-params run_by_default.fastqc.raw=true
Target enrichment#
Overview#
The biomodal duet pipeline includes a mode for processing data generated using targeted panels. The analysis pathways for processing targeted libraries are slightly different because the pipeline calculates targeted metrics and makes use of additional reference files describing the bait and target regions associated with the target panel being used.
Processing data using a supported target panel#
Relevant reference files for the following target panels are available by default:
Twist Pan-cancer Methylation Panel
Twist Human Methylome Panel
A targeted analysis for either of these panels can be achieved by
passing the following parameters to the biomodal analyse
command.
Parameter | Description
---|---
--targeted | Set to true
--targeted-panel | For the Twist Alliance Pan-cancer Methylation Panel set to twist_cancer
Here is an example biomodal analyse
command with these additional
parameters passed in:
biomodal analyse \
--input-path [path_for_input_files] \
--output-path [path_for_output_files] \
--meta-file [meta_file_name] \
--run-name [name_for_run] \
--tag [date_time_or_other_tag] \
--additional-profile deep_seq \
--targeted true \
--targeted-panel twist_cancer
For any other target panels, it is still possible to perform a targeted analysis, but it is necessary to generate relevant reference files first. The remaining instructions describe how to generate the necessary reference file for alternative target panels and how to launch the pipeline in these circumstances.
Processing data using alternative target panels#
Three additional files are required when running the duet pipeline in
targeted mode with alternative target panels. These are the bait and
target interval files and a BED file containing the target coordinates.
The target and bait intervals are used by gatk CollectHsMetrics
to
calculate target metrics including target coverage and target
enrichment. The BED file is used to filter methylation and variant calls
down to the targeted regions of interest.
Generating bait and target interval files#
Bait and target interval files are generated from BED files containing
bait and target coordinates with the tool picard BedToIntervalList
as shown below:
# Make probes intervals
picard BedToIntervalList I=probes.bed \
O=reference_probe.intervals \
SD=reference.dict
# Make targets intervals
picard BedToIntervalList I=targets.bed \
O=reference_targets.intervals \
SD=reference.dict
Note that this command should be run independently of the biomodal duet multiomics solution CLI in an environment with picard available.
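If picard is not already available, one option is to install it into a conda environment from the bioconda channel (a sketch; your site may prefer an environment module or a container instead):
conda create -n picard -c bioconda -c conda-forge picard
conda activate picard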
This process requires an additional genome dict file which is created from a FASTA file with your reference genome of interest:
picard CreateSequenceDictionary \
R=reference.fasta \
O=reference.dict
If you are working with the reference files provided with the pipeline, then you can obtain a genome dict file from the following locations in the reference file directory:
Human:
duet/duet-ref-0.1.0/reference_genomes/GRCh38Decoy/GRCh38Decoy.dict
Mouse: duet/duet-ref-0.1.0/reference_genomes/GRCm38p6/GRCm38p6.dict
Launching the duet pipeline in targeted mode#
You should use the same profiles and commands when launching the duet
pipeline in targeted mode as you would when running the duet pipeline in
non-targeted mode, except that there are additional parameters and file
paths that need to be supplied as key-value pairs using the
--additional-params
option when running the pipeline in targeted
mode with UMIs.
Parameter | Description
---|---
--targeted | Set to true
targeted_bait | The path to the bait interval file generated from the steps described above
targeted_target | The path to the target interval file generated from the steps described above
targeted_bait_bed | The path to the BED file with bait coordinates
These additional parameters should be supplied as key-value pairs to the
biomodal analyse
command when launching the pipeline by using the
--additional-params
option, as in the following example:
biomodal analyse \
--input-path [path_for_input_files] \
--output-path [path_for_output_files] \
--meta-file [meta_file_name] \
--run-name [name_for_run] \
--tag [date_time_or_other_tag] \
--additional-profile deep_seq \
--targeted true \
--additional-params targeted_bait=[path_to_file],targeted_target=[path_to_file],targeted_bait_bed=[path_to_file]
Automation#
You can automate the auth
and init
stages of the biomodal CLI
for CI and test purposes.
“biomodal auth” stage#
When running a biomodal auth
command you can optionally pass in
username and password credentials using the command line flags below:
auth Login to biomodal
-u | --username <email> username
-p | --password <password> password
--password-stdin Take the password from stdin (recommended)
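For example, a CI job could authenticate non-interactively by piping a stored secret to the CLI (the email address and the CI_BIOMODAL_SECRET variable below are placeholders):
echo "$CI_BIOMODAL_SECRET" | biomodal auth --username user@example.com --password-stdin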
Usernames and passwords can also be passed in as the environment variables BIOMODAL_USERNAME and BIOMODAL_PASSWORD. Environment variables will always take precedence over values supplied via command-line flags, so please bear this in mind if using this method.
“biomodal init” stage#
When re-running the biomodal init
command you can skip prompts for
setting your temp directory and init folder location by setting an
environment variable IGNORE_INIT_PROMPTS=true
. Please note that you
have to either edit the config file or follow the prompts the initial
time the script is run.
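For example, once the initial interactive setup has been completed, a scripted re-run could look like this:
export IGNORE_INIT_PROMPTS=true
biomodal init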
You can also do this by adding ignore_init_prompts: true
to your
config.yaml
file.
You can also skip biomodal CLI prompts for error strategy by setting an
error strategy in your config.yaml
file.
Normally you’ll be prompted to choose either a normal
or
failfast
error strategy. If you set either one of these values in
your config.yaml
file the biomodal CLI will skip this prompt:
error_strategy: <normal|failfast>
Alternatively you can set the environment variable ERROR_STRATEGY
to
normal
or failfast
.
Finally you can also define whether to share events and metrics reports
by either setting the SHARE_METRICS
environment variable to true
or false
, or by setting the value directly in the config.yaml
file:
share_metrics: <true|false>
If neither of these are set, the biomodal CLI will require user input as to whether to set these values.
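Putting these together, a config.yaml that suppresses all three prompts might contain entries such as the following (values are illustrative; choose the ones appropriate for your environment):
ignore_init_prompts: true
error_strategy: failfast
share_metrics: false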
Note Settings in the config.yaml
file will take precedence over
environment variables. If these values are both set but there are
disparities between the two, you may experience unexpected results.
Reference genome pipeline#
You can download the reference genome pipeline from the biomodal CLI. The reference genome pipeline is used to create new reference genomes for the duet pipeline. The reference genome pipeline is a separate pipeline from the duet pipeline and is not required to run the duet pipeline using the default reference genomes provided by biomodal.
Checking available reference pipeline versions#
The reference pipeline is not downloaded by default during a new biomodal software install. To view the list of compatible reference pipeline versions that can be downloaded in the next step, run the following command:
biomodal reference pipeline list
Reference pipeline download#
This step will download and set up the necessary files and scripts to run the reference pipeline.
To download a new version of the reference genome pipeline, run the following command:
biomodal reference pipeline download <version>
The version number should be the version you want to download. You can
find the available versions by running the
biomodal reference pipeline list
command.
Please note that if you have already downloaded a version of the reference genome pipeline, downloading an update of the duet pipeline will also download updates to the reference pipeline, should this be required.
Downloading this pipeline requires a new parameter entry, ref_pipeline_version, in the config.yaml file. This will be added automatically when you download the reference pipeline. Please note that the reference pipeline version is not the same as the duet pipeline version.
Make new reference genomes#
Running the reference genome pipeline#
To make new reference genomes using the reference genome pipeline, run the following command:
biomodal reference make \
--input-path <your-input-path> \
--output-path <your-output-path> \
--species <reference-genome-species> \
--reference-genome <reference-genome-official-name>
Parameter | Description
---|---
--input-path | Path for the reference genome gzipped FASTA file
--output-path | Custom output disk location or bucket URL
--species | Required: reference species
--reference-genome | Genome Reference Consortium official name
Please note that all parameters are required to run this pipeline.
The reference pipeline will not write directly to the central duet
pipeline reference location, but will write to a staging location
specified in the --output-path
parameter. This enables manual
verification of the output before moving it to the reference location.
It also prevents the pipeline from needing write access to the reference location.
Reference genome input and output files requirements#
Reference genome input file requirements#
To run the pipeline on a new species, it is necessary to generate new reference files by running the reference file pipeline. This takes a gzipped FASTA file for the reference genome.
With the exception of the new FASTA reference genome file, all of the files above are supplied with the duet pipeline and are assumed to have been downloaded and unpackaged according to the instructions supplied. These files will be automatically detected when the Alternative Reference Genome Pipeline is run.
The reference genome for the species required should be obtained from an appropriate source, e.g. the Genome Reference Consortium or by searching the Ensembl or EnsemblPlants genomes database.
When you have downloaded the gzipped FASTA file, please provide the full
location and filename in the --input-path
parameter. The pipeline
will automatically detect the file and use it to generate the reference
genome.
Some further notes about preparing the FASTA reference genome:
Some sources supply reference genomes in FASTA format but not gzipped, for example inside a zip bundle with some metadata files. In these cases, you will need to gzip your FASTA file before you can use it as an input to the reference genome pipeline by running
gzip /path/to/my/reference.fasta
The contig names inside the FASTA file (the lines that start with the > character) are important since they are used inside the duet pipeline. Sometimes these contigs have long names; so that the duet pipeline can run effectively, you will need to either rename these contigs to match those you add to the genome config file or supply a mapping to map the long names to the names you supply in the config. For example, your contig name might be >NC_044371.1 Cannabis sativa chromosome 1, cs10, whole genome shotgun sequence and you might want to rename it to >chr1 so that you can just add chr1 to the list of contigs in the genome config file. A renaming sketch is shown after this list.
Make sure the contigs in the reference FASTA contain what you expect:
Not all reference downloads contain the mitochondrial chromosome, which might be important for your analysis.
There may be other contigs that are needed for your reference that are downloaded separately, such as the chloroplasts for a plant genome.
Many references contain unmapped scaffold contigs; these may be important but should probably be excluded from the list of contigs you supply in the genome config file.
You can inspect the contigs by running:
grep ">" /path/to/my/reference.fasta
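As referenced above, if you choose to rename long contig headers rather than supply a mapping, a minimal renaming sketch (using the example header and target name from this list; adapt the pattern to your own contigs) is:
# Shorten the chromosome 1 header to ">chr1", leaving the sequence lines untouched
sed 's/^>NC_044371.1 .*/>chr1/' reference.fasta > reference.renamed.fasta
gzip reference.renamed.fasta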
Reference genome output file locations#
The output files will be stored in the location specified by the
--output-path
parameter above. Here you will find the following top
level sub-directories:
Location | Description
---|---
reference_genomes/ | Reference genome output
| Controls output
In the reference genome output location, the files are:
Location | Description
---|---
| Combined reference genome and controls FASTA
| Combined reference genome and controls FASTA index
| BED file containing reference genome contig definitions
<reference-genome>.chrom.sizes | List of reference genome contig sizes
| List of reference genome contig names
| List of both reference genome and controls contigs in dictionary format
| Primary assembly BED for the reference genome. This will be exactly the same as the reference genome BED, except for human, mouse or zebrafish
After the pipeline has run successfully, the reference genome files, and
a set of existing couplet q-tables files and controls, should be copied
to the shared reference genomes folder where the existing genomes are
held. This is the same location as the default reference genomes
provided when you installed the duet pipeline.
(<default bucket/folder location of your pipeline data>/reference_files/
).
Example using GCP storage bucket:
# Set the relevant reference genome version
REF_VERSION=1.0.1
# Copy the reference genome files to the shared reference genomes folder
gcloud storage cp --recursive \
"gs://bucket-name/reference-output-folder/reference_genomes/GRCz11/*" \
gs://bucket-name/reference_files/${REF_VERSION}_GRCz11/duet/duet-ref-${REF_VERSION}/reference_genomes/GRCz11/
# Copy existing couplet qtables to the new reference genome folder
gcloud storage cp --recursive \
gs://bucket-name/reference_files/${REF_VERSION}_GRCh38Decoy/duet/couplet \
gs://bucket-name/reference_files/${REF_VERSION}_GRCz11/duet/
# Copy existing controls to the new reference genome folder
gcloud storage cp --recursive \
gs://bucket-name/reference_files/${REF_VERSION}_GRCh38Decoy/duet/duet-ref-${REF_VERSION}/ss_ctrls \
gs://bucket-name/reference_files/${REF_VERSION}_GRCz11/duet/duet-ref-${REF_VERSION}/
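If your reference files are held on a local or HPC filesystem rather than in a cloud bucket, the equivalent copy might look like the following sketch (the /data/biomodal prefix is a placeholder for your own installation path):
# Set the relevant reference genome version
REF_VERSION=1.0.1
# Copy the reference genome files to the shared reference genomes folder
cp -r /data/biomodal/reference-output-folder/reference_genomes/GRCz11 \
  /data/biomodal/reference_files/${REF_VERSION}_GRCz11/duet/duet-ref-${REF_VERSION}/reference_genomes/
# Copy existing couplet qtables to the new reference genome folder
cp -r /data/biomodal/reference_files/${REF_VERSION}_GRCh38Decoy/duet/couplet \
  /data/biomodal/reference_files/${REF_VERSION}_GRCz11/duet/
# Copy existing controls to the new reference genome folder
cp -r /data/biomodal/reference_files/${REF_VERSION}_GRCh38Decoy/duet/duet-ref-${REF_VERSION}/ss_ctrls \
  /data/biomodal/reference_files/${REF_VERSION}_GRCz11/duet/duet-ref-${REF_VERSION}/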
The controls output files are:
Location | Description
---|---
| BED file containing long control contig definitions
| BED file containing long control contig names
| BED file containing long control contig sizes
| BED file containing short control contig definitions
ss-ctrls-short-v24.chrom.sizes | BED file containing short control contig sizes
| BED file containing short control contig names
Note that the controls output files will be exactly the same as those already provided with the duet pipeline.
The original input files are also copied to the reference genome/controls subdirectory, as applicable.
Reference genome config file#
Once the pipeline has been run successfully and the outputs checked, the reference genome directory (i.e. <output-path>/reference_genomes/GRCz11) should be copied to the reference_genomes folder where the existing genomes are held.
The CLI will generate a config file template. The alternative reference can be used by providing a set of specified fields to the pipeline in the correct format, for example:
params {
genomes {
'<<ref_genome>>' {
contigs = ''
mapping = false
autosomes = ''
mitochondria = ''
primary_assembly = true
primary_assembly_bed = "<<ref_genome_path>>/duet/duet-ref-<<ref_pipeline_version>>/reference_genomes/<<ref_genome>>/<<ref_genome>>_primary_assembly.bed"
target_intervals = false
species = '<<species>>'
}
}
}
This reference genome template file should be edited to include the correct sections for the new reference genome. For any new reference genome, you will need to define these parameters as follows:
Field | Description, Usage and Optionality
---|---
contigs | A list of chromosomes. This is used: (1) during methylation quantification as a specification of the primary chromosomes over which summary methylation quantification metrics should be calculated: mean CpG methylation rates will be calculated for each chromosome in this list; (2) during variant calling as a specification of the 'large' primary assembly contigs. This information is used to support parallelisation of variant calling processes: variant calling will be performed separately in parallel for chromosomes in this list. Variant calling on any other contigs (such as alt contigs or decoy sequences) will be performed in one further parallel step.
mapping | A regular expression that can be specified to combine methylation quantification summary statistics from contigs that have a common prefix. If the regular expression matches a contig name, then quantification statistics from that contig will be attributed to the first matching group returned when the regular expression is applied to the contig name.
autosomes | A list of the autosomes. This is used during methylation quantification to specify the chromosomes that should be used when calculating genome-wide methylation levels. This excludes the allosomes as they would cause a sex-bias in the calculation of genome-wide methylation levels.
mitochondria | The name of the mitochondrial chromosome in the reference genome. This is used during methylation quantification to identify the contig over which the mitochondrial methylation rate will be calculated.
primary_assembly | A Boolean field indicating whether there is a separate subset of contigs that are considered to be the 'primary assembly'. This should be set to "false" as a distinct primary assembly is not currently generated for alternative reference genomes.
species | Binomial name for the species.
primary_assembly_bed | The path to a primary assembly BED file if relevant. Usage is described above.
<<ref_genome>> | Genome Reference Consortium name supplied when launching the reference pipeline
<<species>> | Species supplied when launching the reference pipeline
<<ref_genome_path>> | Automatically applied when launching the reference pipeline. Location of the final reference genome files. This is the shared location expected by the duet pipeline
<<ref_pipeline_version>> | Automatically applied when launching the reference pipeline, additionally stored in the CLI
The filename will be in the same data-bucket location as for other CLI
results and will have the name of the chosen reference genome. Example
is Ptrichocarpa_v4-1.config
if you launched the pipeline with the
reference genome Ptrichocarpa_v4-1
.
Module hardware requirements#
Please see the complete overview of module requirements for the reference pipeline.
The module MAKE_BWA_MEM2_INDEX
may require a large amount of memory
but can be run on a single core. Looking at the
BWA_MEM2 documentation, the
recommendation is to allocate about 28 x the genome size. For example,
the human genome is around 3 gigabases, and therefore requires 3 x 28 =
~84GB RAM. Note that this figure needs to include any additional decoy
contigs that are part of the reference FASTA you are using.
You may have to add a resource section in the nextflow.config
file
in the biomodal script folder to specify the hardware requirements for
the MAKE_BWA_MEM2_INDEX
module in the reference pipeline.
For example:
process {
    withName: 'MAKE_BWA_MEM2_INDEX' {
        cpus = 1
        memory = "320GB"
    }
}
Using new reference genomes in the duet pipeline#
To use this new reference genome with the biomodal duet +modC evoC
multiomics solution bioinformatics pipeline, please include the
following parameter to the biomodal analyse
command:
--reference-genome <reference-genome>
--reference-genome-profile <reference-genome-profile-file-path>
The --reference-genome
parameter should be set to the name of the
reference genome you have created. The --reference-genome-profile
parameter should be set to the path of the reference genome profile you
have created. The default location of this file is in the output
directory you specified for the biomodal reference make command.
D(h)MR calling workflow#
Overview#
The D(h)MR calling workflow is used to identify differentially methylated or differentially hydroxymethylated regions between groups of samples analysed using the duet pipeline
The D(h)MR calling workflow uses the Zarr store output from the duet pipeline as an input. It can also take multiple Zarr stores from different executions of the duet pipeline (e.g. from processing different sequencing runs) as an input
The D(h)MR calling workflow requires a sample sheet listing the samples to be used for D(h)MR calling, the condition associated with each sample, and information about covariates that should be taken into account
Samples are grouped by condition and DMR calling is performed between all pairwise combinations of conditions
A typical example might be to call DMRs between a set of cancer samples and a set of healthy samples, accounting for sex and age as covariates
With duet +modC, DMRs are called based on modC (modC is the union of mC and hmC modifications)
With duet evoC, mC is distinguishable from hmC, and DMR calling is performed separately for mC modifications and for hmC modifications. It is also possible to call DMRs based on modC - i.e. on the union of mC and hmC calls
Running the D(h)MR calling workflow#
biomodal call_dmr Run D(h)MR analysis
--input-path <input_path> Required: full custom bucket uri or directory path with Zarr stores to be analysed
--dmr-sample-sheet <sample_sheet_name> Required: full path and name of the DMR sample sheet
--output-path <output_path> Required: custom output disk location or bucket uri
--tag <tag> Required: custom tag to identify this DMR run
--mode <5bp|6bp> Required: 5 base vs 6 base mode, default is '5bp'
--condition <condition(s)> Required: a single DMR sample sheet column that contains the conditions between which to call DMRs
--covariates <covariates> Optional: one or more DMR sample sheet columns that contain covariates to account for during DMR calling
--dmr-bed-path <dmr_bed_path> Optional: a path to a bed file defining regions that DMR calling should be restricted to
--evoc-modifications <mc|hmc|modc> Optional: single or comma separated list of duet evoC modifications to call, 'mc' or 'hmc' in 6bp mode only
--min-depth <min_depth> Optional: contexts will only be removed if coverage <= min-depth in ALL SAMPLES, default is '0'
--window-size <window_size> Optional: window size for DMR analysis, default is '1000'
--additional-params <params> Optional: additional single or comma separated parameters, e.g. 'param1=value1,param2=value2'
--run-name <run_name> Optional: custom run name to identify this analysis run
--resume Optional: resume previous run using successfully completed tasks, default is 'false'
--work-dir Optional: override the default path where the workflow temporary data is stored
Small example:
biomodal call_dmr \
--input-path gs://my-bucket-name/zarr_files/ \
--output-path gs://my-bucket-name/zarr_files/output \
--dmr-sample-sheet gs://my-bucket-name/zarr_files/dmr_sample_sheet.tsv \
--dmr-bed-path gs://my-bucket-name/zarr_files/my_bedfile.bed \
--mode 5bp \
--condition disease_state \
--tag dmr_run1 \
--run-name dmr_run1
D(h)MR calling workflow input and output file requirements#
D(h)MR calling workflow input file and parameters#
Path to directory containing your Zarr file(s)
This should be a path to a directory containing the multi-sample Zarr store generated by running the duet pipeline, i.e. the sample_outputs/zarr_store/ subdirectory of the duet pipeline outputs.
This can also be a directory where you have brought together several multi-sample Zarr stores generated from running the duet pipeline (for instance, if you have separately processed multiple sequencing runs).
The Zarr store is a multi-sample, multi-dimensional, compressed, indexed data store.
Single type of multi-sample Zarr store for each workflow run
Please ensure that the input path includes only Zarr stores of a single cytosine context (CG, CHG, or CHH) before running the call_dmr workflow. Separate Zarr stores of different cytosine contexts into distinct folders or paths as needed.
DMR sample sheet
The DMR sample sheet should be a tsv file with a header, and with one row per sample that you want included in the DMR calling
The columns should include a sample_id column, a column containing the conditions between which you wish to call DMRs, and (optionally) columns for any covariates associated with your samples. DMR calling will be performed between each pairwise comparison of conditions.
Please note that group cannot be used as a column name as this is a reserved word.
Here is an example of a sample sheet that would be suitable for calling DMRs between healthy and cancer samples, accounting for sex and age as covariates:
sample_id disease_state sex age
BIO-A1 healthy M 62
BIO-A2 healthy M 58
BIO-A3 healthy F 66
BIO-A4 cancer_stage1 F 66
BIO-A5 cancer_stage1 M 65
BIO-A6 cancer_stage1 M 55
BIO-A7 cancer_stage4 F 61
BIO-A8 cancer_stage4 F 59
Using the above example, with the parameters --condition disease_state --covariates sex,age, DMR calling would be performed sequentially for “healthy vs cancer_stage1”, “healthy vs cancer_stage4”, and “cancer_stage1 vs cancer_stage4”, using “sex” and “age” as covariates. A corresponding launch command is sketched below.
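A minimal sketch of such a command (bucket paths, tag, and run name are placeholders; 6bp mode assumes duet evoC data):
biomodal call_dmr \
--input-path gs://my-bucket-name/zarr_files/ \
--output-path gs://my-bucket-name/dmr_output \
--dmr-sample-sheet gs://my-bucket-name/dmr_sample_sheet.tsv \
--mode 6bp \
--condition disease_state \
--covariates sex,age \
--tag dmr_covariates_run1 \
--run-name dmr_covariates_run1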
DMR bed path
Should be a path to a BED file containing regions that DMR calling should be performed on. DMR calling will be performed between each condition for each region listed in the BED file; for instance, you may wish to restrict DMR calling to a set of promoters that you are interested in. Note that --dmr-bed-path cannot be used in conjunction with --window-size.
BED format is zero-based and half-open, meaning the first position on a contig is 0 and the End coordinate is excluded from the interval.
Note
Some methylation quantification outputs from the duet pipeline (e.g. cxreport) have 1-based positions.
chr1 1000171 1001137
chr1 1019119 1020119
chr1 1172879 1173879
chr1 1232030 1232236
chr1 1279435 1280435
chr1 10031831 10032831
chr1 10209569 10210569
chr1 10397591 10398591
If you wish to perform Differentially Methylated Cytosine (DMC) analysis, i.e. on single-CpG regions, you will need to pass a BED in which the regions are single CpGs.
Note
The inputted DMR BED cannot contain more than 5 million intervals
chr1 10031845 10031846
chr1 10031846 10031847
chr1 10031915 10031916
chr1 10031916 10031917
chr1 10031917 10031918
chr1 10031918 10031919
chr1 10031925 10031926
chr1 10031926 10031927
chr1 10031947 10031948
chr1 10031948 10031949
chr1 10031949 10031950
chr1 10031950 10031951
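If your single-CpG positions come from a 1-based source (such as a cxreport), they need converting to zero-based, half-open intervals first. A minimal sketch, assuming a two-column tab-separated input of contig and 1-based position:
# chrom<TAB>pos (1-based)  ->  chrom<TAB>pos-1<TAB>pos (zero-based, half-open)
awk 'BEGIN{OFS="\t"} {print $1, $2-1, $2}' cpg_positions.tsv > single_cpg_regions.bed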
Window size
If a dmr-bed-path is not supplied, DMRs will be called genome-wide. The genome will be split into window-size bp (default = 1000) regions to perform DMR analysis on.
Note
If a DMR BED is supplied, window-size will be ignored.
Reducing the window size for genome-wide DMR calling increases the runtime and memory requirement. The minimum supported window size is 200bp.
evoC modifications
In 5bp mode (duet +modC) DMRs can only be called on modC. In 6bp mode (duet evoC) DMRs will be called on mC and hmC separately by default. However, it is possible to call DMRs on any combination of mC, hmC or combined modC
To call hmC DMRs only at enhancer regions use
--dmr-bed-path full-path/to-your/enhancer-intervals.bed --evoc-modifications hmc
To call mC DMRs only at promoter regions use
--dmr-bed-path full-path/to-your/promoter-intervals.bed --evoc-modifications mc
To call mC, hmC, and combined modC DMRs on 500bp regions genome-wide use
--window-size 500 --evoc-modifications mc,hmc,modc
Min Depth
If --min-depth
is supplied, CG contexts will be removed only
if coverage <= min-depth
in ALL SAMPLES. Note: this is
stranded CG coverage
D(h)MR calling workflow output file formats and locations#
D(h)MR output BED
Field | Description
---|---
| Intervals over which DMR is called
| Start position of the interval
| End position of the interval
| Reserved for future use
| Reserved for future use
| Reserved for future use
| Number of CpG contexts in the interval
| Mean CpG coverage across all samples
| Mean modification fraction in group 1
| Mean modification fraction in group 2
| Fold change between group 1 and group 2
| Difference between the group 1 and group 2 mean modification fractions
| Difference between the log likelihood of the full model (with group specified) and the null model (without group specified)
| Unadjusted p-value
| Adjusted p-value (Benjamini-Hochberg)
If the above DMR sample sheet was analysed in 6bp mode, we’d expect the
following outputs to be generated in the --output-path
directory:
DMR_num_mc_healthy_cancer_stage1.bed
DMR_num_mc_healthy_cancer_stage4.bed
DMR_num_mc_cancer_stage1_cancer_stage4.bed
DMR_num_hmc_healthy_cancer_stage1.bed
DMR_num_hmc_healthy_cancer_stage4.bed
DMR_num_hmc_cancer_stage1_cancer_stage4.bed
D(h)MR output summary
An additional summary.txt
file is generated per output BED
file, which lists metadata about the DMR run, including the version of the
biomodal software package used, the input files, and the full commands
used. It also reports the number of DMRs which pass a q < 0.05
significance threshold.
Appendix#
Additional available workflows (advanced users only)#
The nf-core community has several publicly available Nextflow pipelines. The nf-core pipelines are designed to be easy to use and are well documented. You can find a list of the nf-core pipelines at https://nf-co.re/pipelines.
nf-core pipelines may have a similar technical setup as your pipeline execution environment, so it is worth checking if there are relevant Nextflow pipeline settings and parameters you can apply to your Nextflow environment.
Glossary of terms#
Apptainer#
An open-source container platform designed for high-performance computing (HPC) environments, enabling users to package entire scientific workflows including software, libraries, and data, into a portable format.
BAM file (Binary Alignment Map)#
A binary version of a SAM file that stores aligned sequencing reads. Commonly used for efficient storage and analysis of sequence alignment data.
BaseSpace#
A cloud-based platform developed by Illumina for storing, analysing, and sharing genomic data generated by Illumina sequencers.
Command Line Interface (CLI)#
A text-based interface used to interact with software and operating systems by typing commands.
Conda#
An open-source package and environment management system that helps install and manage software dependencies, particularly in Python and R-based projects.
CPU (Central Processing Unit)#
The part of a computer that performs most of the processing. In bioinformatics, CPU power is important for running compute-intensive tasks.
CRAM file#
A compressed version of a BAM file that reduces file size by using reference-based compression. Used for efficient storage of sequencing alignment data.
D(h)MR (Differentially [hydroxy]Methylated Region)#
Genomic regions that show differences in DNA methylation or hydroxymethylation levels between samples or conditions.
Docker#
A platform that uses container technology to package software and its dependencies into isolated environments, enabling reproducible execution across systems.
DISK#
Refers to the amount of disk storage space required or allocated for computational tasks, such as storing large sequencing files.
FASTA file#
A text-based format for representing nucleotide or protein sequences. Each entry starts with a header line followed by the sequence.
FASTQ file#
A text format that stores both biological sequences (usually from next-generation sequencing) and corresponding quality scores.
GCP (Google Cloud Platform)#
A suite of cloud computing services by Google, often used for running scalable bioinformatics analyses and managing data.
High-Performance Computing (HPC)#
A computing environment that uses clusters of powerful computers to perform large-scale or resource-intensive computations.
JAVA#
A widely used programming language and computing platform often required by tools in bioinformatics pipelines, such as GATK.
Linux#
An open-source operating system commonly used in bioinformatics and server environments due to its stability and flexibility.
Nextflow#
A workflow management system designed for reproducible and scalable data analysis pipelines. Supports execution on local, cluster, and cloud environments.
nf-core pipelines#
Community-curated, standardised, and reproducible Nextflow pipelines for a wide range of bioinformatics analyses.
Orchestration VM#
A virtual machine responsible for managing or coordinating the execution of workflows and resources, especially in cloud-based analyses.
RAM (Random Access Memory)#
A type of computer memory that stores working data and code. Larger RAM allows more data to be processed simultaneously, which is critical for many bioinformatics tasks.
Screen#
A terminal multiplexer that allows multiple command-line sessions to be managed within a single terminal window. Useful for long-running processes.
Singularity#
The predecessor to Apptainer, Singularity is a container technology optimised for HPC and scientific computing environments.
Terraform#
An infrastructure-as-code tool that automates the provisioning and management of cloud infrastructure.
tmux (Terminal Multiplexer)#
Like Screen, tmux lets users run multiple terminal sessions in one window, detach and reattach to them, and manage long-running jobs.
TSV file (Tab-Separated Values)#
A plain text format for storing data in a tabular form, where each field is separated by a tab character.
VCF file (Variant Call Format)#
A standardised text file format for storing gene sequence variations, such as SNPs and indels, detected in sequencing data.
Virtual Machine (VM)#
A software-based emulation of a physical computer that runs an operating system and applications in an isolated environment.
Wall-time#
The actual elapsed time a task takes to complete from start to finish, as perceived by the user, regardless of CPU time used.
Zarr file#
A format for storing large N-dimensional arrays in a compressed, chunked manner. Useful in genomics for storing large datasets like 3D imaging or genomics matrices.