Core Workflow#

The modality XPLR Core Workflow is a simple-to-use analysis tool that leverages bash scripting to automate key modality XPLR commands on biomodal data contained in the Zarr store. The Core Workflow is well-suited for exploration of pilot study data, or for new users who want rapid insights from their data through a standardised, low-code approach.

For experienced users who want a deep dive into the Core Workflow script, proceed directly to the Advanced analysis workflow documentation.

This section describes how to set up and run the modality XPLR Core Workflow. This workflow is designed for human, biomodal 6-base datasets (including 5mC, 5hmC, and optionally modC), that are aligned to genome reference hg38. The workflow is pre-configured to analyse gene bodies and promoter regions only. See the Pre-prepared BED files section in the main documentation for more information on how the region files were generated.

  • Gene bodies: gencode.v44.human.genes.annotation.v1.1.0.bed.gz

  • Promoter regions: gencode.v44.human.promoters.annotation.v1.1.0.bed.gz

The Core Workflow performs the following analyses and outputs:

Biological QC

  1. biological-qc to summarise the dataset and sample information, with Pearson correlation and PCA plots for genome-wide 5mC, 5hmC, and optionally modC data. See Biological QC for more information.

  2. get mean to generate a .tsv.gz data file and support violin and correlation plots, showing the mean CpG coverage in gene bodies and promoter regions. Context depth filter is not applied to this command. See Extracting methylation statistics for more information on this and other get commands.

  3. get regional-frac to generate a .tsv.gz data file and support violin and correlation plots, showing the fraction of 5mC, 5hmC, and optionally modC (over total_c) in gene bodies and promoter regions, with an optional minimum depth filter applied.

  4. get count to generate a .tsv.gz data file containing the total number of contexts (total_c) in each gene body and promoter region. Context depth filter is not applied to this command, as it is not applicable to the count command.

Biological insight

  1. dmr call to identify differentially methylated regions (DMRs) in gene bodies and promoter regions, for 5mC, 5hmC, and optionally modC contexts, across all specified groups with an optional context depth filter applied. By default, the command generates two sets of results: one with overdispersion correction (_with_od_*.bed) and one without (_without_od_*.bed), stored in a single timestamped DMR_ directory for easy comparison. This command generates separate .bed files for each methylation context and group pair. See Calling differentially methylated regions (DMRs) for more information.

  2. The DMR results can then be visualised in modality XPLR Viewer (beta), including volcano plots, p-value distribution histograms and Manhattan plots. The Viewer (beta) allows DMR filtering by significance and export of the filtered results.

  3. The Viewer (beta) also supports generation of track plots for individual DMRs, showing the methylation signal across samples in the region. Custom track plots can be configured further using the .ini file generated with each DMR result, via the CLI command modality tracks.

The Core Workflow is well-suited for studies involving:

  • Disease vs control comparisons

  • Treatment vs untreated samples

  • Time-course experiments

  • Multi-group studies with covariates

Video: Running the modality XPLR Core Workflow#

Requirements and prerequisites#

Software requirements:

  • modality XPLR (version 1.1.0 or later is recommended, see Installation of modality XPLR)

  • Bash shell (version 4.0 or later)

  • Standard Unix utilities (awk, sed, find, etc.)

Data input and format requirements:

  • Zarr store(s): Generated by the upstream biomodal duet software, containing methylation data. Zarr stores must be readable by modality XPLR (generated by biomodal duet software v1.2 or later).

  • Sample metadata: .csv or .tsv file with sample information, group names and optional covariates (also referred to as the sample sheet). The metadata file (Sample sheet) must be formatted as described in the main documentation.

Core Workflow configuration#

For ease-of-use, the Core Workflow can be configured and downloaded using the Interactive Core Workflow generator tool below. This generator allows you to specify input parameters, and then download a complete .zip folder containing the workflow script, configuration file, and pre-prepared gene and promoter regions files.

Core Workflow folder contents:

modality-core-workflow_YYYYMMDD_HHMMSS/
├── core-workflow-v2.0.sh                              # The Core Workflow script
├── config.txt                                         # Configuration file for modality XPLR commands
├── Regions/                                           # Directory containing pre-prepared region files
│   ├── gencode.v44.human.genes.annotation.v1.1.0.bed.gz
│   └── gencode.v44.human.promoters.annotation.v1.1.0.bed.gz
├── README.txt                                         # Usage instructions
└── Output/                                            # Output directory(created during execution)

Interactive Core Workflow generator#

🔧 Core Workflow generator

Generate the modality XPLR Core Workflow .zip folder with a standardised directory structure. The folder name is timestamped upon creation and includes the workflow script, configuration file, and pre-prepared genomic region files. The displayed form values below are examples only, and must be replaced in order to populate the config.txt. After generating the .zip folder, a preview panel will appear. Check that your settings have been written correctly. * denotes required fields

Comma-separated paths, if multiple zarr stores are specified
Path to your sample metadata CSV or TSV file
Column name in your metadata CSV or TSV file for sample grouping
Space-separated order of conditions (control group first)
Space-separated covariate column names from your metadata
Select a region to limit the Biological QC to a specific region e.g. "chr" or "chr:<start>-<end>". This can help to speed up analysis and reduce memory usage.
Minimum context depth for regional-frac and DMR analysis e.g. 10. Leave blank to disable depth filtering.
Include modC modification type in biological-qc, feature extraction, and DMR calling. Default is true.
Overdispersion correction: "true" (with correction), "false" (without correction), or "both" (produces two result files, with and without correction). Core Workflow default is "both".

Running the Core Workflow#

To run the Core Workflow, follow these steps:

  1. Download the Core Workflow .zip folder to your working location and extract it.

  2. In the terminal, run the following lines.

# Change to the Core Workflow directory
cd /path/to/modality-core-workflow_YYYYMMDD_HHMMSS

# Make the script executable
chmod +x core-workflow-v2.0.sh

# Run the Core Workflow script
./core-workflow-v2.0.sh
  1. Monitor the output logs for progress and any potential errors.

Advanced Core Workflow usage#

The config.txt file contains all the parameters for running the Core Workflow script. You should not need to modify this file, as it is pre-configured by the Interactive Core Workflow generator. Advanced users can modify the config.txt file to change parameters such as the Zarr store path, metadata file, group column, covariates, and minimum depth filter. In addition, the Output directory path can be changed to a different location if required. This may be useful if you want to re-run the analysis with different parameters (e.g. context depth filter, or covariates), without needing to download the Core Workflow folder again.

It is also possible to provide additional region files in the Regions/ directory, or to change the region files used by the Core Workflow. Refer to the main documentation for the required format of the Region files. Apart from the Biological QC HTML report (which operates on Zarr store(s)), the Core Workflow will perform all analysis steps on each region file found in the Regions/ directory.

For advanced users who are familiar with bash scripting, a description of the Core Workflow code is provided in the Advanced analysis workflow section, which can be used to customise the workflow further.

Interpreting Core Workflow outputs#

The Core Workflow generates a comprehensive set of outputs organised into structured directories. Understanding this organisation is crucial for interpreting results and sharing findings. Each analysis produces timestamped outputs and provenance metadata for traceability and reproducibility of the results.

The workflow organises outputs with BioQC_ (timestamped biological QC results) directly under the main output directory, and region-specific subdirectories (one per region file) that each contain both Extract_ (feature extraction results) and DMR_ (DMR calling results) timestamped subdirectories. When overdispersion mode is set to both (default), each DMR analysis produces two BED files with _with_od_* and _without_od_* suffixes (followed by timestamps) for easy comparison.

Video: Reviewing Core Workflow results#

Output directory structure#

A modality_cli.log file is generated in the working directory and contains logging information relating to the modality XPLR CLI commands only (not the Core Workflow bash outputs). This file is generated automatically by modality XPLR and is not directly controlled by the Core Workflow script. Other outputs are described below.

Output/
├── BioQC_YYYYMMDD_HHMMSS/                            # Biological QC results directory
│   ├── biological_qc_report_*.html                   # Biological QC HTML report
│   └── modality_metadata_YYYYMMD_HHMMSS.json         # Provenance metadata file
├── <region_file_name_1>/                             # Results for first region bedfile (e.g. genes)
│   ├── Extract_YYYYMMDD_HHMMSS/                      # Feature extraction results directory
│   │   ├── Extract_*.tsv.gz                          # Feature extraction results file (compressed)
│   │   ├── Extract_*.html                            # Violin and Correlation plots (mean and regional-frac operations only)
│   │   └── modality_metadata_YYYYMMD_HHMMSS.json     # Provenance metadata file
│   └── DMR_YYYYMMDD_HHMMSS/                          # DMR analysis results directory
│       ├── DMR_*.bed                                 # DMR result files (see note below)
│       ├── DMR_*.ini                                 # Configuration files for making track plots with the 'tracks' CLI command
│       └── modality_metadata_YYYYMMD_HHMMSS.json     # Provenance metadata file
├── <region_file_name_2>/                             # Results for second region bedfile (e.g. promoters)
│   ├── Extract_YYYYMMDD_HHMMSS/
│   └── DMR_YYYYMMDD_HHMMSS/
└── <region_file_name_N>/                             # Results for Nth region bedfile
    ├── Extract_YYYYMMDD_HHMMSS/
    └── DMR_YYYYMMDD_HHMMSS/

Note

DMR filename conventions vary by overdispersion correction setting:

  • --overdispersion both (Core Workflow default): Produces two files per analysis with patterns _with_od_*.bed (with correction) and _without_od_*.bed (without correction), where * represents the timestamp

  • --overdispersion true or --overdispersion false: Produces one file per analysis named DMR_*.bed (no suffix). The overdispersion setting is recorded in the provenance metadata JSON file.

Reviewing results#

  1. Initiate modality XPLR Viewer (beta)

    • Start a Viewer session by invoking the modality viewer command in the terminal.

    • In the command, specify the root directory (-r) as the main output directory of the Core Workflow.

    • Navigate to the indicated localhost URL in your web browser to access the Viewer interface.

  2. Biological QC

    • Click on the tab for BioQC Results and open the analysis card for the Biological QC report.

    • Check sample clustering in the genome-wide PCA analysis. Use the colour-based grouping options directly on the report page to visualise sample relationships. Groups are defined by the metadata (sample sheet) file.

    • Verify expected grouping patterns. Does the data make biological sense for your experimental design?

    • Identify any outlier samples.

    • Assess overall sample mean CpG coverage in the Sample Information table. You can optionally export the metadata table including the per-sample CpG strand-merged coverage information.

    • Click on the tab for Extract Results, and filter to the region-specific analysis cards containing the modification types of interest.

    • Review the region-specific feature extraction plots (violin plots and correlation heatmaps) for mean CpG coverage, and 5mC and 5hmC fractions (and modC if included in the analysis). These visualisations provide insights into feature distributions and sample correlations.

  3. Biological insights

    • Click on the tab for DMR Results, then filter to the region-specific analysis cards containing the group comparisons and modification types of interest.

    • By default, the Core Workflow runs with overdispersion mode set to both, which produces two BED files for each analysis: one with overdispersion correction (_with_od_*.bed) and one without (_without_od_*.bed). You can toggle between these results on the analysis card in the Viewer. The overdispersion correction is designed to improve p-value distributions and DMR detection in datasets with high variability or small sample sizes. We recommend reviewing results with overdispersion correction first, then comparing to results without correction. Be sure to check the p-value distributions to determine which setting is most appropriate for your data.

    • The DMR comparison report shows all selected analyses on one page. To view a detailed DMR Report (including p-value distribution and Manhattan plot) for a single analysis, click the report icon above the respective volcano plot.

    • Apply significance filters and plot configurations via the available settings. Export images and filtered DMR result files as needed.

    • Note the region annotations in the DMR results table, which can be used to gain further biological insights into the DMRs.

    • Select a single DMR to see tracks plot options. Alternatively use the accompanying .ini files to generate custom tracks plots using the modality tracks CLI command.

  1. Validate findings

    • Check for consistency with known biology

    • Consider experimental design and potential confounders

    • Compare overdispersion-corrected vs uncorrected results (both provided when using --overdispersion both)

    • Use track plots to validate DMRs in surrounding context.

    • Plan follow-up validation experiments or to increase sample size if necessary

Common result patterns and interpretation#

Healthy DMR patterns:

  • Detection of significant DMRs between experimental groups

  • Consistent p-value distributions (uniform with some enrichment near zero)

  • Biologically meaningful effect sizes

  • Concordant patterns between 5mC and 5hmC where expected

Potential issues to watch for:

  • Poor group separation: May indicate experimental design issues

  • Skewed p-value distributions: May benefit from Overdispersion correction or other statistical adjustments

  • Very large effect sizes: May indicate technical artifacts

  • Inconsistent patterns: Could suggest batch effects or confounders

Biological interpretation guidelines:

  • Promoter hypermethylation: Often associated with deactivated regions

  • 5hmC enrichment: Often associated with active demethylation (transition to active regions), or a marker of active gene bodies.

  • Concordant changes: 5mC loss with 5hmC gain may suggest active demethylation

Advanced analysis workflow#

This section provides a more detailed description of the Core Workflow bash script, which may be beneficial to users who want to modify the script directly or build their own workflows.

The direct download link to the Core Workflow script is provided below, which provides the .sh file only.

Download: core-workflow-v2.0.sh - Core Workflow script assembled from all code sections below.

Download a test dataset#

You can download a test folder containing mouse ES-E14 data to run the Core Workflow. The outputs are not intended to reveal biological insight, as the samples are technical replicates. However, it can be used to test that the script runs to completion.

The E14-example test folder contains:

  • A sample metadata file E14_chromosome_19.tsv

  • A zarr store E14_chromosome_19.zarrz containing methylation data

  • A directory Regions containing gencode.vM25.mouse.promoters.annotation.bed

  • A config.txt file with example parameters

Download: E14-example.zip

Note

The E14-example.zip does not contain the core-workflow-v2.0.sh script. This will need to be downloaded separately and run from within the E14-example directory after extraction.

Advanced configuration file setup#

The Core Workflow uses a simple config.txt file for all settings. This design separates configuration from code, making it easy to run the same pipeline with different datasets or parameters. If you are new to modality XPLR or have not yet run the Core Workflow, we suggest using the Interactive Core Workflow generator to get started.

Configuration file:

This file must be present in your working directory before running the pipeline, with the name config.txt. Alternatively use the Advanced interactive configuration file generator below to create this file with additional options. The aim of the config.txt is to pre-define all parameters required by the workflow script. If you modify the Core Workflow script to include additional options or settings, these can be added to the config.txt as variables. For the Core Workflow, the config.txt file should contain the following parameters (parameter values are examples):

Zarr=/path/to/data.zarrz
Metadata=/path/to/metadata.csv
Group_Column=label
Condition_Order=condition1,condition2
Covariates=age,batch,sex
QC_Region=chr1
Regions_Directory=/path/to/regions
Main_Output=/path/to/output_dir
Depth_Filter=5
Include_modC=true
Overdispersion=both

Configuration parameters explained:

  • Zarr: Path to zarr store(s). Use space separation for multiple paths

  • Metadata: Path to .csv or .tsv sample sheet file containing sample metadata

  • Group_Column: Column name in the metadata file, for grouping samples (e.g., “treatment” or “condition”)

  • Covariates: Space-separated list of covariate columns (optional) (e.g., “age batch sex race”)

  • QC_Region: Region specification for Biological QC analysis (optional) (e.g., “chr1” or “chr1:1000-2000”)

  • Regions_directory: Directory containing BED/TSV files in BED3+ format, with genomic regions and annotations. See Region files for details

  • Main_Output: Base directory where all results will be written

  • Depth_Filter: Minimum depth filter for analyses (leave it blank to disable filtering)

  • Include_modC: Include modC modification type in analyses (“true” or “false”, optional, default: true)

  • Overdispersion: Overdispersion mode for DMR analysis: “true”, “false”, or “both” (optional, Core Workflow default: both). When set to “both”, produces two BED files per analysis—one with and one without overdispersion correction

For a description of Sample sheet and Region files formats, see the linked modality XPLR documentation sections.

The config.txt can be generated using the Advanced interactive configuration file generator below or created manually in a text editor.

Advanced interactive configuration file generator#

🔧 Advanced Configuration File Generator

Use this form to generate and download your the config.txt file. A download link for the core-workflow-v2.0.sh is also provided. The displayed values are indications only, and must be replaced in order to populate the config.txt. After generating the file, a preview panel will appear. Check that your settings have been written correctly.

Comma-separated paths if multiple zarr stores
Column name in the metadata (sample sheet) file to be used for retrieving sample groups
Space-separated order of conditions (control group first)
Space-separated list of covariate column names, in the metadata (sample sheet) file
Select a region to limit the Biological QC to a specific region e.g. "chr" or "chr:<start>-<end>". This can help to speed up analysis and reduce memory usage.
Directory containing pre-prepared BED/TSV region definition files (e.g. promoters, genes etc.)
Main output directory, in which all results and reports will be saved.
Minimum context depth for regional-frac and DMR analysis e.g. 10. Leave blank to disable depth filtering.
Include modC modification type in biological-qc, feature extraction, and DMR calling. Default is true.
Overdispersion correction: "true" (with correction), "false" (without correction), or "both" (produces two result files, with and without correction). Core Workflow default is "both".

Core Workflow overview: step-by-step#

The Core Workflow script is organised into distinct sections that can be modified independently:

  1. Setup and Initialisation: Script configuration and logging setup

  2. Input Validation: Comprehensive checks of all inputs before execution

  3. Analysis Steps: Three sequential analysis stages, designed to deliver Biological QC and Biological Insight analyses.

Script setup and configuration#

The script begins with essential setup including logging configuration and error handling.

Note

This code block includes instructions for enabling logging to a file, which may be useful for debugging and tracking script execution. To enable, uncomment (remove #) from the code lines as directed.

#!/usr/bin/env bash

###############################################################################
# modality XPLR Core Workflow Pipeline
#
# This script runs a comprehensive analysis pipeline including:
# - Biological QC
# - Feature extraction
# - DMR calling
#
# Input requirements:
# - config.txt file in current directory with required parameters
# - Zarr store(s) containing methylation data
# - Sample metadata CSV or TSV file
# - Directory with region BED/TSV files
#
# This script is intended for running on modality XPLR version 1.1.0 or later.
# Results are designed to support visualisations in modality Viewer (beta).
#
###############################################################################

#------------------------------------------------------------------------------
# CONFIGURATION
#------------------------------------------------------------------------------
# Set script behaviour
set -euo pipefail

# Record start time for performance tracking
start_time=$(date +%s)

# Log file setup - uncomment to enable logging to file
# LOG_FILE="./modality_pipeline_$(date +%Y%m%d_%H%M%S).log"
# exec > >(tee -a "$LOG_FILE") 2>&1

Helper functions#

These logging functions provide consistent feedback throughout the pipeline execution.

#------------------------------------------------------------------------------
# HELPER FUNCTIONS
#------------------------------------------------------------------------------
# Logging functions
log_info() {
  echo "[INFO] $(date '+%Y-%m-%d %H:%M:%S') - $1"
}

log_warning() {
  echo "[WARNING] $(date '+%Y-%m-%d %H:%M:%S') - $1" >&2
}

log_error() {
  echo "[ERROR] $(date '+%Y-%m-%d %H:%M:%S') - $1" >&2
}

log_debug() {
  if [[ "${DEBUG:-false}" == "true" ]]; then
    echo "[DEBUG] $(date '+%Y-%m-%d %H:%M:%S') - $1"
  fi
}

Utility functions#

These utility functions handle common tasks like directory validation and report collection.

 # Helper to trim whitespace
trim() {
  awk '{$1=$1};1'
}

# Function to check if a header exists in metadata
header_exists() {
  local col="$1"
  local headers=("${@:2}")

  for h in "${headers[@]}"; do
    [[ "$h" == "$col" ]] && return 0
  done
  return 1
}

# Function to create directories with validation
create_directory() {
  local dir="$1"

  if [[ ! -d "$dir" ]]; then
    mkdir -p "$dir" 2>/dev/null || { log_error "Cannot create directory $dir"; exit 1; }
  fi
  if [[ ! -w "$dir" ]]; then
    log_error "No write access to directory $dir"
    exit 1
  fi
  log_info "Directory ready: $dir"
}

Initialisation and configuration loading#

This section initialises the script and loads the configuration file. If you want to change the name of the input configuration file from the default config.txt, you can do so by changing the CONFIG_FILE variable in the script. Similarly, if you want to control the verbosity of workflow logging, you can do so by changing the DEBUG variable in the script to true.

#------------------------------------------------------------------------------
# INITIALISATION
#------------------------------------------------------------------------------
printf "\n=======================================\n"
printf "STEP 0 - CONFIGURATION & INITIALISATION\n"
printf "=======================================\n"
log_info "Starting modality XPLR Core Workflow Pipeline"

# Default values for variables that could be customised
CONFIG_FILE="./config.txt"
DEBUG=false  # Set to true to enable debug output

The next part will read the configuration file and set values for all required parameters. If you wish to change existing parameters values, you can do so by editing the configuration (config.txt) file in your working directory. If you want to add new variables to be used in downstream modality XPLR commands, ensure they are defined here and then prompted to assign a value when reading in the configuration file. For any variables that should be hard-coded, the value can be set directly in the script and excluded from the configuration file.

#------------------------------------------------------------------------------
# LOAD CONFIGURATION
#------------------------------------------------------------------------------
log_info "Reading configuration from $CONFIG_FILE"

# Validate config file exists
if [[ ! -f "$CONFIG_FILE" ]]; then
  log_error "config.txt not found in the current working directory: $(pwd)"
  log_error "Please ensure config.txt is present before running this script."
  exit 1
fi

# Read config.txt variables
ZARR=""                 # Path(s) to zarr store(s), comma-separated
METADATA=""             # Path to sample metadata CSV or TSV file
GROUP_COLUMN=""         # Column name in metadata for grouping samples
COVARIATES=""           # Space-separated list of covariate column names (optional)
CONDITION_ORDER=""      # Comma-separated order of conditions for DMR analysis (optional)
QC_REGION=""            # Region for Biological QC analysis (optional)
REGIONS_DIR=""          # Directory containing region BED/TSV files
MAIN_OUTPUT=""          # Base output directory
DEPTH_FILTER=""         # Minimum depth filter for analyses, e.g. 10 (optional)
INCLUDE_MODC=""         # Include modC in analyses: "true" or "false" (optional, default: true)
OVERDISPERSION=""       # Overdispersion mode: "true", "false", or "both" (optional, Core Workflow default: both)

# Parse config file
while IFS='=' read -r key value || [[ -n "$key" ]]; do
  key=$(echo "$key" | trim)
  value=$(echo "$value" | trim)
  case "$key" in
    Zarr) ZARR="$value" ;;
    Metadata) METADATA="$value" ;;
    Group_Column) GROUP_COLUMN="$value" ;;
    Covariates) COVARIATES="$value" ;;
    Condition_Order) CONDITION_ORDER="$value" ;;
    QC_Region) QC_REGION="$value" ;;
    Regions_Directory) REGIONS_DIR="$value" ;;
    Main_Output) MAIN_OUTPUT="$value" ;;
    Depth_Filter) DEPTH_FILTER="$value" ;;
    Include_modC) INCLUDE_MODC="$value" ;;
    Overdispersion) OVERDISPERSION="$value" ;;
    # You could add additional parameters here as needed
  esac
done < "$CONFIG_FILE"

# Log parsed configuration
printf "\n\n>>Provided inputs\n"
log_info "ZARR path(s): $ZARR"
log_info "Metadata file: $METADATA"
log_info "Group column: $GROUP_COLUMN"
log_info "Covariates: $COVARIATES"
log_info "Condition order: $CONDITION_ORDER"
log_info "QC region: $QC_REGION"
log_info "Regions directory: $REGIONS_DIR"
log_info "Main output directory: $MAIN_OUTPUT"
log_info "Depth filter: $DEPTH_FILTER"
log_info "Include modC in analyses: $INCLUDE_MODC"
log_info "Overdispersion mode: $OVERDISPERSION"

Input validation#

This section validates all inputs and ensures they exist and are accessible. Use this section to add any custom validation checks for your specific workflow requirements, file paths, and access permissions, according to the variables defined in the config.txt file.

#------------------------------------------------------------------------------
# VALIDATE INPUTS
#------------------------------------------------------------------------------
printf "\n\n>>Validation & directory creation\n"
log_info "Validating inputs..."

# Parse and validate zarr paths
if [[ -n "$ZARR" ]]; then
  # Split comma-separated zarr paths into array
  IFS=',' read -r -a ZARR_RAW_ARRAY <<< "$ZARR"

  # Process each zarr path to remove quotes and trim whitespace
  ZARR_ARRAY=()
  for zarr_raw in "${ZARR_RAW_ARRAY[@]}"; do
    # Trim leading/trailing whitespace
    zarr_clean=$(echo "$zarr_raw" | sed 's/^[[:space:]]*//;s/[[:space:]]*$//')

    # Remove surrounding single or double quotes if present
    zarr_clean=$(echo "$zarr_clean" | sed 's/^["'"'"']\(.*\)["'"'"']$/\1/')

    # Add to processed array if not empty
    if [[ -n "$zarr_clean" ]]; then
      ZARR_ARRAY+=("$zarr_clean")
    fi
  done

  # Log the number of zarr paths identified and their strings
  log_info "Identified ${#ZARR_ARRAY[@]} zarr path(s):"
  for i in "${!ZARR_ARRAY[@]}"; do
    log_info "  [$((i+1))] ${ZARR_ARRAY[i]}"
  done

  # Validate each zarr path
  for zarr in "${ZARR_ARRAY[@]}"; do
    if [[ ! -s "$zarr" ]]; then
      log_error "Zarr file '$zarr' doesn't contain any data or is not readable"
      exit 1
    fi
  done
else
  log_error "No zarr paths specified in config.txt"
  exit 1
fi

# Process and validate metadata file path
if [[ -n "$METADATA" ]]; then
  # Trim leading/trailing whitespace
  METADATA=$(echo "$METADATA" | sed 's/^[[:space:]]*//;s/[[:space:]]*$//')

  # Remove surrounding single or double quotes if present
  METADATA=$(echo "$METADATA" | sed 's/^["'"'"']\(.*\)["'"'"']$/\1/')

  log_info "Processed metadata file path: $METADATA"
else
  log_error "No metadata file specified in config.txt"
  exit 1
fi

# Process and validate main output directory path
if [[ -n "$MAIN_OUTPUT" ]]; then
  # Trim leading/trailing whitespace
  MAIN_OUTPUT=$(echo "$MAIN_OUTPUT" | sed 's/^[[:space:]]*//;s/[[:space:]]*$//')

  # Remove surrounding single or double quotes if present
  MAIN_OUTPUT=$(echo "$MAIN_OUTPUT" | sed 's/^["'"'"']\(.*\)["'"'"']$/\1/')

  log_info "Processed main output directory: $MAIN_OUTPUT"
else
  log_error "No main output directory specified in config.txt"
  exit 1
fi

# Process and validate regions directory path
if [[ -n "$REGIONS_DIR" ]]; then
  # Trim leading/trailing whitespace
  REGIONS_DIR=$(echo "$REGIONS_DIR" | sed 's/^[[:space:]]*//;s/[[:space:]]*$//')

  # Remove surrounding single or double quotes if present
  REGIONS_DIR=$(echo "$REGIONS_DIR" | sed 's/^["'"'"']\(.*\)["'"'"']$/\1/')

  log_info "Processed regions directory: $REGIONS_DIR"
else
  log_error "No regions directory specified in config.txt"
  exit 1
fi

# Check metadata file
log_info "Validating metadata file headers..."
if [[ ! -r "$METADATA" ]]; then
  log_error "Metadata file not found or not readable: $METADATA"
  exit 1
fi

# Check group column and covariates in metadata
HEADER_LINE=$(head -n 1 "$METADATA")

# Remove any line ending characters (Windows \r\n, Mac \r, Unix \n)
HEADER_LINE=$(echo "$HEADER_LINE" | tr -d '\r\n')

# Detect delimiter: use tab if present, otherwise comma
if [[ "$HEADER_LINE" == *$'\t'* ]]; then
  DELIM=$'\t'
else
  DELIM=','
fi

# Split header into array using the detected delimiter
IFS="$DELIM" read -r -a HEADERS_RAW <<< "$HEADER_LINE"

# Trim whitespace from each header individually
HEADERS=()
for header in "${HEADERS_RAW[@]}"; do
  # Trim leading and trailing whitespace from each header
  header_clean=$(echo "$header" | sed 's/^[[:space:]]*//;s/[[:space:]]*$//')
  if [[ -n "$header_clean" ]]; then
    HEADERS+=("$header_clean")
  fi
done

if [[ -n "$GROUP_COLUMN" ]]; then
  if ! header_exists "$GROUP_COLUMN" "${HEADERS[@]}"; then
    log_error "Group column '$GROUP_COLUMN' not found in metadata file header. Please review the metadata file and provide a valid group column."
    exit 1
  fi
fi

if [[ -n "$COVARIATES" ]]; then
  # Convert space-separated list to array
  read -r -a COVS <<< "$COVARIATES"
  for cov in "${COVS[@]}"; do
    cov=$(echo "$cov" | trim)
    if ! header_exists "$cov" "${HEADERS[@]}"; then
      log_error "Covariate column '$cov' not found in metadata file header"
      exit 1
    fi
  done
fi

log_info "Metadata validation passed"

# Check regions directory
if [[ ! -d "$REGIONS_DIR" ]]; then
  log_error "Regions directory not found: $REGIONS_DIR"
  exit 1
fi

Output directory setup#

This section creates region-specific subdirectories in MAIN_OUTPUT to organise the results.

All results for each modality XPLR CLI operation (including provenance metadata) will be written to timestamped subdirectories (Extract_* and DMR_*) within the corresponding region-specific folder. All modality XPLR outputs are timestamped, however you may wish to modify this section to add additional sub-directories for specific analyses, or to change the output directory structure.

#------------------------------------------------------------------------------
# SETUP OUTPUT DIRECTORIES
#------------------------------------------------------------------------------
log_info "Creating output directories under $MAIN_OUTPUT..."

# Create main output directory
create_directory "$MAIN_OUTPUT"

# Create region-specific subdirectories directly in the main output directory
log_info "Creating region-specific subdirectories..."
for bedfile in "$REGIONS_DIR"/*.bed "$REGIONS_DIR"/*.bed.gz "$REGIONS_DIR"/*.tsv "$REGIONS_DIR"/*.tsv.gz; do
  # Skip if no files match pattern
  [[ -e "$bedfile" ]] || continue
  [[ "${METADATA##*.}" == "tsv" && $(realpath "$bedfile") == $(realpath "$METADATA") ]] && continue

  # Extract region file name without path and extension
  region_name=$(basename "$bedfile")
  region_name="${region_name%%.bed.gz}"
  region_name="${region_name%%.bed}"
  region_name="${region_name%%.tsv.gz}"
  region_name="${region_name%%.tsv}"

  # Create region-specific subdirectory in main output directory
  create_directory "$MAIN_OUTPUT/$region_name"
  log_info "Created region directory for: $region_name"
done

Step 1: Biological QC HTML report#

The first analysis step performs biological quality control to provide Zarr store summaries, and to identify sample relationships.

What this script block does:

  • Generates correlation matrices between samples for 5mC and 5hmC (and modC if included)

  • Creates PCA plots to visualise sample clustering for each methylation context (grouping options are available on the Viewer report page)

  • Calculates mean CpG coverage for each sample

Key parameters:

  • Uses all zarr stores specified in configuration (config.txt)

  • Allows sample metadata to be used for dynamic colour-based grouping, when available

Note

The Core Workflow does not include the --aggregate-by-group option, which could be included to aggregate sample data for technical replicates prior to QC.

#------------------------------------------------------------------------------
# STEP 1: BIOLOGICAL QC
#------------------------------------------------------------------------------
printf "\n\n======================================\n"
printf "STEP 1: RUNNING BIOLOGICAL QC ANALYSES\n"
printf "======================================\n"

# Build region argument if QC_REGION is set
REGION_ARG=""
[[ -n "$QC_REGION" ]] && REGION_ARG="--region $QC_REGION"

# Set default for Include_modC if not specified
[[ -z "$INCLUDE_MODC" ]] && INCLUDE_MODC="true"

# Build methylation contexts based on Include_modC setting
if [[ "$INCLUDE_MODC" == "true" ]]; then
  MODC_CONTEXTS="mc hmc modc"
else
  MODC_CONTEXTS="mc hmc"
fi

# Run QC without aggregation
log_info "Running biological QC without aggregation..."
modality -v biological-qc \
  --zarr-path "${ZARR_ARRAY[@]}" \
  --sample-sheet "$METADATA" \
  $REGION_ARG \
  --methylation-contexts $MODC_CONTEXTS \
  --order-by-group "$GROUP_COLUMN" \
  --output-dir "$MAIN_OUTPUT"

log_info "Biological QC completed"

Expected outputs:

  • biological_qc_report.html: HTML report with sample-wise plots and statistics.

  • Provenance metadata .json files, stored in each Results/BioQC_YYYYMMDD_HHMMSS sub-directory that is generated by each modality XPLR CLI command.

Interpreting QC results:

  • Correlation heatmaps: Correlation coefficients close to 1 indicate high similarity of 5mC and 5hmC (and modC if included) levels between samples

  • PCA plots: Separation and/or clustering between experimental groups indicates the % variability that is explained by the first two principal components, for 5mC and 5hmC (and modC if included).

  • Coverage statistics: The mean CpG coverage per sample is an indicator of the usable depth for methylation analysis. See the next step for feature extraction (region-based) coverage statistics.

  • Outlier detection: Samples that don’t cluster with their expected group may need investigation. Removing a sample from the metadata (sample sheet) file will exclude it from future analysis, without needing to modify the original Zarr file.

Step 2: Region-based Biological QC#

This step used the modality get tool to extract methylation features (statistics) from user-specified genomic regions, for additional quality control and analysis. The regions can be defined in BED3+ format (.bed` or ``.tsv) files, that are accessible in the REGIONS_DIR, defined in config.txt. For the Core Workflow script to run correctly, region files must be present. Pre-prepared BED files are provided to get started, or custom files following the same format can be used instead. Use the correct region files for the reference genome used for alignment in the duet Software.

Tip

The Annotation (e.g. gene, or promoter) and Name (e.g. associated gene name) columns should be completed for meaningful interpretation.

The pipeline performs multiple types of statistics extraction to provide comprehensive regional methylation data. Each statistic type is processed separately by modality XPLR, and the results are stored in timestamped Extract_* subdirectories within each region-specific folder.

Note

Only modality get mean and modality get regional-frac commands support plotting (correlation or violin), which are useful for visualising the distribution of methylation levels across samples. If no HTML report is generated, then the results will only be written as .tsv.gz files within the respective Extract_* timestamped subdirectories.

What this script block does:

  • Count extraction: Returns the total number of CpG contexts (total_c) for each region.

  • Mean calculation: Average coverage of CpG contexts (total_c) for each region, with violin and correlation plots for individual samples.

  • Regional fraction: Methylation fractions for 5mC and 5hmC (and modC if included) at the pre-specified minimum``DEPTH_FILTER`` value, with violin and correlation plots for individual samples.

  • Processes each region file in the REGIONS_DIR independently. Therefore, expect individual results for each region file and extraction type.

#------------------------------------------------------------------------------
# STEP 2: FEATURE EXTRACTION
#------------------------------------------------------------------------------
printf "\n\n==================================================\n"
printf "STEP 2: RUNNING FEATURE EXTRACTION ON REGION FILES\n"
printf "==================================================\n"

# Process each region file for feature extraction
for bedfile in "$REGIONS_DIR"/*.bed "$REGIONS_DIR"/*.bed.gz "$REGIONS_DIR"/*.tsv "$REGIONS_DIR"/*.tsv.gz; do
  # Skip if no files match pattern
  [[ -e "$bedfile" ]] || continue
  [[ "${METADATA##*.}" == "tsv" && $(realpath "$bedfile") == $(realpath "$METADATA") ]] && continue

  log_info "Processing region file: $bedfile"

  # Extract region file name without path and extension
  region_name=$(basename "$bedfile")
  region_name="${region_name%%.bed.gz}"
  region_name="${region_name%%.bed}"
  region_name="${region_name%%.tsv.gz}"
  region_name="${region_name%%.tsv}"

  # Set region-specific output directory
  region_output_dir="$MAIN_OUTPUT/$region_name"

  # 1. Extract count for total_c
  log_info "Running feature extraction: count total_c..."
  modality -v get count \
    --zarr-path "${ZARR_ARRAY[@]}" \
    --sample-sheet "$METADATA" \
    --bedfile "$bedfile" \
    --fields total_c \
    --output-dir "$region_output_dir"

  # 2. Extract mean for total_c with plots generated automatically
  log_info "Running feature extraction: mean total_c (plots generated automatically)..."
  modality -v get mean \
    --zarr-path "${ZARR_ARRAY[@]}" \
    --sample-sheet "$METADATA" \
    --bedfile "$bedfile" \
    --fields total_c \
    --order-by-group "$GROUP_COLUMN" \
    --output-dir "$region_output_dir"

  # 3. Extract regional-frac for mc, hmc, and optionally modc with depth filter (plots generated automatically)
  if [[ "$INCLUDE_MODC" == "true" ]]; then
    log_info "Running feature extraction: regional-frac for mc, hmc, and modc with minimum depth $DEPTH_FILTER (plots generated automatically)..."
    REGIONAL_FRAC_FIELDS="mc hmc modc"
  else
    log_info "Running feature extraction: regional-frac for mc and hmc with minimum depth $DEPTH_FILTER (plots generated automatically)..."
    REGIONAL_FRAC_FIELDS="mc hmc"
  fi
  # Build the optional argument if DEPTH_FILTER is set
  MCV_ARG=""
  [[ -n "$DEPTH_FILTER" ]] && MCV_ARG="--min-coverage $DEPTH_FILTER"
  # --min-coverage 30, if DEPTH_FILTER=30 in config.txt

  modality -v get regional-frac \
    --zarr-path "${ZARR_ARRAY[@]}" \
    --sample-sheet "$METADATA" \
    --bedfile "$bedfile" \
    --fields $REGIONAL_FRAC_FIELDS \
    $MCV_ARG \
    --order-by-group "$GROUP_COLUMN" \
    --output-dir "$region_output_dir"

  log_info "Feature extraction completed for $region_name"
done

log_info "Feature extraction completed for all region files"

Expected outputs:

  • Extract_* directories: Organised by statistic type and region file

  • *.tsv.gz files: Compressed quantitative results for each extracted statistic, by region and sample

  • *.html reports: Interactive violin and correlation plots showing methylation statistics for mean and regional-frac extraction types.

Note

The optional regional dendrogram heatmap is not included in the Core Workflow script by default but can be added to the HTML report by modifying the modality get mean and/or modality get regional-frac command(s) to include --heatmap. Note that the heatmap will only be drawn if there are ≤200 regions in the region file, so it is not appropriate for use with all promoters and gene bodies. Instead, add a custom subset region file to the Regions directory before running the Core Workflow. Alternative tools for generating heatmaps are available in modality Viewer (beta).

Interpreting region-based Biological QC statistics:

  • Region-specific CpG coverage: Should show adequate and consistent coverage across samples and may be used to inform the DEPTH_FILTER value for future analyses.

  • Methylation fraction distributions: Violin and correlation plots reveal methylation heterogeneity.

  • Outlier samples: Samples with unusual distributions may need investigation.

Customising statistics extraction:

You can modify this section of the script to:

  • Add different field types (see Zarr store and Extracting methylation statistics sections in the main modality XPLR documentation).

  • Adjust minimum coverage thresholds for regional fraction extraction.

  • Include additional statistical summaries (count, mean, sum, regional-frac).

  • Add an optional --heatmap parameter to generate regional dendrogram heatmaps for mean and/or regional-frac extraction types with ≤200 regions. Alternative tools for generating heatmaps are available in modality Viewer (beta).

Step 3: DMR calling#

This step identifies differentially methylated regions using statistical modelling, processing each region file individually. DMR calling is the core analytical component that identifies regions with significant methylation differences between two experimental groups.

What this script block does:

  • Processes each .bed and .tsv file in the regions directory separately

  • Performs statistical testing for differential methylation on each region file (see Calling differentially methylated regions (DMRs))

  • Analyses 5mC and 5hmC modifications separately (and modC if included)

  • Analyses all available condition pairs from the metadata, preserving the sample sheet order for use with the --condition-order parameter. Ensure the control group is the first one listed in the metadata file; or analyses only the conditions specified in the config.txt, with the left-most condition name as the control group.

  • Applies multiple testing correction (FDR)

  • Incorporates covariates if specified

Note

The Core Workflow defaults to --overdispersion both, which runs the DMR model twice—once with and once without overdispersion correction—in a single invocation. Two BED result files are produced with _with_od_* and _without_od_* suffixes (followed by timestamps). This allows you to compare results and choose the most appropriate model for your data. For datasets with high variability or small sample sizes, overdispersion correction often improves statistical inference. For more information on when to use overdispersion, see Overdispersion correction.

Key analysis parameters:

  • Methylation contexts: Analyses mc and hmc (and modc if included)

  • Depth filtering: Should be applied according to the sequencing depth and methylation context coverage, and specified in the config.txt file

  • Covariate adjustment: Controls for confounding variables when specified in the config.txt file

#------------------------------------------------------------------------------
# STEP 3: DMR CALLING
#------------------------------------------------------------------------------
printf "\n\n===========================\n"
printf "STEP 3: RUNNING DMR CALLING\n"
printf "===========================\n"

# Build covariate / depth_filter / condition_order / overdispersion flags if present
COV_FLAGS=""
DEPTH_ARG=""
CONDITION_ARG=""
OVERDISPERSION_ARG=""

[[ -n "$DEPTH_FILTER" ]] && DEPTH_ARG="--filter-context-depth $DEPTH_FILTER"
[[ -n "$CONDITION_ORDER" ]] && CONDITION_ARG="--condition-order $CONDITION_ORDER"

if [[ -n "$COVARIATES" ]]; then
  # Use space-separated covariates directly for --covariates flag
  COV_FLAGS="--covariates $COVARIATES"
fi

# Set default for overdispersion if not specified
[[ -z "$OVERDISPERSION" ]] && OVERDISPERSION="both"

# Build overdispersion argument
OVERDISPERSION_ARG="--overdispersion $OVERDISPERSION"
log_info "Running DMR calling with overdispersion mode: $OVERDISPERSION"

# Run DMR calling on each region file individually
for bedfile in "$REGIONS_DIR"/*.bed "$REGIONS_DIR"/*.bed.gz "$REGIONS_DIR"/*.tsv "$REGIONS_DIR"/*.tsv.gz; do
  # Skip if no files match pattern
  [[ -e "$bedfile" ]] || continue
  [[ "${METADATA##*.}" == "tsv" && $(realpath "$bedfile") == $(realpath "$METADATA") ]] && continue

  log_info "Processing region file: $bedfile"

  # Extract region file name without path and extension
  region_name=$(basename "$bedfile")
  region_name="${region_name%%.bed.gz}"
  region_name="${region_name%%.bed}"
  region_name="${region_name%%.tsv.gz}"
  region_name="${region_name%%.tsv}"

  # Set region-specific output directory
  region_output_dir="$MAIN_OUTPUT/$region_name"

  # Run DMR calling
  # --covariates cov1 cov2, if COVARIATES="cov1 cov2" in config.txt
  # --filter-context-depth 15, if DEPTH_FILTER=15 in config.txt
  # --condition-order condition1,condition2, if CONDITION_ORDER="condition1,condition2" in config.txt
  # --overdispersion both, if OVERDISPERSION="both" in config.txt (or default)

  modality -v dmr call \
    --zarr-path "${ZARR_ARRAY[@]}" \
    --sample-sheet "$METADATA" \
    --condition-array-name "$GROUP_COLUMN" \
    --methylation-contexts $MODC_CONTEXTS \
    --bedfile "$bedfile" \
    $COV_FLAGS \
    $DEPTH_ARG \
    $CONDITION_ARG \
    $OVERDISPERSION_ARG \
    --output-dir "$region_output_dir"

  log_info "DMR calling completed for $bedfile"
done

log_info "All DMR calling completed"

Expected outputs:

  • DMR_* directory: All DMR results (for all group pairs and contexts) are stored in a single directory with a timestamp

  • *.bed files: DMR results in BED3+ format for each condition pair and methylation context

  • *.ini files: Configuration files for generating custom tracks plots via the modality XPLR CLI. One .ini file is generated for each DMR .bed result file

Advanced customisation and extension#

The Core Workflow is designed to be easily adapted for different research needs. Here are common modifications and extensions:

Customising analysis parameters#

Adjusting coverage thresholds:

# In the config.txt or script
DEPTH_FILTER=15  # Increase for higher quality regions

# Or modify specific commands directly using the --min-coverage option
modality get regional-frac \
  --zarr-path "${ZARR_ARRAY[@]}" \
  --bedfile "$bedfile" \
  --fields mc hmc modc \
  --min-coverage 15 \  # Custom minimum coverage
  --output-dir "$RESULTS_DIR"

Modify the methylation contexts in DMR calling:

# Modify the methylation contexts in DMR calling
modality dmr call \
  --methylation-contexts mc hmc \  # Exclude modc if not needed
  # ... other parameters

Note

The Core Workflow defaults to --overdispersion both, which produces two result files per analysis in the same timestamped DMR directory: one with overdispersion correction (_with_od_*.bed) and one without (_without_od_*.bed), where the suffix appears before the timestamp. This allows comparison of results under different statistical models. If --overdispersion true or false is specified instead, only one result file per analysis is produced without a suffix.

Performance optimisation tips#

For large datasets:

  • Use subset regions for initial testing

  • Implement parallel processing for independent steps

  • Monitor disk space and memory usage

  • Consider using HPC environments for large-scale analysis

For production use:

  • Add comprehensive logging and monitoring

  • Implement checkpointing for long-running analyses

  • Add email notifications for completion/failure

  • Include resource usage reporting

Script completion and monitoring#

The script includes basic completion reporting and performance tracking:

#------------------------------------------------------------------------------
# COMPLETION
#------------------------------------------------------------------------------
log_info "Pipeline completed successfully. Results are available in $MAIN_OUTPUT"

# Calculate and report duration
end_time=$(date +%s)
duration=$((end_time - start_time))
log_info "Script completed in $duration seconds"

# Exit cleanly
exit 0

Monitoring features:

  • Execution time tracking: Reports total pipeline duration

  • Success confirmation: Clear indication of successful completion

  • Output location reporting: Directs users to results

  • Clean exit handling: Proper script termination

Monitoring execution:

The script provides real-time logging output. You can also redirect output to a log file using the code below, or code it into the script as described in Script setup and configuration.

./core-workflow-v2.0.sh 2>&1 | tee workflow.log

Expected runtime:

Runtime varies significantly based on:

  • Dataset size: Number of samples and genomic regions

  • Analysis complexity: Number of methylation contexts and covariates

  • System resources: Available CPU and memory

  • Storage speed: Local vs network storage performance

This modular workflow provides a comprehensive foundation for methylation analysis using the modality XPLR toolkit. By understanding each component, you can adapt it to meet your specific research requirements while maintaining robust error handling and organised outputs.

Warning

Version compatibility: This pipeline is designed for modality XPLR version 1.1.0 and later. Results are optimised for visualisation in modality Viewer (beta). Some features may not be available in earlier versions. Check your modality XPLR version with modality --version before running.

Tip

Getting help: If you encounter issues, first try enabling debug mode (DEBUG=true) for detailed logging. The modular design allows you to run individual sections for troubleshooting specific steps.