Core Workflow#
The modality XPLR Core Workflow is a simple-to-use analysis tool that leverages bash scripting to automate key modality XPLR commands on biomodal data contained in the Zarr store. The Core Workflow is well-suited for exploration of pilot study data, or for new users who want rapid insights from their data through a standardised, low-code approach.
For experienced users who want a deep dive into the Core Workflow script, proceed directly to the Advanced analysis workflow documentation.
This section describes how to set up and run the modality XPLR Core Workflow.
This workflow is designed for human, biomodal 6-base datasets (including 5mC, 5hmC, and optionally modC), that are aligned to genome reference hg38.
The workflow is pre-configured to analyse gene bodies and promoter regions only. See the Pre-prepared BED files section in the main documentation for more information on how the region files were generated.
Gene bodies:
gencode.v44.human.genes.annotation.v1.1.0.bed.gzPromoter regions:
gencode.v44.human.promoters.annotation.v1.1.0.bed.gz
The Core Workflow performs the following analyses and outputs:
Biological QC
biological-qcto summarise the dataset and sample information, with Pearson correlation and PCA plots for genome-wide5mC,5hmC, and optionallymodCdata. See Biological QC for more information.get meanto generate a.tsv.gzdata file and support violin and correlation plots, showing the mean CpG coverage in gene bodies and promoter regions. Context depth filter is not applied to this command. See Extracting methylation statistics for more information on this and othergetcommands.get regional-fracto generate a.tsv.gzdata file and support violin and correlation plots, showing the fraction of5mC,5hmC, and optionallymodC(overtotal_c) in gene bodies and promoter regions, with an optional minimum depth filter applied.get countto generate a.tsv.gzdata file containing the total number of contexts (total_c) in each gene body and promoter region. Context depth filter is not applied to this command, as it is not applicable to thecountcommand.
Biological insight
dmr callto identify differentially methylated regions (DMRs) in gene bodies and promoter regions, for5mC,5hmC, and optionallymodCcontexts, across all specified groups with an optional context depth filter applied. By default, the command generates two sets of results: one with overdispersion correction (_with_od_*.bed) and one without (_without_od_*.bed), stored in a single timestampedDMR_directory for easy comparison. This command generates separate.bedfiles for each methylation context and group pair. See Calling differentially methylated regions (DMRs) for more information.The DMR results can then be visualised in modality XPLR Viewer (beta), including volcano plots, p-value distribution histograms and Manhattan plots. The Viewer (beta) allows DMR filtering by significance and export of the filtered results.
The Viewer (beta) also supports generation of track plots for individual DMRs, showing the methylation signal across samples in the region. Custom track plots can be configured further using the
.inifile generated with each DMR result, via the CLI commandmodality tracks.
The Core Workflow is well-suited for studies involving:
Disease vs control comparisons
Treatment vs untreated samples
Time-course experiments
Multi-group studies with covariates
Video: Running the modality XPLR Core Workflow#
Requirements and prerequisites#
Software requirements:
modality XPLR (version 1.1.0 or later is recommended, see Installation of modality XPLR)
Bash shell (version 4.0 or later)
Standard Unix utilities (awk, sed, find, etc.)
Data input and format requirements:
Zarr store(s): Generated by the upstream biomodal duet software, containing methylation data. Zarr stores must be readable by modality XPLR (generated by biomodal duet software v1.2 or later).
Sample metadata:
.csvor.tsvfile with sample information, group names and optional covariates (also referred to as the sample sheet). The metadata file (Sample sheet) must be formatted as described in the main documentation.
Core Workflow configuration#
For ease-of-use, the Core Workflow can be configured and downloaded using the Interactive Core Workflow generator tool below.
This generator allows you to specify input parameters, and then download a complete .zip folder containing the workflow script, configuration file, and pre-prepared gene and promoter regions files.
Core Workflow folder contents:
modality-core-workflow_YYYYMMDD_HHMMSS/
├── core-workflow-v2.0.sh # The Core Workflow script
├── config.txt # Configuration file for modality XPLR commands
├── Regions/ # Directory containing pre-prepared region files
│ ├── gencode.v44.human.genes.annotation.v1.1.0.bed.gz
│ └── gencode.v44.human.promoters.annotation.v1.1.0.bed.gz
├── README.txt # Usage instructions
└── Output/ # Output directory(created during execution)
Interactive Core Workflow generator#
🔧 Core Workflow generator
Generate the modality XPLR Core Workflow .zip folder with a standardised directory structure. The folder name is timestamped upon creation and includes the workflow script, configuration file, and pre-prepared genomic region files. The displayed form values below are examples only, and must be replaced in order to populate the config.txt. After generating the .zip folder, a preview panel will appear. Check that your settings have been written correctly. * denotes required fields
Running the Core Workflow#
To run the Core Workflow, follow these steps:
Download the Core Workflow .zip folder to your working location and extract it.
In the terminal, run the following lines.
# Change to the Core Workflow directory
cd /path/to/modality-core-workflow_YYYYMMDD_HHMMSS
# Make the script executable
chmod +x core-workflow-v2.0.sh
# Run the Core Workflow script
./core-workflow-v2.0.sh
Monitor the output logs for progress and any potential errors.
Advanced Core Workflow usage#
The config.txt file contains all the parameters for running the Core Workflow script. You should not need to modify this file, as it is pre-configured by the Interactive Core Workflow generator.
Advanced users can modify the config.txt file to change parameters such as the Zarr store path, metadata file, group column, covariates, and minimum depth filter. In addition, the Output directory path can be changed to a different location if required.
This may be useful if you want to re-run the analysis with different parameters (e.g. context depth filter, or covariates), without needing to download the Core Workflow folder again.
It is also possible to provide additional region files in the Regions/ directory, or to change the region files used by the Core Workflow.
Refer to the main documentation for the required format of the Region files.
Apart from the Biological QC HTML report (which operates on Zarr store(s)), the Core Workflow will perform all analysis steps on each region file found in the Regions/ directory.
For advanced users who are familiar with bash scripting, a description of the Core Workflow code is provided in the Advanced analysis workflow section, which can be used to customise the workflow further.
Interpreting Core Workflow outputs#
The Core Workflow generates a comprehensive set of outputs organised into structured directories. Understanding this organisation is crucial for interpreting results and sharing findings. Each analysis produces timestamped outputs and provenance metadata for traceability and reproducibility of the results.
The workflow organises outputs with BioQC_ (timestamped biological QC results) directly under the main output directory, and region-specific subdirectories (one per region file) that each contain both Extract_ (feature extraction results) and DMR_ (DMR calling results) timestamped subdirectories. When overdispersion mode is set to both (default), each DMR analysis produces two BED files with _with_od_* and _without_od_* suffixes (followed by timestamps) for easy comparison.
Video: Reviewing Core Workflow results#
Output directory structure#
A modality_cli.log file is generated in the working directory and contains logging information relating to the modality XPLR CLI commands only (not the Core Workflow bash outputs).
This file is generated automatically by modality XPLR and is not directly controlled by the Core Workflow script. Other outputs are described below.
Output/
├── BioQC_YYYYMMDD_HHMMSS/ # Biological QC results directory
│ ├── biological_qc_report_*.html # Biological QC HTML report
│ └── modality_metadata_YYYYMMD_HHMMSS.json # Provenance metadata file
├── <region_file_name_1>/ # Results for first region bedfile (e.g. genes)
│ ├── Extract_YYYYMMDD_HHMMSS/ # Feature extraction results directory
│ │ ├── Extract_*.tsv.gz # Feature extraction results file (compressed)
│ │ ├── Extract_*.html # Violin and Correlation plots (mean and regional-frac operations only)
│ │ └── modality_metadata_YYYYMMD_HHMMSS.json # Provenance metadata file
│ └── DMR_YYYYMMDD_HHMMSS/ # DMR analysis results directory
│ ├── DMR_*.bed # DMR result files (see note below)
│ ├── DMR_*.ini # Configuration files for making track plots with the 'tracks' CLI command
│ └── modality_metadata_YYYYMMD_HHMMSS.json # Provenance metadata file
├── <region_file_name_2>/ # Results for second region bedfile (e.g. promoters)
│ ├── Extract_YYYYMMDD_HHMMSS/
│ └── DMR_YYYYMMDD_HHMMSS/
└── <region_file_name_N>/ # Results for Nth region bedfile
├── Extract_YYYYMMDD_HHMMSS/
└── DMR_YYYYMMDD_HHMMSS/
Note
DMR filename conventions vary by overdispersion correction setting:
--overdispersion both(Core Workflow default): Produces two files per analysis with patterns_with_od_*.bed(with correction) and_without_od_*.bed(without correction), where * represents the timestamp--overdispersion trueor--overdispersion false: Produces one file per analysis namedDMR_*.bed(no suffix). The overdispersion setting is recorded in the provenance metadata JSON file.
Reviewing results#
Initiate modality XPLR Viewer (beta)
Start a Viewer session by invoking the
modality viewercommand in the terminal.In the command, specify the root directory (
-r) as the main output directory of the Core Workflow.Navigate to the indicated localhost URL in your web browser to access the Viewer interface.
Biological QC
Click on the tab for BioQC Results and open the analysis card for the Biological QC report.
Check sample clustering in the genome-wide PCA analysis. Use the colour-based grouping options directly on the report page to visualise sample relationships. Groups are defined by the metadata (sample sheet) file.
Verify expected grouping patterns. Does the data make biological sense for your experimental design?
Identify any outlier samples.
Assess overall sample mean CpG coverage in the Sample Information table. You can optionally export the metadata table including the per-sample CpG strand-merged coverage information.
Click on the tab for Extract Results, and filter to the region-specific analysis cards containing the modification types of interest.
Review the region-specific feature extraction plots (violin plots and correlation heatmaps) for mean CpG coverage, and 5mC and 5hmC fractions (and modC if included in the analysis). These visualisations provide insights into feature distributions and sample correlations.
Biological insights
Click on the tab for DMR Results, then filter to the region-specific analysis cards containing the group comparisons and modification types of interest.
By default, the Core Workflow runs with overdispersion mode set to
both, which produces two BED files for each analysis: one with overdispersion correction (_with_od_*.bed) and one without (_without_od_*.bed). You can toggle between these results on the analysis card in the Viewer. The overdispersion correction is designed to improve p-value distributions and DMR detection in datasets with high variability or small sample sizes. We recommend reviewing results with overdispersion correction first, then comparing to results without correction. Be sure to check the p-value distributions to determine which setting is most appropriate for your data.The DMR comparison report shows all selected analyses on one page. To view a detailed DMR Report (including p-value distribution and Manhattan plot) for a single analysis, click the report icon above the respective volcano plot.
Apply significance filters and plot configurations via the available settings. Export images and filtered DMR result files as needed.
Note the region annotations in the DMR results table, which can be used to gain further biological insights into the DMRs.
Select a single DMR to see
tracksplot options. Alternatively use the accompanying.inifiles to generate custom tracks plots using themodality tracksCLI command.
Validate findings
Check for consistency with known biology
Consider experimental design and potential confounders
Compare overdispersion-corrected vs uncorrected results (both provided when using
--overdispersion both)Use track plots to validate DMRs in surrounding context.
Plan follow-up validation experiments or to increase sample size if necessary
Common result patterns and interpretation#
Healthy DMR patterns:
Detection of significant DMRs between experimental groups
Consistent p-value distributions (uniform with some enrichment near zero)
Biologically meaningful effect sizes
Concordant patterns between 5mC and 5hmC where expected
Potential issues to watch for:
Poor group separation: May indicate experimental design issues
Skewed p-value distributions: May benefit from Overdispersion correction or other statistical adjustments
Very large effect sizes: May indicate technical artifacts
Inconsistent patterns: Could suggest batch effects or confounders
Biological interpretation guidelines:
Promoter hypermethylation: Often associated with deactivated regions
5hmC enrichment: Often associated with active demethylation (transition to active regions), or a marker of active gene bodies.
Concordant changes: 5mC loss with 5hmC gain may suggest active demethylation
Advanced analysis workflow#
This section provides a more detailed description of the Core Workflow bash script, which may be beneficial to users who want to modify the script directly or build their own workflows.
The direct download link to the Core Workflow script is provided below, which provides the .sh file only.
Download: core-workflow-v2.0.sh - Core Workflow script assembled from all code sections below.
Download a test dataset#
You can download a test folder containing mouse ES-E14 data to run the Core Workflow. The outputs are not intended to reveal biological insight, as the samples are technical replicates. However, it can be used to test that the script runs to completion.
The E14-example test folder contains:
A sample metadata file
E14_chromosome_19.tsvA zarr store
E14_chromosome_19.zarrzcontaining methylation dataA directory
Regionscontaininggencode.vM25.mouse.promoters.annotation.bedA
config.txtfile with example parameters
Download: E14-example.zip
Note
The E14-example.zip does not contain the core-workflow-v2.0.sh script. This will need to be downloaded separately and run from within the E14-example directory after extraction.
Advanced configuration file setup#
The Core Workflow uses a simple config.txt file for all settings. This design separates configuration from code, making it easy to run the same pipeline with different datasets or parameters.
If you are new to modality XPLR or have not yet run the Core Workflow, we suggest using the Interactive Core Workflow generator to get started.
Configuration file:
This file must be present in your working directory before running the pipeline, with the name config.txt.
Alternatively use the Advanced interactive configuration file generator below to create this file with additional options.
The aim of the config.txt is to pre-define all parameters required by the workflow script. If you modify the Core Workflow script to include additional options or settings, these can be added to the config.txt as variables.
For the Core Workflow, the config.txt file should contain the following parameters (parameter values are examples):
Zarr=/path/to/data.zarrz
Metadata=/path/to/metadata.csv
Group_Column=label
Condition_Order=condition1,condition2
Covariates=age,batch,sex
QC_Region=chr1
Regions_Directory=/path/to/regions
Main_Output=/path/to/output_dir
Depth_Filter=5
Include_modC=true
Overdispersion=both
Configuration parameters explained:
Zarr: Path to zarr store(s). Use space separation for multiple paths
Metadata: Path to
.csvor.tsvsample sheet file containing sample metadataGroup_Column: Column name in the metadata file, for grouping samples (e.g., “treatment” or “condition”)
Covariates: Space-separated list of covariate columns (optional) (e.g., “age batch sex race”)
QC_Region: Region specification for Biological QC analysis (optional) (e.g., “chr1” or “chr1:1000-2000”)
Regions_directory: Directory containing BED/TSV files in BED3+ format, with genomic regions and annotations. See Region files for details
Main_Output: Base directory where all results will be written
Depth_Filter: Minimum depth filter for analyses (leave it blank to disable filtering)
Include_modC: Include modC modification type in analyses (“true” or “false”, optional, default: true)
Overdispersion: Overdispersion mode for DMR analysis: “true”, “false”, or “both” (optional, Core Workflow default: both). When set to “both”, produces two BED files per analysis—one with and one without overdispersion correction
For a description of Sample sheet and Region files formats, see the linked modality XPLR documentation sections.
The config.txt can be generated using the Advanced interactive configuration file generator below or created manually in a text editor.
Advanced interactive configuration file generator#
🔧 Advanced Configuration File Generator
Use this form to generate and download your the config.txt file. A download link for the core-workflow-v2.0.sh is also provided. The displayed values are indications only, and must be replaced in order to populate the config.txt. After generating the file, a preview panel will appear. Check that your settings have been written correctly.
Core Workflow overview: step-by-step#
The Core Workflow script is organised into distinct sections that can be modified independently:
Setup and Initialisation: Script configuration and logging setup
Input Validation: Comprehensive checks of all inputs before execution
Analysis Steps: Three sequential analysis stages, designed to deliver Biological QC and Biological Insight analyses.
Script setup and configuration#
The script begins with essential setup including logging configuration and error handling.
Note
This code block includes instructions for enabling logging to a file, which may be useful for debugging and tracking script execution.
To enable, uncomment (remove #) from the code lines as directed.
#!/usr/bin/env bash
###############################################################################
# modality XPLR Core Workflow Pipeline
#
# This script runs a comprehensive analysis pipeline including:
# - Biological QC
# - Feature extraction
# - DMR calling
#
# Input requirements:
# - config.txt file in current directory with required parameters
# - Zarr store(s) containing methylation data
# - Sample metadata CSV or TSV file
# - Directory with region BED/TSV files
#
# This script is intended for running on modality XPLR version 1.1.0 or later.
# Results are designed to support visualisations in modality Viewer (beta).
#
###############################################################################
#------------------------------------------------------------------------------
# CONFIGURATION
#------------------------------------------------------------------------------
# Set script behaviour
set -euo pipefail
# Record start time for performance tracking
start_time=$(date +%s)
# Log file setup - uncomment to enable logging to file
# LOG_FILE="./modality_pipeline_$(date +%Y%m%d_%H%M%S).log"
# exec > >(tee -a "$LOG_FILE") 2>&1
Helper functions#
These logging functions provide consistent feedback throughout the pipeline execution.
#------------------------------------------------------------------------------
# HELPER FUNCTIONS
#------------------------------------------------------------------------------
# Logging functions
log_info() {
echo "[INFO] $(date '+%Y-%m-%d %H:%M:%S') - $1"
}
log_warning() {
echo "[WARNING] $(date '+%Y-%m-%d %H:%M:%S') - $1" >&2
}
log_error() {
echo "[ERROR] $(date '+%Y-%m-%d %H:%M:%S') - $1" >&2
}
log_debug() {
if [[ "${DEBUG:-false}" == "true" ]]; then
echo "[DEBUG] $(date '+%Y-%m-%d %H:%M:%S') - $1"
fi
}
Utility functions#
These utility functions handle common tasks like directory validation and report collection.
# Helper to trim whitespace
trim() {
awk '{$1=$1};1'
}
# Function to check if a header exists in metadata
header_exists() {
local col="$1"
local headers=("${@:2}")
for h in "${headers[@]}"; do
[[ "$h" == "$col" ]] && return 0
done
return 1
}
# Function to create directories with validation
create_directory() {
local dir="$1"
if [[ ! -d "$dir" ]]; then
mkdir -p "$dir" 2>/dev/null || { log_error "Cannot create directory $dir"; exit 1; }
fi
if [[ ! -w "$dir" ]]; then
log_error "No write access to directory $dir"
exit 1
fi
log_info "Directory ready: $dir"
}
Initialisation and configuration loading#
This section initialises the script and loads the configuration file. If you want to change the name of the input configuration file from the default config.txt, you can do so by changing the CONFIG_FILE variable in the script.
Similarly, if you want to control the verbosity of workflow logging, you can do so by changing the DEBUG variable in the script to true.
#------------------------------------------------------------------------------
# INITIALISATION
#------------------------------------------------------------------------------
printf "\n=======================================\n"
printf "STEP 0 - CONFIGURATION & INITIALISATION\n"
printf "=======================================\n"
log_info "Starting modality XPLR Core Workflow Pipeline"
# Default values for variables that could be customised
CONFIG_FILE="./config.txt"
DEBUG=false # Set to true to enable debug output
The next part will read the configuration file and set values for all required parameters. If you wish to change existing parameters values, you can do so by editing the configuration (config.txt) file in your working directory.
If you want to add new variables to be used in downstream modality XPLR commands, ensure they are defined here and then prompted to assign a value when reading in the configuration file.
For any variables that should be hard-coded, the value can be set directly in the script and excluded from the configuration file.
#------------------------------------------------------------------------------
# LOAD CONFIGURATION
#------------------------------------------------------------------------------
log_info "Reading configuration from $CONFIG_FILE"
# Validate config file exists
if [[ ! -f "$CONFIG_FILE" ]]; then
log_error "config.txt not found in the current working directory: $(pwd)"
log_error "Please ensure config.txt is present before running this script."
exit 1
fi
# Read config.txt variables
ZARR="" # Path(s) to zarr store(s), comma-separated
METADATA="" # Path to sample metadata CSV or TSV file
GROUP_COLUMN="" # Column name in metadata for grouping samples
COVARIATES="" # Space-separated list of covariate column names (optional)
CONDITION_ORDER="" # Comma-separated order of conditions for DMR analysis (optional)
QC_REGION="" # Region for Biological QC analysis (optional)
REGIONS_DIR="" # Directory containing region BED/TSV files
MAIN_OUTPUT="" # Base output directory
DEPTH_FILTER="" # Minimum depth filter for analyses, e.g. 10 (optional)
INCLUDE_MODC="" # Include modC in analyses: "true" or "false" (optional, default: true)
OVERDISPERSION="" # Overdispersion mode: "true", "false", or "both" (optional, Core Workflow default: both)
# Parse config file
while IFS='=' read -r key value || [[ -n "$key" ]]; do
key=$(echo "$key" | trim)
value=$(echo "$value" | trim)
case "$key" in
Zarr) ZARR="$value" ;;
Metadata) METADATA="$value" ;;
Group_Column) GROUP_COLUMN="$value" ;;
Covariates) COVARIATES="$value" ;;
Condition_Order) CONDITION_ORDER="$value" ;;
QC_Region) QC_REGION="$value" ;;
Regions_Directory) REGIONS_DIR="$value" ;;
Main_Output) MAIN_OUTPUT="$value" ;;
Depth_Filter) DEPTH_FILTER="$value" ;;
Include_modC) INCLUDE_MODC="$value" ;;
Overdispersion) OVERDISPERSION="$value" ;;
# You could add additional parameters here as needed
esac
done < "$CONFIG_FILE"
# Log parsed configuration
printf "\n\n>>Provided inputs\n"
log_info "ZARR path(s): $ZARR"
log_info "Metadata file: $METADATA"
log_info "Group column: $GROUP_COLUMN"
log_info "Covariates: $COVARIATES"
log_info "Condition order: $CONDITION_ORDER"
log_info "QC region: $QC_REGION"
log_info "Regions directory: $REGIONS_DIR"
log_info "Main output directory: $MAIN_OUTPUT"
log_info "Depth filter: $DEPTH_FILTER"
log_info "Include modC in analyses: $INCLUDE_MODC"
log_info "Overdispersion mode: $OVERDISPERSION"
Input validation#
This section validates all inputs and ensures they exist and are accessible.
Use this section to add any custom validation checks for your specific workflow requirements, file paths, and access permissions, according to the variables defined in the config.txt file.
#------------------------------------------------------------------------------
# VALIDATE INPUTS
#------------------------------------------------------------------------------
printf "\n\n>>Validation & directory creation\n"
log_info "Validating inputs..."
# Parse and validate zarr paths
if [[ -n "$ZARR" ]]; then
# Split comma-separated zarr paths into array
IFS=',' read -r -a ZARR_RAW_ARRAY <<< "$ZARR"
# Process each zarr path to remove quotes and trim whitespace
ZARR_ARRAY=()
for zarr_raw in "${ZARR_RAW_ARRAY[@]}"; do
# Trim leading/trailing whitespace
zarr_clean=$(echo "$zarr_raw" | sed 's/^[[:space:]]*//;s/[[:space:]]*$//')
# Remove surrounding single or double quotes if present
zarr_clean=$(echo "$zarr_clean" | sed 's/^["'"'"']\(.*\)["'"'"']$/\1/')
# Add to processed array if not empty
if [[ -n "$zarr_clean" ]]; then
ZARR_ARRAY+=("$zarr_clean")
fi
done
# Log the number of zarr paths identified and their strings
log_info "Identified ${#ZARR_ARRAY[@]} zarr path(s):"
for i in "${!ZARR_ARRAY[@]}"; do
log_info " [$((i+1))] ${ZARR_ARRAY[i]}"
done
# Validate each zarr path
for zarr in "${ZARR_ARRAY[@]}"; do
if [[ ! -s "$zarr" ]]; then
log_error "Zarr file '$zarr' doesn't contain any data or is not readable"
exit 1
fi
done
else
log_error "No zarr paths specified in config.txt"
exit 1
fi
# Process and validate metadata file path
if [[ -n "$METADATA" ]]; then
# Trim leading/trailing whitespace
METADATA=$(echo "$METADATA" | sed 's/^[[:space:]]*//;s/[[:space:]]*$//')
# Remove surrounding single or double quotes if present
METADATA=$(echo "$METADATA" | sed 's/^["'"'"']\(.*\)["'"'"']$/\1/')
log_info "Processed metadata file path: $METADATA"
else
log_error "No metadata file specified in config.txt"
exit 1
fi
# Process and validate main output directory path
if [[ -n "$MAIN_OUTPUT" ]]; then
# Trim leading/trailing whitespace
MAIN_OUTPUT=$(echo "$MAIN_OUTPUT" | sed 's/^[[:space:]]*//;s/[[:space:]]*$//')
# Remove surrounding single or double quotes if present
MAIN_OUTPUT=$(echo "$MAIN_OUTPUT" | sed 's/^["'"'"']\(.*\)["'"'"']$/\1/')
log_info "Processed main output directory: $MAIN_OUTPUT"
else
log_error "No main output directory specified in config.txt"
exit 1
fi
# Process and validate regions directory path
if [[ -n "$REGIONS_DIR" ]]; then
# Trim leading/trailing whitespace
REGIONS_DIR=$(echo "$REGIONS_DIR" | sed 's/^[[:space:]]*//;s/[[:space:]]*$//')
# Remove surrounding single or double quotes if present
REGIONS_DIR=$(echo "$REGIONS_DIR" | sed 's/^["'"'"']\(.*\)["'"'"']$/\1/')
log_info "Processed regions directory: $REGIONS_DIR"
else
log_error "No regions directory specified in config.txt"
exit 1
fi
# Check metadata file
log_info "Validating metadata file headers..."
if [[ ! -r "$METADATA" ]]; then
log_error "Metadata file not found or not readable: $METADATA"
exit 1
fi
# Check group column and covariates in metadata
HEADER_LINE=$(head -n 1 "$METADATA")
# Remove any line ending characters (Windows \r\n, Mac \r, Unix \n)
HEADER_LINE=$(echo "$HEADER_LINE" | tr -d '\r\n')
# Detect delimiter: use tab if present, otherwise comma
if [[ "$HEADER_LINE" == *$'\t'* ]]; then
DELIM=$'\t'
else
DELIM=','
fi
# Split header into array using the detected delimiter
IFS="$DELIM" read -r -a HEADERS_RAW <<< "$HEADER_LINE"
# Trim whitespace from each header individually
HEADERS=()
for header in "${HEADERS_RAW[@]}"; do
# Trim leading and trailing whitespace from each header
header_clean=$(echo "$header" | sed 's/^[[:space:]]*//;s/[[:space:]]*$//')
if [[ -n "$header_clean" ]]; then
HEADERS+=("$header_clean")
fi
done
if [[ -n "$GROUP_COLUMN" ]]; then
if ! header_exists "$GROUP_COLUMN" "${HEADERS[@]}"; then
log_error "Group column '$GROUP_COLUMN' not found in metadata file header. Please review the metadata file and provide a valid group column."
exit 1
fi
fi
if [[ -n "$COVARIATES" ]]; then
# Convert space-separated list to array
read -r -a COVS <<< "$COVARIATES"
for cov in "${COVS[@]}"; do
cov=$(echo "$cov" | trim)
if ! header_exists "$cov" "${HEADERS[@]}"; then
log_error "Covariate column '$cov' not found in metadata file header"
exit 1
fi
done
fi
log_info "Metadata validation passed"
# Check regions directory
if [[ ! -d "$REGIONS_DIR" ]]; then
log_error "Regions directory not found: $REGIONS_DIR"
exit 1
fi
Output directory setup#
This section creates region-specific subdirectories in MAIN_OUTPUT to organise the results.
All results for each modality XPLR CLI operation (including provenance metadata) will be written to timestamped subdirectories (Extract_* and DMR_*) within the corresponding region-specific folder.
All modality XPLR outputs are timestamped, however you may wish to modify this section to add additional sub-directories for specific analyses, or to change the output directory structure.
#------------------------------------------------------------------------------
# SETUP OUTPUT DIRECTORIES
#------------------------------------------------------------------------------
log_info "Creating output directories under $MAIN_OUTPUT..."
# Create main output directory
create_directory "$MAIN_OUTPUT"
# Create region-specific subdirectories directly in the main output directory
log_info "Creating region-specific subdirectories..."
for bedfile in "$REGIONS_DIR"/*.bed "$REGIONS_DIR"/*.bed.gz "$REGIONS_DIR"/*.tsv "$REGIONS_DIR"/*.tsv.gz; do
# Skip if no files match pattern
[[ -e "$bedfile" ]] || continue
[[ "${METADATA##*.}" == "tsv" && $(realpath "$bedfile") == $(realpath "$METADATA") ]] && continue
# Extract region file name without path and extension
region_name=$(basename "$bedfile")
region_name="${region_name%%.bed.gz}"
region_name="${region_name%%.bed}"
region_name="${region_name%%.tsv.gz}"
region_name="${region_name%%.tsv}"
# Create region-specific subdirectory in main output directory
create_directory "$MAIN_OUTPUT/$region_name"
log_info "Created region directory for: $region_name"
done
Step 1: Biological QC HTML report#
The first analysis step performs biological quality control to provide Zarr store summaries, and to identify sample relationships.
What this script block does:
Generates correlation matrices between samples for 5mC and 5hmC (and modC if included)
Creates PCA plots to visualise sample clustering for each methylation context (grouping options are available on the Viewer report page)
Calculates mean CpG coverage for each sample
Key parameters:
Uses all zarr stores specified in configuration (
config.txt)Allows sample metadata to be used for dynamic colour-based grouping, when available
Note
The Core Workflow does not include the --aggregate-by-group option, which could be included to aggregate sample data for technical replicates prior to QC.
#------------------------------------------------------------------------------
# STEP 1: BIOLOGICAL QC
#------------------------------------------------------------------------------
printf "\n\n======================================\n"
printf "STEP 1: RUNNING BIOLOGICAL QC ANALYSES\n"
printf "======================================\n"
# Build region argument if QC_REGION is set
REGION_ARG=""
[[ -n "$QC_REGION" ]] && REGION_ARG="--region $QC_REGION"
# Set default for Include_modC if not specified
[[ -z "$INCLUDE_MODC" ]] && INCLUDE_MODC="true"
# Build methylation contexts based on Include_modC setting
if [[ "$INCLUDE_MODC" == "true" ]]; then
MODC_CONTEXTS="mc hmc modc"
else
MODC_CONTEXTS="mc hmc"
fi
# Run QC without aggregation
log_info "Running biological QC without aggregation..."
modality -v biological-qc \
--zarr-path "${ZARR_ARRAY[@]}" \
--sample-sheet "$METADATA" \
$REGION_ARG \
--methylation-contexts $MODC_CONTEXTS \
--order-by-group "$GROUP_COLUMN" \
--output-dir "$MAIN_OUTPUT"
log_info "Biological QC completed"
Expected outputs:
biological_qc_report.html: HTML report with sample-wise plots and statistics.Provenance metadata
.jsonfiles, stored in each Results/BioQC_YYYYMMDD_HHMMSS sub-directory that is generated by each modality XPLR CLI command.
Interpreting QC results:
Correlation heatmaps: Correlation coefficients close to 1 indicate high similarity of 5mC and 5hmC (and modC if included) levels between samples
PCA plots: Separation and/or clustering between experimental groups indicates the % variability that is explained by the first two principal components, for 5mC and 5hmC (and modC if included).
Coverage statistics: The mean CpG coverage per sample is an indicator of the usable depth for methylation analysis. See the next step for feature extraction (region-based) coverage statistics.
Outlier detection: Samples that don’t cluster with their expected group may need investigation. Removing a sample from the metadata (sample sheet) file will exclude it from future analysis, without needing to modify the original Zarr file.
Step 2: Region-based Biological QC#
This step used the modality get tool to extract methylation features (statistics) from user-specified genomic regions, for additional quality control and analysis.
The regions can be defined in BED3+ format (.bed` or ``.tsv) files, that are accessible in the REGIONS_DIR, defined in config.txt.
For the Core Workflow script to run correctly, region files must be present. Pre-prepared BED files are provided to get started, or custom files following the same format can be used instead. Use the correct region files for the reference genome used for alignment in the duet Software.
Tip
The Annotation (e.g. gene, or promoter) and Name (e.g. associated gene name) columns should be completed for meaningful interpretation.
The pipeline performs multiple types of statistics extraction to provide comprehensive regional methylation data.
Each statistic type is processed separately by modality XPLR, and the results are stored in timestamped Extract_* subdirectories within each region-specific folder.
Note
Only modality get mean and modality get regional-frac commands support plotting (correlation or violin), which are useful for visualising the distribution of methylation levels across samples.
If no HTML report is generated, then the results will only be written as .tsv.gz files within the respective Extract_* timestamped subdirectories.
What this script block does:
Count extraction: Returns the total number of CpG contexts (
total_c) for each region.Mean calculation: Average coverage of CpG contexts (
total_c) for each region, with violin and correlation plots for individual samples.Regional fraction: Methylation fractions for 5mC and 5hmC (and modC if included) at the pre-specified minimum``DEPTH_FILTER`` value, with violin and correlation plots for individual samples.
Processes each region file in the
REGIONS_DIRindependently. Therefore, expect individual results for each region file and extraction type.
#------------------------------------------------------------------------------
# STEP 2: FEATURE EXTRACTION
#------------------------------------------------------------------------------
printf "\n\n==================================================\n"
printf "STEP 2: RUNNING FEATURE EXTRACTION ON REGION FILES\n"
printf "==================================================\n"
# Process each region file for feature extraction
for bedfile in "$REGIONS_DIR"/*.bed "$REGIONS_DIR"/*.bed.gz "$REGIONS_DIR"/*.tsv "$REGIONS_DIR"/*.tsv.gz; do
# Skip if no files match pattern
[[ -e "$bedfile" ]] || continue
[[ "${METADATA##*.}" == "tsv" && $(realpath "$bedfile") == $(realpath "$METADATA") ]] && continue
log_info "Processing region file: $bedfile"
# Extract region file name without path and extension
region_name=$(basename "$bedfile")
region_name="${region_name%%.bed.gz}"
region_name="${region_name%%.bed}"
region_name="${region_name%%.tsv.gz}"
region_name="${region_name%%.tsv}"
# Set region-specific output directory
region_output_dir="$MAIN_OUTPUT/$region_name"
# 1. Extract count for total_c
log_info "Running feature extraction: count total_c..."
modality -v get count \
--zarr-path "${ZARR_ARRAY[@]}" \
--sample-sheet "$METADATA" \
--bedfile "$bedfile" \
--fields total_c \
--output-dir "$region_output_dir"
# 2. Extract mean for total_c with plots generated automatically
log_info "Running feature extraction: mean total_c (plots generated automatically)..."
modality -v get mean \
--zarr-path "${ZARR_ARRAY[@]}" \
--sample-sheet "$METADATA" \
--bedfile "$bedfile" \
--fields total_c \
--order-by-group "$GROUP_COLUMN" \
--output-dir "$region_output_dir"
# 3. Extract regional-frac for mc, hmc, and optionally modc with depth filter (plots generated automatically)
if [[ "$INCLUDE_MODC" == "true" ]]; then
log_info "Running feature extraction: regional-frac for mc, hmc, and modc with minimum depth $DEPTH_FILTER (plots generated automatically)..."
REGIONAL_FRAC_FIELDS="mc hmc modc"
else
log_info "Running feature extraction: regional-frac for mc and hmc with minimum depth $DEPTH_FILTER (plots generated automatically)..."
REGIONAL_FRAC_FIELDS="mc hmc"
fi
# Build the optional argument if DEPTH_FILTER is set
MCV_ARG=""
[[ -n "$DEPTH_FILTER" ]] && MCV_ARG="--min-coverage $DEPTH_FILTER"
# --min-coverage 30, if DEPTH_FILTER=30 in config.txt
modality -v get regional-frac \
--zarr-path "${ZARR_ARRAY[@]}" \
--sample-sheet "$METADATA" \
--bedfile "$bedfile" \
--fields $REGIONAL_FRAC_FIELDS \
$MCV_ARG \
--order-by-group "$GROUP_COLUMN" \
--output-dir "$region_output_dir"
log_info "Feature extraction completed for $region_name"
done
log_info "Feature extraction completed for all region files"
Expected outputs:
Extract_*directories: Organised by statistic type and region file*.tsv.gzfiles: Compressed quantitative results for each extracted statistic, by region and sample*.htmlreports: Interactive violin and correlation plots showing methylation statistics formeanandregional-fracextraction types.
Note
The optional regional dendrogram heatmap is not included in the Core Workflow script by default but can be added to the HTML report by modifying the modality get mean and/or modality get regional-frac command(s) to include --heatmap. Note that the heatmap will only be drawn if there are ≤200 regions in the region file, so it is not appropriate for use with all promoters and gene bodies. Instead, add a custom subset region file to the Regions directory before running the Core Workflow.
Alternative tools for generating heatmaps are available in modality Viewer (beta).
Interpreting region-based Biological QC statistics:
Region-specific CpG coverage: Should show adequate and consistent coverage across samples and may be used to inform the
DEPTH_FILTERvalue for future analyses.Methylation fraction distributions: Violin and correlation plots reveal methylation heterogeneity.
Outlier samples: Samples with unusual distributions may need investigation.
Customising statistics extraction:
You can modify this section of the script to:
Add different field types (see Zarr store and Extracting methylation statistics sections in the main modality XPLR documentation).
Adjust minimum coverage thresholds for regional fraction extraction.
Include additional statistical summaries (
count,mean,sum,regional-frac).Add an optional
--heatmapparameter to generate regional dendrogram heatmaps formeanand/orregional-fracextraction types with ≤200 regions. Alternative tools for generating heatmaps are available in modality Viewer (beta).
Step 3: DMR calling#
This step identifies differentially methylated regions using statistical modelling, processing each region file individually. DMR calling is the core analytical component that identifies regions with significant methylation differences between two experimental groups.
What this script block does:
Processes each
.bedand.tsvfile in the regions directory separatelyPerforms statistical testing for differential methylation on each region file (see Calling differentially methylated regions (DMRs))
Analyses 5mC and 5hmC modifications separately (and modC if included)
Analyses all available condition pairs from the metadata, preserving the sample sheet order for use with the
--condition-orderparameter. Ensure the control group is the first one listed in the metadata file; or analyses only the conditions specified in theconfig.txt, with the left-most condition name as the control group.Applies multiple testing correction (FDR)
Incorporates covariates if specified
Note
The Core Workflow defaults to --overdispersion both, which runs the DMR model twice—once with and once without overdispersion correction—in a single invocation. Two BED result files are produced with _with_od_* and _without_od_* suffixes (followed by timestamps). This allows you to compare results and choose the most appropriate model for your data. For datasets with high variability or small sample sizes, overdispersion correction often improves statistical inference. For more information on when to use overdispersion, see Overdispersion correction.
Key analysis parameters:
Methylation contexts: Analyses
mcandhmc(andmodcif included)Depth filtering: Should be applied according to the sequencing depth and methylation context coverage, and specified in the
config.txtfileCovariate adjustment: Controls for confounding variables when specified in the
config.txtfile
#------------------------------------------------------------------------------
# STEP 3: DMR CALLING
#------------------------------------------------------------------------------
printf "\n\n===========================\n"
printf "STEP 3: RUNNING DMR CALLING\n"
printf "===========================\n"
# Build covariate / depth_filter / condition_order / overdispersion flags if present
COV_FLAGS=""
DEPTH_ARG=""
CONDITION_ARG=""
OVERDISPERSION_ARG=""
[[ -n "$DEPTH_FILTER" ]] && DEPTH_ARG="--filter-context-depth $DEPTH_FILTER"
[[ -n "$CONDITION_ORDER" ]] && CONDITION_ARG="--condition-order $CONDITION_ORDER"
if [[ -n "$COVARIATES" ]]; then
# Use space-separated covariates directly for --covariates flag
COV_FLAGS="--covariates $COVARIATES"
fi
# Set default for overdispersion if not specified
[[ -z "$OVERDISPERSION" ]] && OVERDISPERSION="both"
# Build overdispersion argument
OVERDISPERSION_ARG="--overdispersion $OVERDISPERSION"
log_info "Running DMR calling with overdispersion mode: $OVERDISPERSION"
# Run DMR calling on each region file individually
for bedfile in "$REGIONS_DIR"/*.bed "$REGIONS_DIR"/*.bed.gz "$REGIONS_DIR"/*.tsv "$REGIONS_DIR"/*.tsv.gz; do
# Skip if no files match pattern
[[ -e "$bedfile" ]] || continue
[[ "${METADATA##*.}" == "tsv" && $(realpath "$bedfile") == $(realpath "$METADATA") ]] && continue
log_info "Processing region file: $bedfile"
# Extract region file name without path and extension
region_name=$(basename "$bedfile")
region_name="${region_name%%.bed.gz}"
region_name="${region_name%%.bed}"
region_name="${region_name%%.tsv.gz}"
region_name="${region_name%%.tsv}"
# Set region-specific output directory
region_output_dir="$MAIN_OUTPUT/$region_name"
# Run DMR calling
# --covariates cov1 cov2, if COVARIATES="cov1 cov2" in config.txt
# --filter-context-depth 15, if DEPTH_FILTER=15 in config.txt
# --condition-order condition1,condition2, if CONDITION_ORDER="condition1,condition2" in config.txt
# --overdispersion both, if OVERDISPERSION="both" in config.txt (or default)
modality -v dmr call \
--zarr-path "${ZARR_ARRAY[@]}" \
--sample-sheet "$METADATA" \
--condition-array-name "$GROUP_COLUMN" \
--methylation-contexts $MODC_CONTEXTS \
--bedfile "$bedfile" \
$COV_FLAGS \
$DEPTH_ARG \
$CONDITION_ARG \
$OVERDISPERSION_ARG \
--output-dir "$region_output_dir"
log_info "DMR calling completed for $bedfile"
done
log_info "All DMR calling completed"
Expected outputs:
DMR_*directory: All DMR results (for all group pairs and contexts) are stored in a single directory with a timestamp*.bedfiles: DMR results in BED3+ format for each condition pair and methylation context*.inifiles: Configuration files for generating custom tracks plots via the modality XPLR CLI. One.inifile is generated for each DMR.bedresult file
Advanced customisation and extension#
The Core Workflow is designed to be easily adapted for different research needs. Here are common modifications and extensions:
Customising analysis parameters#
Adjusting coverage thresholds:
# In the config.txt or script
DEPTH_FILTER=15 # Increase for higher quality regions
# Or modify specific commands directly using the --min-coverage option
modality get regional-frac \
--zarr-path "${ZARR_ARRAY[@]}" \
--bedfile "$bedfile" \
--fields mc hmc modc \
--min-coverage 15 \ # Custom minimum coverage
--output-dir "$RESULTS_DIR"
Modify the methylation contexts in DMR calling:
# Modify the methylation contexts in DMR calling
modality dmr call \
--methylation-contexts mc hmc \ # Exclude modc if not needed
# ... other parameters
Note
The Core Workflow defaults to --overdispersion both, which produces two result files per analysis in the same timestamped DMR directory: one with overdispersion correction (_with_od_*.bed) and one without (_without_od_*.bed), where the suffix appears before the timestamp. This allows comparison of results under different statistical models. If --overdispersion true or false is specified instead, only one result file per analysis is produced without a suffix.
Performance optimisation tips#
For large datasets:
Use subset regions for initial testing
Implement parallel processing for independent steps
Monitor disk space and memory usage
Consider using HPC environments for large-scale analysis
For production use:
Add comprehensive logging and monitoring
Implement checkpointing for long-running analyses
Add email notifications for completion/failure
Include resource usage reporting
Script completion and monitoring#
The script includes basic completion reporting and performance tracking:
#------------------------------------------------------------------------------
# COMPLETION
#------------------------------------------------------------------------------
log_info "Pipeline completed successfully. Results are available in $MAIN_OUTPUT"
# Calculate and report duration
end_time=$(date +%s)
duration=$((end_time - start_time))
log_info "Script completed in $duration seconds"
# Exit cleanly
exit 0
Monitoring features:
Execution time tracking: Reports total pipeline duration
Success confirmation: Clear indication of successful completion
Output location reporting: Directs users to results
Clean exit handling: Proper script termination
Monitoring execution:
The script provides real-time logging output. You can also redirect output to a log file using the code below, or code it into the script as described in Script setup and configuration.
./core-workflow-v2.0.sh 2>&1 | tee workflow.log
Expected runtime:
Runtime varies significantly based on:
Dataset size: Number of samples and genomic regions
Analysis complexity: Number of methylation contexts and covariates
System resources: Available CPU and memory
Storage speed: Local vs network storage performance
This modular workflow provides a comprehensive foundation for methylation analysis using the modality XPLR toolkit. By understanding each component, you can adapt it to meet your specific research requirements while maintaining robust error handling and organised outputs.
Warning
Version compatibility: This pipeline is designed for modality XPLR version 1.1.0 and later. Results are optimised for visualisation in modality Viewer (beta). Some features may not be available in earlier versions. Check your modality XPLR version with modality --version before running.
Tip
Getting help: If you encounter issues, first try enabling debug mode (DEBUG=true) for detailed logging. The modular design allows you to run individual sections for troubleshooting specific steps.