Core Workflow#

The modality XPLR Core Workflow is a simple-to-use analysis tool that leverages bash scripting to automate key modality XPLR commands on biomodal data contained in the Zarr store. The Core Workflow is well-suited for exploration of pilot study data, or for new users who want rapid insights from their data through a standardised, low-code approach.

For experienced users who want a deep dive into the Core Workflow script, proceed directly to the Advanced analysis workflow documentation.

This section describes how to set up and run the modality XPLR Core Workflow. This workflow is designed for human biomodal 6-base datasets (including 5mC and 5hmC) that are aligned to the hg38 reference genome. The workflow is pre-configured to analyse gene bodies and promoter regions only. See the Pre-prepared BED files section in the main documentation for more information on how the region files were generated.

  • Gene bodies: gencode.v44.human.genes.annotation.bed.gz

  • Promoter regions: gencode.v44.human.promoters.annotation.bed.gz

The Core Workflow performs the following analyses and outputs:

Biological QC

  1. biological-qc to generate an HTML report summarising the dataset and sample information, with Pearson correlation and PCA plots for genome-wide 5mC and 5hmC data. See Biological QC for more information.

  2. get mean to generate an HTML report (violin and correlation plots) and a .tsv data file, showing the mean CpG coverage in gene bodies and promoter regions. Context depth filter is not applied to this command. See Extracting methylation statistics for more information on this and other get commands.

  3. get regional-frac to generate an HTML report (violin and correlation plots) and a .tsv data file, showing the fraction of 5mC and 5hmC (over total_c) in gene bodies and promoter regions, with an optional minimum depth filter applied.

  4. get count to generate a .tsv data file containing the total number of contexts (total_c) in each gene body and promoter region. Context depth filter is not applied to this command, as it is not applicable to the count command.

Biological insight

  1. dmr call to identify differentially methylated regions (DMRs) in gene bodies and promoter regions, for both 5mC and 5hmC contexts, across all specified groups with an optional context depth filter applied. This command generates separate .bed files for each methylation context and group pair. See Calling differentially methylated regions (DMRs) for more information.

  2. dmr plot to generate an HTML report for each DMR results file, showing summary information and unfiltered results as a volcano plot and p-value distributions. See Plotting differentially methylated regions (DMRs) for more information.

  3. dmr plot with filtering to generate an HTML report for each DMR results file, showing summary information and filtered results (q-value < 0.05) as a volcano plot and p-value distributions. When filtering is applied, a new DMR results .bed file is generated containing only the filtered DMRs for each region type and methylation context.

Each DMR analysis is also accompanied by an .ini configuration file that can be used for visualising DMR results with modality tracks; see Generate genomic track plots for more information.

The Core Workflow is well-suited for studies involving:

  • Disease vs control comparisons

  • Treatment vs untreated samples

  • Time-course experiments

  • Multi-group studies with covariates

Video: Running the modality XPLR Core Workflow#

Requirements and prerequisites#

Software requirements:

  • modality XPLR (version 1.0.0 or later is recommended, see Installation of modality XPLR)

  • Bash shell (version 4.0 or later)

  • Standard Unix utilities (awk, sed, find, etc.)

Data input and format requirements:

  • Zarr store(s): Generated by the upstream biomodal duet software, containing methylation data. Zarr stores must be readable by modality XPLR (generated by biomodal duet software v1.2 or later).

  • Sample metadata: .csv or .tsv file with sample information, group names and optional covariates (also referred to as the sample sheet). The metadata file (Sample sheet) must be formatted as described in the main documentation.
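For illustration only, a minimal sample sheet might look like the following (the column names here are hypothetical; the exact required format is described in the Sample sheet documentation):

sample_id,label,age,batch
sample_01,control,34,1
sample_02,control,41,2
sample_03,treated,38,1
sample_04,treated,29,2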

Core Workflow configuration#

For ease-of-use, the Core Workflow can be configured and downloaded using the Interactive Core Workflow generator tool below. This generator allows you to specify input parameters, and then download a complete .zip folder containing the workflow script, configuration file, and pre-prepared gene and promoter regions files.

Core Workflow folder contents:

modality-core-workflow_YYYYMMDD_HHMMSS/
├── core-workflow-v1.3.sh                              # The Core Workflow script
├── config.txt                                         # Configuration file for modality XPLR commands
├── Regions/                                           # Directory containing pre-prepared region files
│   ├── gencode.v44.human.genes.annotation.bed.gz
│   └── gencode.v44.human.promoters.annotation.bed.gz
├── README.txt                                         # Usage instructions
└── Output/                                            # Output directory (created during execution)

Interactive Core Workflow generator#

🔧 Core Workflow generator

Generate the modality XPLR Core Workflow .zip folder with a standardised directory structure. The folder name is timestamped upon creation, and the folder includes the workflow script, configuration file, and pre-prepared genomic region files. The form values displayed below are examples only and must be replaced with your own settings to populate the config.txt. After generating the .zip folder, a preview panel will appear; check that your settings have been written correctly. * denotes required fields

The generator prompts for the following inputs:

  • Zarr store path(s): comma-separated paths, if multiple zarr stores are specified

  • Sample metadata: path to your sample metadata CSV or TSV file

  • Group column: column name in your metadata CSV or TSV file for sample grouping

  • Condition order: space-separated order of conditions (control group first)

  • Covariates: space-separated covariate column names from your metadata

  • QC region: limits the Biological QC to a specific region, e.g. "chr" or "chr:<start>-<end>". This can help to speed up analysis and reduce memory usage.

  • Depth filter: minimum context depth for regional-frac and DMR analysis, e.g. 10. Leave blank to disable depth filtering.

  • Overdispersion: apply overdispersion correction for DMR calling.

Running the Core Workflow#

To run the Core Workflow, follow these steps:

  1. Download the Core Workflow .zip folder to your working location and extract it.

  2. In the terminal, run the following lines.

# Change to the Core Workflow directory
cd /path/to/modality-core-workflow_YYYYMMDD_HHMMSS

# Make the script executable
chmod +x core-workflow-v1.3.sh

# Run the Core Workflow script
./core-workflow-v1.3.sh
  3. Monitor the output logs for progress and any potential errors.

Advanced Core Workflow usage#

The config.txt file contains all the parameters for running the Core Workflow script. You should not need to modify this file, as it is pre-configured by the Interactive Core Workflow generator. Advanced users can modify the config.txt file to change parameters such as the Zarr store path, metadata file, group column, covariates, and minimum depth filter. In addition, the Output directory path can be changed to a different location if required. This may be useful if you want to re-run the analysis with different parameters (e.g. context depth filter, or covariates), without needing to download the Core Workflow folder again.
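For example, re-running with a stricter depth filter and a separate output location only requires editing the relevant lines of config.txt (the values shown are illustrative):

Depth_Filter=10
Main_Output=/path/to/output_dir_rerun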

It is also possible to provide additional region files in the Regions/ directory, or to change the region files used by the Core Workflow. Refer to the main documentation for the required format of the Region files. Apart from the Biological QC HTML report (which operates on Zarr store(s)), the Core Workflow will perform all analysis steps on each region file found in the Regions/ directory.

For advanced users who are familiar with bash scripting, a description of the Core Workflow code is provided in the Advanced analysis workflow section, which can be used to customise the workflow further.

Interpreting Core Workflow outputs#

The Core Workflow generates a comprehensive set of outputs organised into structured directories. Understanding this organisation is crucial for interpreting results and sharing findings. Each analysis produces timestamped outputs and provenance metadata for traceability and reproducibility of the results.

The workflow organises outputs by creating a subdirectory within both Results/ and Reports/ for each region file processed. This means that all analysis results (feature extraction, DMR calling, and DMR visualisation) related to a specific region file (e.g., genes, promoters) are grouped together in their respective subdirectories. Similarly, all HTML reports related to a specific region file are organised in matching subdirectories within the Reports/ directory. The Biological QC results, which are not related to a specific region file input, are stored in a BioQC_ timestamped directory in Results/ and the QC HTML report is placed directly in the main Reports/ directory.

Video: Reviewing Core Workflow results#

Output directory structure#

A modality_cli.log file is generated in the working directory and contains logging information relating to the modality XPLR CLI commands only (not the Core Workflow bash outputs). This file is generated automatically by modality XPLR and is not directly controlled by the Core Workflow script. Other outputs are described below.

Output/
├── Results/                                              # All analysis outputs
│   ├── BioQC_YYYYMMDD_HHMMSS/                            # Biological QC results
│   │   ├── biological_qc_report_*.html                   # Biological QC HTML report
│   │   └── modality_metadata_YYYYMMDD_HHMMSS.json        # Provenance metadata file
│   ├── <region_file_name_1>/                             # Results for first region file (e.g. genes)
│   │   ├── Extract_YYYYMMDD_HHMMSS/                      # Feature extraction results
│   │   │   ├── Extract_*.tsv.gz                          # Feature extraction results file (compressed)
│   │   │   ├── Extract_*.html                            # Violin and Correlation plots (mean and regional-frac operations only)
│   │   │   └── modality_metadata_YYYYMMDD_HHMMSS.json    # Provenance metadata file
│   │   ├── DMR_YYYYMMDD_HHMMSS/                          # DMR analysis results
│   │   │   ├── DMR_*.bed                                 # DMR result files
│   │   │   ├── DMR_*.ini                                 # Track configuration files (track plots are not included in this script)
│   │   │   └── modality_metadata_YYYYMMDD_HHMMSS.json    # Provenance metadata file
│   │   └── DMR_Report_YYYYMMDD_HHMMSS/                   # DMR visualisation reports
│   │       ├── DMR_Report_*.html                         # Unfiltered DMR visualisation report
│   │       ├── DMR_Report_*.max_q_0_05_*.html            # Filtered DMR visualisation report
│   │       ├── DMR_*_qval_0_05_filtered_*.bed.gz         # Filtered DMR results file (compressed)
│   │       └── modality_metadata_YYYYMMDD_HHMMSS.json    # Provenance metadata file
│   ├── <region_file_name_2>/                             # Results for second region file (e.g. promoters)
│   │   └── ... (same structure as above)
│   └── <region_file_name_N>/                             # Results for Nth region file (e.g. if you add additional region files to the Regions/ directory)
│       └── ... (same structure as above)
└── Reports/                                              # Organised HTML reports
    ├── biological_qc_report*.html                        # Biological QC HTML report
    ├── <region_file_name_1>/                             # Reports for first region file (e.g. genes)
    │   ├── Extract_*mean*.html                           # Mean CpG coverage violin and correlation plots
    │   ├── Extract_*regional-frac*.html                  # 5mC and 5hmC fraction violin and correlation plots
    │   ├── DMR_Report_*.html                             # Unfiltered DMR visualisation reports
    │   └── DMR_Report_*_max_q_*.html                     # Filtered DMR visualisation reports
    ├── <region_file_name_2>/                             # Reports for second region file (e.g. promoters)
    │   └── ... (same structure as above)
    └── <region_file_name_N>/                             # Reports for Nth region file (e.g. if you add additional region files to the Regions/ directory)
        └── ... (same structure as above)

Reviewing results#

  1. Biological QC

    • Check sample clustering in the Biological QC HTML report PCA analysis (located in the main Reports/ directory). Use the colour-based grouping options directly on the report to visualise sample relationships. Groups are defined by the metadata (sample sheet) file.

    • Verify expected grouping patterns. Does the data make biological sense for your experimental design?

    • Identify any outlier samples.

    • Assess overall sample mean CpG coverage on the Biological QC report (Sample Information table).

    • Review the region-specific feature extraction plots (violin plots and correlation heatmaps) for mean CpG coverage, and 5mC and 5hmC fraction in the region-specific subdirectories within Reports/. These visualisations provide insights into feature distributions and sample correlations.

  2. Biological insights

    • Navigate to the region-specific subdirectories within Reports/ (e.g., Reports/gencode.v44.human.genes.annotation/).

    • Check the unfiltered DMR HTML reports for each region type and methylation context. Check the volcano plots and p-value distributions.

    • Focus on significant DMRs by reviewing the filtered DMR HTML reports (files with max_q_ in the name, indicating q-value < 0.05 filtering).

    • If needed, review the DMR results files (.bed) for specific regions of interest, or to filter or sort based on other metrics.

    • Note the region annotations in the DMR results files, which can be used to gain further biological insights into the DMRs.

  3. Validate findings

    • Check for consistency with known biology

    • Consider experimental design and potential confounders

    • Plan follow-up validation experiments, or increase the sample size if necessary

Common result patterns and interpretation#

Healthy DMR patterns:

  • Detection of significant DMRs between experimental groups

  • Consistent p-value distributions (uniform with some enrichment near zero)

  • Biologically meaningful effect sizes

  • Concordant patterns between 5mC and 5hmC where expected

Potential issues to watch for:

  • Poor group separation: May indicate experimental design issues

  • Skewed p-value distributions: May benefit from Overdispersion correction or other statistical adjustments

  • Very large effect sizes: May indicate technical artifacts

  • Inconsistent patterns: Could suggest batch effects or confounders

Biological interpretation guidelines:

  • Promoter hypermethylation: Often associated with transcriptional repression (deactivated regions)

  • 5hmC enrichment: Often associated with active demethylation (transition to active regions), or a marker of active gene bodies.

  • Concordant changes: 5mC loss with 5hmC gain may suggest active demethylation

Advanced analysis workflow#

This section provides a more detailed description of the Core Workflow bash script, which may be beneficial to users who want to modify the script directly or build their own workflows.

A direct download link to the Core Workflow script (the .sh file only) is provided below.

Download: core-workflow-v1.3.sh - Core Workflow script assembled from all code sections below.

Download a test dataset#

You can download a test folder containing mouse ES-E14 data to run the Core Workflow. The outputs are not intended to reveal biological insight, as the samples are technical replicates. However, it can be used to test that the script runs to completion.

The E14-example test folder contains:

  • A sample metadata file E14_chromosome_19.tsv

  • A zarr store E14_chromosome_19.zarrz containing methylation data

  • A directory Regions containing gencode.vM25.mouse.promoters.annotation.bed

  • A config.txt file with example parameters

Download: E14-example.zip

Note

The E14-example.zip does not contain the core-workflow-v1.3.sh script. This will need to be downloaded separately and run from within the E14-example directory after extraction.
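For example, after extracting E14-example.zip and placing the downloaded script inside it, a typical test run looks like this:

# Change to the extracted test folder
cd /path/to/E14-example

# Make the separately downloaded script executable
chmod +x core-workflow-v1.3.sh

# Run the workflow using the included config.txt
./core-workflow-v1.3.sh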

Advanced configuration file setup#

The Core Workflow uses a simple config.txt file for all settings. This design separates configuration from code, making it easy to run the same pipeline with different datasets or parameters. If you are new to modality XPLR or have not yet run the Core Workflow, we suggest using the Interactive Core Workflow generator to get started.

Configuration file:

This file must be present in your working directory before running the pipeline, with the name config.txt. Alternatively use the Advanced interactive configuration file generator below to create this file with additional options. The aim of the config.txt is to pre-define all parameters required by the workflow script. If you modify the Core Workflow script to include additional options or settings, these can be added to the config.txt as variables. For the Core Workflow, the config.txt file should contain the following parameters (parameter values are examples):

Zarr=/path/to/data.zarrz
Metadata=/path/to/metadata.csv
Group_Column=label
Condition_Order=condition1,condition2
Covariates=age,batch,sex
QC_Region=chr1
Regions_Directory=/path/to/regions
Main_Output=/path/to/output_dir
Depth_Filter=5
Overdispersion=False

Configuration parameters explained:

  • Zarr: Path to zarr store(s). Use commas to separate multiple paths

  • Metadata: Path to .csv or .tsv sample sheet file containing sample metadata

  • Group_Column: Column name in the metadata file, for grouping samples (e.g., “treatment”, “condition”)

  • Condition_Order: Comma-separated order of conditions for DMR analysis, with the control group listed first (optional)

  • Covariates: Space-separated list of covariate columns (optional), e.g. “age batch” or “sex race”

  • QC_Region: Region specification for Biological QC analysis (optional) (e.g., “chr1” or “chr1:1000-2000”)

  • Regions_Directory: Directory containing BED/TSV files in BED3+ format, with genomic regions and annotations. See Region files for details

  • Main_Output: Base directory where all results will be written

  • Depth_Filter: Minimum depth filter for analyses (leave it blank to disable filtering)

  • Overdispersion: Apply overdispersion correction for DMR calling (“True” or “False”, optional)

For a description of Sample sheet and Region files formats, see the linked modality XPLR documentation sections.

The config.txt can be generated using the Advanced interactive configuration file generator below or created manually in a text editor.

Advanced interactive configuration file generator#

🔧 Advanced Configuration File Generator

Use this form to generate and download your config.txt file. A download link for the core-workflow-v1.3.sh is also provided. The displayed values are examples only, and must be replaced in order to populate the config.txt. After generating the file, a preview panel will appear. Check that your settings have been written correctly.

The generator prompts for the following inputs:

  • Zarr store path(s): comma-separated paths if multiple zarr stores

  • Group column: column name in the metadata (sample sheet) file to be used for retrieving sample groups

  • Condition order: space-separated order of conditions (control group first)

  • Covariates: space-separated list of covariate column names in the metadata (sample sheet) file

  • QC region: limits the Biological QC to a specific region, e.g. "chr" or "chr:<start>-<end>". This can help to speed up analysis and reduce memory usage.

  • Regions directory: directory containing pre-prepared BED/TSV region definition files (e.g. promoters, genes etc.)

  • Main output: main output directory, in which all results and reports will be saved

  • Depth filter: minimum context depth for regional-frac and DMR analysis, e.g. 10. Leave blank to disable depth filtering.

  • Overdispersion: apply overdispersion correction for DMR calling.

Core Workflow overview: step-by-step#

The Core Workflow script is organised into distinct sections that can be modified independently:

  1. Setup and Initialisation: Script configuration and logging setup

  2. Input Validation: Comprehensive checks of all inputs before execution

  3. Analysis Steps: Four sequential analysis stages, designed to deliver Biological QC and Biological Insight analyses.

  4. Output Management: Organised collection and reporting of results

Script setup and configuration#

The script begins with essential setup including logging configuration and error handling.

Note

This code block includes instructions for enabling logging to a file, which may be useful for debugging and tracking script execution. To enable, uncomment (remove #) from the code lines as directed.

#!/usr/bin/env bash

###############################################################################
# modality XPLR Core Workflow Pipeline
#
# This script runs a comprehensive analysis pipeline including:
# - Biological QC
# - Feature extraction
# - DMR calling and visualisation
#
# Input requirements:
# - config.txt file in current directory with required parameters
# - Zarr store(s) containing methylation data
# - Sample metadata CSV or TSV file
# - Directory with region BED/TSV files
#
# This script is intended for running on modality XPLR version 1.0.0 or later.
#
###############################################################################

#------------------------------------------------------------------------------
# CONFIGURATION
#------------------------------------------------------------------------------
# Set script behaviour
set -euo pipefail

# Record start time for performance tracking
start_time=$(date +%s)

# Log file setup - uncomment to enable logging to file
# LOG_FILE="./modality_pipeline_$(date +%Y%m%d_%H%M%S).log"
# exec > >(tee -a "$LOG_FILE") 2>&1

Helper functions#

These logging functions provide consistent feedback throughout the pipeline execution.

#------------------------------------------------------------------------------
# HELPER FUNCTIONS
#------------------------------------------------------------------------------
# Logging functions
log_info() {
  echo "[INFO] $(date '+%Y-%m-%d %H:%M:%S') - $1"
}

log_warning() {
  echo "[WARNING] $(date '+%Y-%m-%d %H:%M:%S') - $1" >&2
}

log_error() {
  echo "[ERROR] $(date '+%Y-%m-%d %H:%M:%S') - $1" >&2
}

log_debug() {
  if [[ "${DEBUG:-false}" == "true" ]]; then
    echo "[DEBUG] $(date '+%Y-%m-%d %H:%M:%S') - $1"
  fi
}

Utility functions#

These utility functions handle common tasks like directory validation and report collection.

 # Helper to trim whitespace
trim() {
  awk '{$1=$1};1'
}

# Function to check if a header exists in metadata
header_exists() {
  local col="$1"
  local headers=("${@:2}")

  for h in "${headers[@]}"; do
    [[ "$h" == "$col" ]] && return 0
  done
  return 1
}

# Function to create directories with validation
create_directory() {
  local dir="$1"

  if [[ ! -d "$dir" ]]; then
    mkdir -p "$dir" 2>/dev/null || { log_error "Cannot create directory $dir"; exit 1; }
  fi
  if [[ ! -w "$dir" ]]; then
    log_error "No write access to directory $dir"
    exit 1
  fi
  log_info "Directory ready: $dir"
}

# Function to copy reports to destination
copy_reports() {
  local source_dir="$1"
  local pattern="$2"
  local dest_dir="$3"
  local description="$4"

  log_info "Copying $description HTML reports to $dest_dir..."
  local files=$(find "$source_dir" -type f -name "$pattern")

  if [[ -z "$files" ]]; then
    log_error "No $description HTML reports found in $source_dir"
    return 1
  fi

  local count=0
  while IFS= read -r file; do
    cp "$file" "$dest_dir/"
    count=$((count + 1))  # Safe under 'set -e' (unlike ((count++)), which returns non-zero when count is 0)
  done <<< "$files"

  log_info "Copied $count $description HTML report(s)"
  return 0
}

Initialisation and configuration loading#

This section initialises the script and loads the configuration file. If you want to change the name of the input configuration file from the default config.txt, you can do so by changing the CONFIG_FILE variable in the script. Similarly, if you want to control the verbosity of workflow logging, you can do so by changing the DEBUG variable in the script to true.

#------------------------------------------------------------------------------
# INITIALISATION
#------------------------------------------------------------------------------
printf "\n=======================================\n"
printf "STEP 0 - CONFIGURATION & INITIALISATION\n"
printf "=======================================\n"
log_info "Starting modality XPLR Core Workflow Pipeline"

# Default values for variables that could be customised
CONFIG_FILE="./config.txt"
DEBUG=false  # Set to true to enable debug output

The next part reads the configuration file and sets values for all required parameters. If you wish to change existing parameter values, you can do so by editing the configuration (config.txt) file in your working directory. If you want to add new variables to be used in downstream modality XPLR commands, ensure they are declared here with a default and then assigned a value when the configuration file is read (a sketch follows the code block below). For any variables that should be hard-coded, the value can be set directly in the script and excluded from the configuration file.

#------------------------------------------------------------------------------
# LOAD CONFIGURATION
#------------------------------------------------------------------------------
log_info "Reading configuration from $CONFIG_FILE"

# Validate config file exists
if [[ ! -f "$CONFIG_FILE" ]]; then
  log_error "config.txt not found in the current working directory: $(pwd)"
  log_error "Please ensure config.txt is present before running this script."
  exit 1
fi

# Read config.txt variables
ZARR=""                 # Path(s) to zarr store(s), comma-separated
METADATA=""             # Path to sample metadata CSV or TSV file
GROUP_COLUMN=""         # Column name in metadata for grouping samples
COVARIATES=""           # Space-separated list of covariate column names (optional)
CONDITION_ORDER=""      # Comma-separated order of conditions for DMR analysis (optional)
QC_REGION=""            # Region for Biological QC analysis (optional)
REGIONS_DIR=""          # Directory containing region BED/TSV files
MAIN_OUTPUT=""          # Base output directory
DEPTH_FILTER=""         # Minimum depth filter for analyses, e.g. 10 (optional)
OVERDISPERSION=""       # Apply overdispersion correction: "True" or "False" (optional)

# Parse config file
while IFS='=' read -r key value; do
  key=$(echo "$key" | trim)
  value=$(echo "$value" | trim)
  case "$key" in
    Zarr) ZARR="$value" ;;
    Metadata) METADATA="$value" ;;
    Group_Column) GROUP_COLUMN="$value" ;;
    Covariates) COVARIATES="$value" ;;
    Condition_Order) CONDITION_ORDER="$value" ;;
    QC_Region) QC_REGION="$value" ;;
    Regions_Directory) REGIONS_DIR="$value" ;;
    Main_Output) MAIN_OUTPUT="$value" ;;
    Depth_Filter) DEPTH_FILTER="$value" ;;
    Overdispersion) OVERDISPERSION="$value" ;;
    # You could add additional parameters here as needed
  esac
done < "$CONFIG_FILE"

# Log parsed configuration
printf "\n\n>>Provided inputs\n"
log_info "ZARR path(s): $ZARR"
log_info "Metadata file: $METADATA"
log_info "Group column: $GROUP_COLUMN"
log_info "Covariates: $COVARIATES"
log_info "Condition order: $CONDITION_ORDER"
log_info "QC region: $QC_REGION"
log_info "Regions directory: $REGIONS_DIR"
log_info "Main output directory: $MAIN_OUTPUT"
log_info "Depth filter: $DEPTH_FILTER"
log_info "Overdispersion correction: $OVERDISPERSION"

Input validation#

This section validates all inputs and ensures they exist and are accessible. Use this section to add any custom validation checks for your specific workflow requirements, file paths, and access permissions, according to the variables defined in the config.txt file (an example check is shown after the code block below).

#------------------------------------------------------------------------------
# VALIDATE INPUTS
#------------------------------------------------------------------------------
printf "\n\n>>Validation & directory creation\n"
log_info "Validating inputs..."

# Parse and validate zarr paths
if [[ -n "$ZARR" ]]; then
  # Split comma-separated zarr paths into array
  IFS=',' read -r -a ZARR_RAW_ARRAY <<< "$ZARR"

  # Process each zarr path to remove quotes and trim whitespace
  ZARR_ARRAY=()
  for zarr_raw in "${ZARR_RAW_ARRAY[@]}"; do
    # Trim leading/trailing whitespace
    zarr_clean=$(echo "$zarr_raw" | sed 's/^[[:space:]]*//;s/[[:space:]]*$//')

    # Remove surrounding single or double quotes if present
    zarr_clean=$(echo "$zarr_clean" | sed 's/^["'"'"']\(.*\)["'"'"']$/\1/')

    # Add to processed array if not empty
    if [[ -n "$zarr_clean" ]]; then
      ZARR_ARRAY+=("$zarr_clean")
    fi
  done

  # Log the number of zarr paths identified and their strings
  log_info "Identified ${#ZARR_ARRAY[@]} zarr path(s):"
  for i in "${!ZARR_ARRAY[@]}"; do
    log_info "  [$((i+1))] ${ZARR_ARRAY[i]}"
  done

  # Validate each zarr path
  for zarr in "${ZARR_ARRAY[@]}"; do
    if [[ ! -s "$zarr" ]]; then
      log_error "Zarr file '$zarr' doesn't contain any data or is not readable"
      exit 1
    fi
  done
else
  log_error "No zarr paths specified in config.txt"
  exit 1
fi

# Process and validate metadata file path
if [[ -n "$METADATA" ]]; then
  # Trim leading/trailing whitespace
  METADATA=$(echo "$METADATA" | sed 's/^[[:space:]]*//;s/[[:space:]]*$//')

  # Remove surrounding single or double quotes if present
  METADATA=$(echo "$METADATA" | sed 's/^["'"'"']\(.*\)["'"'"']$/\1/')

  log_info "Processed metadata file path: $METADATA"
else
  log_error "No metadata file specified in config.txt"
  exit 1
fi

# Process and validate main output directory path
if [[ -n "$MAIN_OUTPUT" ]]; then
  # Trim leading/trailing whitespace
  MAIN_OUTPUT=$(echo "$MAIN_OUTPUT" | sed 's/^[[:space:]]*//;s/[[:space:]]*$//')

  # Remove surrounding single or double quotes if present
  MAIN_OUTPUT=$(echo "$MAIN_OUTPUT" | sed 's/^["'"'"']\(.*\)["'"'"']$/\1/')

  log_info "Processed main output directory: $MAIN_OUTPUT"
else
  log_error "No main output directory specified in config.txt"
  exit 1
fi

# Process and validate regions directory path
if [[ -n "$REGIONS_DIR" ]]; then
  # Trim leading/trailing whitespace
  REGIONS_DIR=$(echo "$REGIONS_DIR" | sed 's/^[[:space:]]*//;s/[[:space:]]*$//')

  # Remove surrounding single or double quotes if present
  REGIONS_DIR=$(echo "$REGIONS_DIR" | sed 's/^["'"'"']\(.*\)["'"'"']$/\1/')

  log_info "Processed regions directory: $REGIONS_DIR"
else
  log_error "No regions directory specified in config.txt"
  exit 1
fi

# Check metadata file
log_info "Validating metadata file headers..."
if [[ ! -r "$METADATA" ]]; then
  log_error "Metadata file not found or not readable: $METADATA"
  exit 1
fi

# Check group column and covariates in metadata
HEADER_LINE=$(head -n 1 "$METADATA")

# Remove any line ending characters (Windows \r\n, Mac \r, Unix \n)
HEADER_LINE=$(echo "$HEADER_LINE" | tr -d '\r\n')

# Detect delimiter: use tab if present, otherwise comma
if [[ "$HEADER_LINE" == *$'\t'* ]]; then
  DELIM=$'\t'
else
  DELIM=','
fi

# Split header into array using the detected delimiter
IFS="$DELIM" read -r -a HEADERS_RAW <<< "$HEADER_LINE"

# Trim whitespace from each header individually
HEADERS=()
for header in "${HEADERS_RAW[@]}"; do
  # Trim leading and trailing whitespace from each header
  header_clean=$(echo "$header" | sed 's/^[[:space:]]*//;s/[[:space:]]*$//')
  if [[ -n "$header_clean" ]]; then
    HEADERS+=("$header_clean")
  fi
done

if [[ -n "$GROUP_COLUMN" ]]; then
  if ! header_exists "$GROUP_COLUMN" "${HEADERS[@]}"; then
    log_error "Group column '$GROUP_COLUMN' not found in metadata file header. Please review the metadata file and provide a valid group column."
    exit 1
  fi
fi

if [[ -n "$COVARIATES" ]]; then
  # Convert space-separated list to array
  read -r -a COVS <<< "$COVARIATES"
  for cov in "${COVS[@]}"; do
    cov=$(echo "$cov" | trim)
    if ! header_exists "$cov" "${HEADERS[@]}"; then
      log_error "Covariate column '$cov' not found in metadata file header"
      exit 1
    fi
  done
fi

log_info "Metadata validation passed"

# Check regions directory
if [[ ! -d "$REGIONS_DIR" ]]; then
  log_error "Regions directory not found: $REGIONS_DIR"
  exit 1
fi
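As an example of a custom check (a sketch, not part of the shipped script), you could confirm that Depth_Filter, when set, is a positive integer before any analysis starts:

# Sketch only: validate that Depth_Filter is a positive integer when provided
if [[ -n "$DEPTH_FILTER" && ! "$DEPTH_FILTER" =~ ^[0-9]+$ ]]; then
  log_error "Depth_Filter must be a positive integer, got: $DEPTH_FILTER"
  exit 1
fi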

Output directory setup#

This section creates and defines output sub-directories in MAIN_OUTPUT that can be used to organise the results.

By default, all results for each modality XPLR CLI operation (including provenance metadata) will be written to the Results sub-directory. Then, all HTML reports and visualisations will be copied to the Reports sub-directory for quick access. All modality XPLR outputs are timestamped, however you may wish to modify this section to add additional sub-directories for specific analyses, or to change the output directory structure. The temp sub-directory is used to store intermediate files, such as the merged regions file, and can optionally be deleted after the workflow completes. By default, no delete command is included.

#------------------------------------------------------------------------------
# SETUP OUTPUT DIRECTORIES
#------------------------------------------------------------------------------
log_info "Creating output directories under $MAIN_OUTPUT..."

# Create output directories
for sub in "" "Results" "Reports"; do
  dir="$MAIN_OUTPUT"
  [[ -n "$sub" ]] && dir="$MAIN_OUTPUT/$sub"
  create_directory "$dir"
done

# Define output subdirectories for easier reference
RESULTS_DIR="$MAIN_OUTPUT/Results"
REPORTS_DIR="$MAIN_OUTPUT/Reports"

# Create subdirectories for each region file found in the regions directory, to organise outputs by region file input.
log_info "Creating region-specific subdirectories..."
for bedfile in "$REGIONS_DIR"/*.bed "$REGIONS_DIR"/*.bed.gz "$REGIONS_DIR"/*.tsv "$REGIONS_DIR"/*.tsv.gz; do
  # Skip if no files match pattern
  [[ -e "$bedfile" ]] || continue
  [[ "${METADATA##*.}" == "tsv" && $(realpath "$bedfile") == $(realpath "$METADATA") ]] && continue

  # Extract region file name without path and extension
  region_name=$(basename "$bedfile")
  region_name="${region_name%%.bed.gz}"
  region_name="${region_name%%.bed}"
  region_name="${region_name%%.tsv.gz}"
  region_name="${region_name%%.tsv}"

  # Create region-specific subdirectories in both Results and Reports
  region_results_dir="$RESULTS_DIR/$region_name"
  region_reports_dir="$REPORTS_DIR/$region_name"
  create_directory "$region_results_dir"
  create_directory "$region_reports_dir"
  log_info "Created region directories: $region_results_dir and $region_reports_dir"
done
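If you customise the workflow to stage intermediate files in a temp sub-directory, a minimal sketch for creating it, and optionally deleting it once the run completes, could be (remembering that no delete command is included by default):

# Sketch only: create an additional temp sub-directory for intermediate files
create_directory "$MAIN_OUTPUT/temp"

# Optional cleanup at the end of the script (commented out by default)
# rm -rf "$MAIN_OUTPUT/temp"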

Step 1: Biological QC HTML report#

The first analysis step performs biological quality control to provide Zarr store summaries, and to identify sample relationships.

What this script block does:

  • Generates correlation matrices between samples

  • Creates PCA plots to visualise sample clustering (grouping options are available on the HTML report)

  • Calculates mean CpG coverage for each sample

Key parameters:

  • Uses all zarr stores specified in configuration (config.txt)

  • Allows sample metadata to be used for dynamic colour-based grouping, when available

Note

The Core Workflow does not include the --aggregate-by-group option, which could be included to aggregate sample data for technical replicates prior to QC.
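If you do want to aggregate technical replicates, the option could be added to the biological-qc call shown below. The sketch here passes the flag on its own; the exact argument form should be confirmed in the Biological QC documentation before use:

# Sketch only: biological-qc with sample aggregation (check the exact flag usage in the documentation)
modality -v biological-qc \
  --zarr-path "${ZARR_ARRAY[@]}" \
  --sample-sheet "$METADATA" \
  --aggregate-by-group \
  --order-by-group "$GROUP_COLUMN" \
  --output-dir "$RESULTS_DIR"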

#------------------------------------------------------------------------------
# STEP 1: BIOLOGICAL QC
#------------------------------------------------------------------------------
printf "\n\n======================================\n"
printf "STEP 1: RUNNING BIOLOGICAL QC ANALYSES\n"
printf "======================================\n"

# Build region argument if QC_REGION is set
REGION_ARG=""
[[ -n "$QC_REGION" ]] && REGION_ARG="--region $QC_REGION"

# Run QC without aggregation
log_info "Running biological QC without aggregation..."
modality -v biological-qc \
  --zarr-path "${ZARR_ARRAY[@]}" \
  --sample-sheet "$METADATA" \
  $REGION_ARG \
  --order-by-group "$GROUP_COLUMN" \
  --output-dir "$RESULTS_DIR"

# Copy QC reports to Reports directory
copy_reports "$RESULTS_DIR" "biological_qc_report*.html" "$REPORTS_DIR" "Biological QC" || exit 1

Expected outputs:

  • biological_qc_report.html: HTML report with sample-wise plots and statistics.

  • Provenance metadata .json files, stored in each Results/BioQC_YYYYMMDD_HHMMSS sub-directory that is generated by each modality XPLR CLI command.

Interpreting QC results:

  • Correlation heatmaps: Correlation coefficients close to 1 indicate high similarity of 5mC or 5hmC levels between samples

  • PCA plots: Show how samples separate and/or cluster by experimental group along the first two principal components, together with the % variability explained by each component, for 5mC or 5hmC.

  • Coverage statistics: The mean CpG coverage per sample is an indicator of the usable depth for methylation analysis. See the next step for feature extraction (region-based) coverage statistics.

  • Outlier detection: Samples that don’t cluster with their expected group may need investigation. Removing a sample from the metadata (sample sheet) file will exclude it from future analysis, without needing to modify the original Zarr file.

Step 2: Region-based Biological QC#

This step uses the modality get tool to extract methylation features (statistics) from user-specified genomic regions, for additional quality control and analysis. The regions can be defined in BED3+ format (.bed or .tsv) files that are accessible in the REGIONS_DIR defined in config.txt. For the Core Workflow script to run correctly, region files must be present. Pre-prepared BED files are provided to get started, or custom files following the same format can be used instead. Use region files that match the reference genome used for alignment by the duet software.

Tip

The Annotation (e.g. gene, or promoter) and Name (e.g. associated gene name) columns should be completed for meaningful interpretation. Regions without an Annotation value will be skipped.
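For illustration only, a small tab-separated region file might look like the following, with chromosome, start, and end followed by Annotation and Name columns (coordinates and names are hypothetical; see Region files in the main documentation for the exact required columns):

chr1	1000000	1005000	promoter	GENE_A
chr1	1005000	1050000	gene	GENE_A
chr2	2000000	2004000	promoter	GENE_B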

The pipeline performs multiple types of statistics extraction to provide comprehensive regional methylation data. Each statistic type is processed separately by modality XPLR, and the results are stored in sub-directories named according to the extraction type (e.g. Extract_count, Extract_mean, Extract_regional_frac).

Note

Only modality get mean and modality get regional-frac commands support plotting (correlation or violin), which are useful for visualising the distribution of methylation levels across samples. If no HTML report is generated, then the results will only be written to MAIN_OUTPUT/Results/Extract_* sub-directories as .tsv.gz files. HTML reports will be copied to the MAIN_OUTPUT/Reports sub-directory for easy access.

What this script block does:

  • Count extraction: Returns the total number of CpG contexts (total_c) for each region.

  • Mean calculation: Average coverage of CpG contexts (total_c) for each region, with violin and correlation plots for individual samples.

  • Regional fraction: Methylation fractions for 5mC and 5hmC at the pre-specified minimum DEPTH_FILTER value, with violin and correlation plots for individual samples.

  • Processes each region file in the REGIONS_DIR independently. Therefore, expect individual results for each region file and extraction type.

#------------------------------------------------------------------------------
# STEP 2: FEATURE EXTRACTION
#------------------------------------------------------------------------------
printf "\n\n==================================================\n"
printf "STEP 2: RUNNING FEATURE EXTRACTION ON REGION FILES\n"
printf "==================================================\n"

# Process each region file for feature extraction
for bedfile in "$REGIONS_DIR"/*.bed "$REGIONS_DIR"/*.bed.gz "$REGIONS_DIR"/*.tsv "$REGIONS_DIR"/*.tsv.gz; do
  # Skip if no files match pattern
  [[ -e "$bedfile" ]] || continue
  [[ "${METADATA##*.}" == "tsv" && $(realpath "$bedfile") == $(realpath "$METADATA") ]] && continue

  log_info "Processing region file: $bedfile"

  # Extract region file name without path and extension
  region_name=$(basename "$bedfile")
  region_name="${region_name%%.bed.gz}"
  region_name="${region_name%%.bed}"
  region_name="${region_name%%.tsv.gz}"
  region_name="${region_name%%.tsv}"

  # Set region-specific output directories
  region_output_dir="$RESULTS_DIR/$region_name"
  region_reports_dir="$REPORTS_DIR/$region_name"

  # 1. Extract count for total_c
  log_info "Running feature extraction: count total_c..."
  modality -v get count \
    --zarr-path "${ZARR_ARRAY[@]}" \
    --sample-sheet "$METADATA" \
    --bedfile "$bedfile" \
    --fields total_c \
    --output-dir "$region_output_dir"

  # 2. Extract mean for total_c with plots generated automatically
  log_info "Running feature extraction: mean total_c (plots generated automatically)..."
  modality -v get mean \
    --zarr-path "${ZARR_ARRAY[@]}" \
    --sample-sheet "$METADATA" \
    --bedfile "$bedfile" \
    --fields total_c \
    --order-by-group "$GROUP_COLUMN" \
    --output-dir "$region_output_dir"

  # Copy mean HTML reports immediately
  for extract_dir in "$region_output_dir"/Extract_*; do
    if [[ -d "$extract_dir" ]]; then
      find "$extract_dir" -type f -name "*mean_extract_report*.html" -exec cp {} "$region_reports_dir/" \; 2>/dev/null || true
    fi
  done

  # 3. Extract regional-frac for mc and hmc with depth filter (plots generated automatically)
  log_info "Running feature extraction: regional-frac for mc and hmc with minimum depth $DEPTH_FILTER (plots generated automatically)..."
  # Build the optional argument if DEPTH_FILTER is set
  MCV_ARG=""
  [[ -n "$DEPTH_FILTER" ]] && MCV_ARG="--min-coverage $DEPTH_FILTER"
  # --min-coverage 30, if DEPTH_FILTER=30 in config.txt

  modality -v get regional-frac \
    --zarr-path "${ZARR_ARRAY[@]}" \
    --sample-sheet "$METADATA" \
    --bedfile "$bedfile" \
    --fields mc hmc \
    $MCV_ARG \
    --order-by-group "$GROUP_COLUMN" \
    --output-dir "$region_output_dir"

  # Copy regional-frac HTML reports immediately
  for extract_dir in "$region_output_dir"/Extract_*; do
    if [[ -d "$extract_dir" ]]; then
      find "$extract_dir" -type f -name "*regional-frac_extract_report*.html" -exec cp {} "$region_reports_dir/" \; 2>/dev/null || true
    fi
  done

  log_info "Feature extraction completed for $region_name"
done

log_info "Feature extraction HTML reports copied to region-specific directories"

Expected outputs:

  • Extract_* directories: Organised by statistic type and region file

  • *.tsv.gz files: Compressed quantitative results for each extracted statistic, by region and sample

  • *.html reports: Interactive violin and correlation plots showing methylation statistics for mean and regional-frac extraction types.

Note

The optional regional dendrogram heatmap is not included in the Core Workflow script by default but can be added by modifying the modality get mean and/or modality get regional-frac command(s) to include --heatmap. Note that the heatmap will only be drawn if there are ≤200 regions in the region file, so it is not appropriate for use with all promoters and gene bodies. Instead, add a custom subset region file to the Regions directory before running the Core Workflow.
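As a sketch, the get mean call inside the region loop above, modified to enable the heatmap (all other arguments unchanged), would look like:

# Sketch only: get mean with the optional regional dendrogram heatmap (≤200 regions)
modality -v get mean \
  --zarr-path "${ZARR_ARRAY[@]}" \
  --sample-sheet "$METADATA" \
  --bedfile "$bedfile" \
  --fields total_c \
  --order-by-group "$GROUP_COLUMN" \
  --heatmap \
  --output-dir "$region_output_dir"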

Interpreting region-based Biological QC statistics:

  • Region-specific CpG coverage: Should show adequate and consistent coverage across samples and may be used to inform the DEPTH_FILTER value for future analyses.

  • Methylation fraction distributions: Violin and correlation plots reveal methylation heterogeneity.

  • Outlier samples: Samples with unusual distributions may need investigation.

Customising statistics extraction:

You can modify this section of the script to:

  • Add different field types (see Zarr store and Extracting methylation statistics sections in the main modality XPLR documentation).

  • Adjust minimum coverage thresholds for regional fraction extraction.

  • Include additional statistical summaries (count, mean, sum, regional-frac); see the sketch after this list.

  • Add an optional --heatmap parameter to generate regional dendrogram heatmaps for mean and/or regional-frac extraction types with ≤200 regions.
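For example, an extra summed-count extraction (a sketch, assuming modality get sum accepts the same arguments as the other get operations) could be added inside the region loop:

# Sketch only: extract summed total_c per region, following the same pattern as the other get commands
modality -v get sum \
  --zarr-path "${ZARR_ARRAY[@]}" \
  --sample-sheet "$METADATA" \
  --bedfile "$bedfile" \
  --fields total_c \
  --output-dir "$region_output_dir"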

Step 3: DMR calling#

This step identifies differentially methylated regions using statistical modelling, processing each region file individually. DMR calling is the core analytical component that identifies regions with significant methylation differences between two experimental groups.

What this script block does:

  • Processes each .bed and .tsv file in the regions directory separately

  • Performs statistical testing for differential methylation on each region file (see Calling differentially methylated regions (DMRs))

  • Analyses both 5mC and 5hmC modifications separately

  • Analyses all available condition pairs from the metadata, preserving the sample sheet order for the --condition-order parameter (ensure the control group is listed first in the metadata file). If Condition_Order is set in config.txt, only the specified conditions are analysed, with the left-most condition treated as the control group.

  • Applies multiple testing correction (FDR)

  • Incorporates covariates if specified

Note

The DMR calling does not enable --overdispersion for all analyses by default. For the Core Workflow, overdispersion correction is controlled by the Overdispersion parameter in the configuration file and can be set to “True” or “False”. When set to “True”, overdispersion correction is applied to all DMR analyses. The recommendation is to review DMR results prior to applying overdispersion correction, as this can lead to more conservative results. However, overdispersion correction may be useful when calling DMRs on large regions, such as gene bodies. For more information on when to use overdispersion, see Overdispersion correction.

Key analysis parameters:

  • Methylation contexts: Analyses both num_mc and num_hmc by default

  • Depth filtering: Should be applied according to the sequencing depth and methylation context coverage, and specified in the config.txt file

  • Covariate adjustment: Controls for confounding variables when specified in the config.txt file

#------------------------------------------------------------------------------
# STEP 3: DMR CALLING
#------------------------------------------------------------------------------
printf "\n\n===========================\n"
printf "STEP 3: RUNNING DMR CALLING\n"
printf "===========================\n"

# Build covariate / depth_filter / condition_order / overdispersion flags if present
COV_FLAGS=""
DEPTH_ARG=""
CONDITION_ARG=""
OVERDISPERSION_ARG=""

[[ -n "$DEPTH_FILTER" ]] && DEPTH_ARG="--filter-context-depth $DEPTH_FILTER"
[[ -n "$CONDITION_ORDER" ]] && CONDITION_ARG="--condition-order $CONDITION_ORDER"
[[ "$OVERDISPERSION" == "True" ]] && OVERDISPERSION_ARG="--overdispersion"

if [[ -n "$COVARIATES" ]]; then
  # Use space-separated covariates directly for --covariates flag
  COV_FLAGS="--covariates $COVARIATES"
fi

# Run DMR calling on each region file individually
for bedfile in "$REGIONS_DIR"/*.bed "$REGIONS_DIR"/*.bed.gz "$REGIONS_DIR"/*.tsv "$REGIONS_DIR"/*.tsv.gz; do
  # Skip if no files match pattern
  [[ -e "$bedfile" ]] || continue
  [[ "${METADATA##*.}" == "tsv" && $(realpath "$bedfile") == $(realpath "$METADATA") ]] && continue

  log_info "Running DMR calling on region file: $bedfile"

  # Extract region file name without path and extension
  region_name=$(basename "$bedfile")
  region_name="${region_name%%.bed.gz}"
  region_name="${region_name%%.bed}"
  region_name="${region_name%%.tsv.gz}"
  region_name="${region_name%%.tsv}"

  # Set region-specific output directory
  region_output_dir="$RESULTS_DIR/$region_name"

  # Run DMR calling
  # --covariates cov1 cov2, if COVARIATES="cov1 cov2" in config.txt
  # --filter-context-depth 15, if DEPTH_FILTER=15 in config.txt
  # --condition-order condition1,condition2, if CONDITION_ORDER="condition1,condition2" in config.txt
  # --overdispersion is applied if OVERDISPERSION="True" in config.txt

  modality -v dmr call \
    --zarr-path "${ZARR_ARRAY[@]}" \
    --sample-sheet "$METADATA" \
    --condition-array-name "$GROUP_COLUMN" \
    --methylation-contexts num_mc num_hmc \
    --bedfile "$bedfile" \
    $COV_FLAGS \
    $DEPTH_ARG \
    $CONDITION_ARG \
    $OVERDISPERSION_ARG \
    --output-dir "$region_output_dir"

  log_info "DMR calling completed for $bedfile"
done

log_info "All DMR calling completed"

Expected outputs:

  • DMR_* directory: All DMR results (for all group pairs and contexts) are stored in a single directory with a timestamp

  • *.bed files: DMR results in BED3+ format for each condition pair and methylation context

  • *.ini files: Configuration files for genomic visualisation with modality tracks (not included as part of the Core Workflow). One .ini file is generated for each DMR .bed result file

Step 4: DMR visualisation#

The final step generates visualisations for DMR results. This includes both unfiltered and filtered plots to help identify the most significant regions.

What this script block does:

  • Creates HTML reports for each DMR result file (condition pair) and methylation context (5mC and 5hmC)

  • HTML reports contain a volcano plot (modification difference vs -log10 q-value) and a p-value distribution histogram

  • HTML reports are generated for both unfiltered DMR results, and filtered DMR results based on a q-value threshold of ≤ 0.05

  • When filtering is applied, a new filtered, compressed DMR results file (.bed.gz) is also generated

Important

The filtering of DMR results is optional and can be removed by omitting the --filter-qvalue parameter in the modality dmr plot command below. Alternatively, the q-value threshold can be modified, or the additional --filter-mod-difference parameter can be added to filter based on the magnitude of methylation difference.

It is recommended to review the DMR results prior to applying filtering to determine the most appropriate thresholds, or to assess whether the analysis would benefit from overdispersion correction (if not already applied).
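For example, a sketch of the filtered plotting call in the loop below, combining a q-value threshold with a minimum modification difference (threshold values are illustrative and should be chosen after reviewing the unfiltered results):

# Sketch only: filter DMRs on both q-value and modification difference
modality dmr plot \
  --dmr-results "$dmrresults" \
  --filter-qvalue 0.05 \
  --filter-mod-difference 0.1 \
  --output-dir "$region_dir"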

#------------------------------------------------------------------------------
# STEP 4: DMR VISUALISATION
#------------------------------------------------------------------------------
printf "\n\n=====================================\n"
printf "STEP 4: GENERATING DMR VISUALISATIONS\n"
printf "=====================================\n"

# Loop through each region directory
for region_dir in "$RESULTS_DIR"/*; do
  [[ -d "$region_dir" ]] || continue

  region_name=$(basename "$region_dir")
  # Skip if this is not a region directory (e.g., if it's a BioQC directory)
  [[ "$region_name" == "BioQC_"* ]] && continue

  log_info "Processing DMR visualisation for region: $region_name"

  # Set region-specific reports directory
  region_reports_dir="$REPORTS_DIR/$region_name"

  # Loop through each DMR results directory within this region
  for dmr_dir in "$region_dir"/DMR_*; do
    [[ -d "$dmr_dir" ]] || continue

    # For each .bed file in the DMR directory
    for dmrresults in "$dmr_dir"/*.bed; do
      [[ -e "$dmrresults" ]] || continue

      log_info "Generating DMR plots for $dmrresults..."

      # Generate default plots - output to the same region directory
      modality -v dmr plot \
        --dmr-results "$dmrresults" \
        --output-dir "$region_dir"

      # Copy unfiltered DMR reports immediately
      for dmr_report_dir in "$region_dir"/DMR_Report_*; do
        if [[ -d "$dmr_report_dir" ]]; then
          find "$dmr_report_dir" -type f -name "*.html" -not -name "*max_q_*" -exec cp {} "$region_reports_dir/" \; 2>/dev/null || true
        fi
      done

      # Generate filtered plots (q-value < 0.05) - output to the same region directory
      modality dmr plot \
        --dmr-results "$dmrresults" \
        --filter-qvalue 0.05 \
        --output-dir "$region_dir"

      # Copy filtered DMR reports immediately
      for dmr_report_dir in "$region_dir"/DMR_Report_*; do
        if [[ -d "$dmr_report_dir" ]]; then
          find "$dmr_report_dir" -type f -name "*max_q_*.html" -exec cp {} "$region_reports_dir/" \; 2>/dev/null || true
        fi
      done

      log_info "DMR plots generated for $dmrresults"
    done
  done

  log_info "DMR visualisation completed for region: $region_name"
done

log_info "DMR HTML reports copied to region-specific directories"

Interpreting DMR visualisations:

Volcano plots:

  • Y-axis: -log10(q-value) - higher values indicate more significant DMRs

  • X-axis: Methylation modification difference - magnitude of effect

  • Colour coding: Applied if multiple Annotation values are present in the supplied region files (REGIONS_DIR) e.g. promoter, enhancer, etc.

  • Significant regions: Upper left/right quadrants (high significance, large effect)

  • Hyper- vs Hypo-methylation: negative modification difference (hypo) vs positive modification difference (hyper) on the X-axis

p-value histograms:

  • Uniform distribution: Indicates good statistical modelling

  • Enrichment near zero: Suggests presence of true differential methylation, but over-representation may indicate the analysis would benefit from overdispersion correction. See Overdispersion correction for more information.

  • Unusual patterns: May indicate technical issues or model violations

Quality assessment:

  • Volcano plot shape: Expect to see a V-shape with most data points clustered near the base and centre. Significant hits may appear as outliers on one or both sides of the x-axis, with high values on the y-axis. The data distribution may or may not be symmetrical, depending on the extent of hyper- and hypo-methylation

  • Effect size distribution: Biologically meaningful differences (typically modification difference > 0.1-0.2)

  • Annotation patterns: Different region (Annotation) types may show distinct methylation patterns

Advanced customisation and extension#

The Core Workflow is designed to be easily adapted for different research needs. Here are common modifications and extensions:

Customising analysis parameters#

Adjusting coverage thresholds:

# In the config.txt or script
DEPTH_FILTER=15  # Increase for higher quality regions

# Or modify specific commands directly using the --min-coverage option
# (a custom minimum coverage of 15 is shown below)
modality get regional-frac \
  --zarr-path "${ZARR_ARRAY[@]}" \
  --bedfile "$bedfile" \
  --fields mc hmc \
  --min-coverage 15 \
  --output-dir "$RESULTS_DIR"

Adding different methylation contexts:

# Include modC analysis instead of 5mC and 5hmC.
modality dmr call \
  --methylation-contexts modc \
  # ... other parameters

Add overdispersion correction to DMR calling:

# Add additional DMR calling options
modality dmr call \
  --overdispersion \
  # ... other parameters

Performance optimisation tips#

For large datasets:

  • Use subset regions for initial testing

  • Implement parallel processing for independent steps (see the sketch after this list)

  • Monitor disk space and memory usage

  • Consider using HPC environments for large-scale analysis
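For example, a minimal sketch of parallelising the per-region DMR calling with background jobs (assuming the region files are independent and your system has enough memory to run several modality XPLR commands concurrently):

# Sketch only: run DMR calling for each region file as a background job, then wait for all to finish
for bedfile in "$REGIONS_DIR"/*.bed "$REGIONS_DIR"/*.bed.gz; do
  [[ -e "$bedfile" ]] || continue
  region_name=$(basename "$bedfile")
  region_name="${region_name%%.bed.gz}"
  region_name="${region_name%%.bed}"
  (
    modality -v dmr call \
      --zarr-path "${ZARR_ARRAY[@]}" \
      --sample-sheet "$METADATA" \
      --condition-array-name "$GROUP_COLUMN" \
      --methylation-contexts num_mc num_hmc \
      --bedfile "$bedfile" \
      --output-dir "$RESULTS_DIR/$region_name"
  ) &   # Launch in the background
done
wait    # Block until all background DMR jobs have completed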

For production use:

  • Add comprehensive logging and monitoring

  • Implement checkpointing for long-running analyses

  • Add email notifications for completion/failure

  • Include resource usage reporting

Script completion and monitoring#

The script includes basic completion reporting and performance tracking:

#------------------------------------------------------------------------------
# COMPLETION
#------------------------------------------------------------------------------
log_info "Pipeline completed successfully. Reports are available in $REPORTS_DIR and region-specific subdirectories"

# Calculate and report duration
end_time=$(date +%s)
duration=$((end_time - start_time))
log_info "Script completed in $duration seconds"

# Exit cleanly
exit 0

Monitoring features:

  • Execution time tracking: Reports total pipeline duration

  • Success confirmation: Clear indication of successful completion

  • Output location reporting: Directs users to results

  • Clean exit handling: Proper script termination

Monitoring execution:

The script provides real-time logging output. You can also redirect output to a log file using the code below, or code it into the script as described in Script setup and configuration.

./core-workflow-v1.3.sh 2>&1 | tee workflow.log

Expected runtime:

Runtime varies significantly based on:

  • Dataset size: Number of samples and genomic regions

  • Analysis complexity: Number of methylation contexts and covariates

  • System resources: Available CPU and memory

  • Storage speed: Local vs network storage performance

Typical runtimes:

  • Small dataset (10 samples, 1000 regions): 10-30 minutes

  • Medium dataset (50 samples, 5000 regions): 1-3 hours

  • Large dataset (100+ samples, 10000+ regions): 3-8 hours

This modular workflow provides a comprehensive foundation for methylation analysis using the modality XPLR toolkit. By understanding each component, you can adapt it to meet your specific research requirements while maintaining robust error handling and organised outputs.

Warning

Version compatibility: This pipeline is designed for modality XPLR version 1.0.0 and later. Some features may not be available in earlier versions. Check your modality XPLR version with modality --version before running.

Tip

Getting help: If you encounter issues, first try enabling debug mode (DEBUG=true) for detailed logging. The modular design allows you to run individual sections for troubleshooting specific steps.