modalysis

modalysis is a pipeline-oriented toolkit for methylation and DMR analysis. It exposes a CLI, a FastAPI server, and reusable Python modules.

All user-facing operations run through this stack:

CLI parser -> CLI handler -> HTTP client -> FastAPI server -> core function

Prerequisites

Python 3.13+
uv

Install dependencies:

uv sync

Required Input Types

The pipeline expects these input categories:

GFF annotation file (for gene coordinates and descriptions)
Pileup .bed files (per sample/modification)
DMR .bed files
Expression TSV files (GENE_ID<TAB>STATUS, such as UP, DOWN, NDE)
Allowed chromosomes file (one chromosome name per line)

Output Types

Tabular command outputs: .modalysis (TSV)
Plot command outputs: .png
Optional dmr gene-counts --output-excel: .xlsx

Recommended Pipeline Order

Run the commands in this order so that each downstream stage has the inputs it requires:

Start server: modalysis server
Format GFF: modalysis gff format
Annotate GFF with expression labels: modalysis gff annotate
Format each pileup file: modalysis pileup format
Merge pileups per manifestation/modification: modalysis pileup merge
Format each DMR file: modalysis dmr format
Annotate DMRs with gene regions: modalysis dmr annotate
Aggregate DMR results:
- modalysis dmr gene-counts
- modalysis dmr common-genes
Generate plots as needed:
- modalysis plot mean-methylation
- modalysis plot gene-heatmap
- modalysis plot dmr-dotplot
- modalysis plot common-genes-venn

Command Reference

Default server port: 8000.

`modalysis server`

Purpose: Start the FastAPI server used by all analysis commands.

Algorithm:

Launches fastapi run (or fastapi dev with --dev) against src/modalysis/server/main.py.

Usage:

uv run modalysis server [--port 8000] [--dev]

Parameters:

Flag	Required	Default	Description
`--port`	No	`8000`	Server port.
`--dev`	No	`False`	Enables autoreload development mode.

Output:

A running HTTP server; no .modalysis file is written.

`modalysis gff format`

Purpose: Normalize a raw GFF into the pipeline’s compact .modalysis gene table.

Algorithm:

Reads TSV rows from the source GFF.
Keeps only rows with exactly 9 columns.
Keeps only protein_coding_gene features.
Filters to chromosomes present in --allowed-chromosomes.
Converts start coordinate to zero-based (start - 1).
Extracts ID and description from attributes.
Writes columns: CHROMOSOME, START, END, GENE_ID, DESCRIPTION.
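
The steps above can be sketched in a few lines of Python. This is a minimal illustration and not the project's implementation: the function name `format_gff_rows` is hypothetical, and it assumes GFF3-style `key=value;` attributes with a `description` key.

```python
import csv
import io


def format_gff_rows(gff_text, allowed_chromosomes):
    """Yield (CHROMOSOME, START, END, GENE_ID, DESCRIPTION) tuples.

    Hypothetical helper; attribute keys "ID" and "description" are assumed.
    """
    reader = csv.reader(io.StringIO(gff_text), delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        # Header/comment lines and malformed rows do not have 9 columns.
        if len(row) != 9:
            continue
        chrom, _source, feature, start, end, _score, _strand, _phase, attrs = row
        if feature != "protein_coding_gene" or chrom not in allowed_chromosomes:
            continue
        # Split "key=value;key=value" into a dict, ignoring empty fragments.
        fields = dict(part.split("=", 1) for part in attrs.split(";") if "=" in part)
        # GFF starts are 1-based; the output uses zero-based starts.
        yield (
            chrom,
            int(start) - 1,
            int(end),
            fields.get("ID", ""),
            fields.get("description", ""),
        )
```

Only the start coordinate is shifted, which turns a 1-based closed interval into a 0-based half-open one.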

Usage:

uv run modalysis gff format \
  --input-path /path/to/input.gff \
  --output-path /path/to/output_dir \
  --output-name formatted_gff \
  --allowed-chromosomes /path/to/allowed_chromosomes.txt \
  [--port 8000]

Parameters:

Flag	Required	Default	Description
`--input-path`	Yes	-	Input GFF path.
`--output-path`	Yes	-	Output directory.
`--output-name`	Yes	-	Output basename (`.modalysis` appended).
`--allowed-chromosomes`	Yes	-	File with one valid chromosome per line.
`--port`	No	`8000`	Server port.

Output:

/path/to/output_dir/formatted_gff.modalysis

`modalysis gff annotate`

Purpose: Attach expression status labels to each GFF gene row.

Algorithm:

Loads each expression TSV as {GENE_ID -> STATUS}.
For every gene in formatted GFF, looks up each expression source.
Writes joined annotations like LABEL: VALUE; LABEL2: VALUE2 into EXPRESSION.
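
As a rough sketch of how the EXPRESSION value could be assembled, the snippet below joins one `LABEL: VALUE` pair per expression source. The helper names are hypothetical, and the `NA` placeholder for genes missing from a source is an assumption; the actual tool may handle missing genes differently.

```python
def load_expression(tsv_text):
    """Parse GENE_ID<TAB>STATUS lines into a {GENE_ID: STATUS} dict."""
    mapping = {}
    for line in tsv_text.splitlines():
        parts = line.split("\t")
        if len(parts) == 2:
            mapping[parts[0]] = parts[1]
    return mapping


def expression_field(gene_id, labels, sources, missing="NA"):
    """Build 'LABEL: VALUE; LABEL2: VALUE2' for one gene.

    `missing` is an assumed placeholder for genes absent from a source.
    """
    return "; ".join(
        f"{label}: {source.get(gene_id, missing)}"
        for label, source in zip(labels, sources)
    )
```

Labels and sources are paired by position, which is why `--expression-labels` must follow the same order as `--expression-paths`.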

Usage:

uv run modalysis gff annotate \
  --gff-path /path/to/formatted_gff.modalysis \
  --expression-paths /path/to/expr_a.tsv /path/to/expr_b.tsv \
  --expression-labels tissue_a tissue_b \
  --output-path /path/to/output_dir \
  --output-name annotated_gff \
  [--port 8000]

Parameters:

Flag	Required	Default	Description
`--gff-path`	Yes	-	Formatted GFF `.modalysis` path.
`--expression-paths`	Yes	-	One or more expression TSV files.
`--expression-labels`	Yes	-	Label per expression file (same order).
`--output-path`	Yes	-	Output directory.
`--output-name`	Yes	-	Output basename.
`--port`	No	`8000`	Server port.

Output:

/path/to/output_dir/annotated_gff.modalysis with added EXPRESSION column.

`modalysis pileup format`

Purpose: Normalize raw pileup records into a minimal .modalysis representation.

Algorithm:

Reads raw pileup rows.
Keeps only rows with exactly 18 columns.
Filters by allowed chromosomes.
Extracts columns for genomic key and counts.
Writes columns: CHROMOSOME, START, END, MODIFICATION, N_VALID_COV, N_MOD.
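
A possible shape for this step is shown below. The source only states that rows have 18 columns; the column positions used here follow a bedMethyl-style layout and are an assumption, as is the function name `format_pileup_lines`.

```python
# Assumed 0-based column positions (bedMethyl-style layout); not confirmed by the source.
COL_CHROM, COL_START, COL_END, COL_MOD = 0, 1, 2, 3
COL_N_VALID_COV, COL_N_MOD = 9, 11


def format_pileup_lines(lines, allowed_chromosomes):
    """Yield (CHROMOSOME, START, END, MODIFICATION, N_VALID_COV, N_MOD) tuples."""
    for line in lines:
        cols = line.rstrip("\n").split("\t")
        # Keep only rows with exactly 18 columns.
        if len(cols) != 18:
            continue
        if cols[COL_CHROM] not in allowed_chromosomes:
            continue
        yield (
            cols[COL_CHROM],
            int(cols[COL_START]),
            int(cols[COL_END]),
            cols[COL_MOD],
            int(cols[COL_N_VALID_COV]),
            int(cols[COL_N_MOD]),
        )
```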

Usage:

uv run modalysis pileup format \
  --input-path /path/to/raw_pileup.bed \
  --output-path /path/to/output_dir \
  --output-name sample_mod \
  --allowed-chromosomes /path/to/allowed_chromosomes.txt \
  [--port 8000]

Parameters:

Flag	Required	Default	Description
`--input-path`	Yes	-	Raw pileup file path.
`--output-path`	Yes	-	Output directory.
`--output-name`	Yes	-	Output basename.
`--allowed-chromosomes`	Yes	-	File with one valid chromosome per line.
`--port`	No	`8000`	Server port.

Output:

/path/to/output_dir/sample_mod.modalysis

`modalysis pileup merge`

Purpose: Aggregate multiple formatted pileup files by genomic key.

Algorithm:

Uses key (CHROMOSOME, START, END, MODIFICATION).
Sums N_VALID_COV and N_MOD across files.
Tracks how many files each key appears in.
Filters keys using:
- minimum file count (--min-files)
- minimum file coverage percentage (--min-file-coverage)
- minimum total reads (--min-reads)
Writes merged rows with TOTAL_FILES and N_FILES.
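
The merge logic can be illustrated with a small in-memory sketch. It assumes all three thresholds are inclusive (`>=`) and that the output column order is key, summed counts, TOTAL_FILES, N_FILES; neither detail is confirmed by the source, and `merge_pileups` is a hypothetical name.

```python
from collections import defaultdict


def merge_pileups(files, min_files=2, min_file_coverage=50.0, min_reads=5):
    """Merge formatted pileup rows from several files.

    files: list of row lists; each row is
    (CHROMOSOME, START, END, MODIFICATION, N_VALID_COV, N_MOD).
    """
    # Per key: [summed N_VALID_COV, summed N_MOD, number of files containing the key]
    totals = defaultdict(lambda: [0, 0, 0])
    for rows in files:
        seen = set()
        for chrom, start, end, mod, n_valid, n_mod in rows:
            key = (chrom, start, end, mod)
            totals[key][0] += n_valid
            totals[key][1] += n_mod
            # Count each file at most once per key.
            if key not in seen:
                seen.add(key)
                totals[key][2] += 1

    total_files = len(files)
    merged = []
    for key in sorted(totals):
        n_valid, n_mod, n_files = totals[key]
        coverage = 100.0 * n_files / total_files
        if n_files >= min_files and coverage >= min_file_coverage and n_valid >= min_reads:
            merged.append((*key, n_valid, n_mod, total_files, n_files))
    return merged
```

With two input files and the default thresholds, a key must appear in both files to be kept, because `--min-files 2` is stricter than 50% coverage in that case.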

Usage:

uv run modalysis pileup merge \
  --pileup-paths /path/to/a.modalysis /path/to/b.modalysis \
  --output-path /path/to/output_dir \
  --output-name merged_mod \
  [--min-files 2] \
  [--min-file-coverage 50.0] \
  [--min-reads 5] \
  [--port 8000]

Parameters:

Flag	Required	Default	Description
`--pileup-paths`	Yes	-	Formatted pileup inputs to merge.
`--output-path`	Yes	-	Output directory.
`--output-name`	Yes	-	Output basename.
`--min-files`	No	`2`	Minimum files containing key.
`--min-file-coverage`	No	`50.0`	Minimum `%` of files containing key.
`--min-reads`	No	`5`	Minimum summed `N_VALID_COV`.
`--port`	No	`8000`	Server port.

Output:

/path/to/output_dir/merged_mod.modalysis

`modalysis dmr format`

Purpose: Filter and normalize raw DMR rows into a consistent .modalysis table.

Algorithm:

Reads raw DMR rows.
Keeps only rows with exactly 23 columns.
Filters by allowed chromosomes.
Applies thresholds on score, p-value, sample percentages, and read counts.
Writes retained rows with columns: CHROMOSOME, START, END, SCORE, MAP_BASED_P_VALUE, EFFECT_SIZE, PCT_A_SAMPLES, PCT_B_SAMPLES.
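
The threshold step might look like the predicate below. The defaults mirror the documented flag defaults; the dictionary keys (`score`, `p_value`, `reads_a`, and so on) are hypothetical names for already-parsed fields, since the source does not list the raw column layout.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DmrThresholds:
    """Defaults copied from the documented CLI flags."""
    min_score: float = 5
    max_p_value: float = 0.05
    min_pct_a_samples: float = 50.0
    min_pct_b_samples: float = 50.0
    min_reads: int = 5


def keep_dmr(row, thresholds=DmrThresholds()):
    """Return True if a parsed DMR row passes every threshold.

    `row` is a dict with hypothetical keys; read counts must pass in both groups.
    """
    t = thresholds
    return (
        row["score"] >= t.min_score
        and row["p_value"] <= t.max_p_value
        and row["pct_a_samples"] >= t.min_pct_a_samples
        and row["pct_b_samples"] >= t.min_pct_b_samples
        and row["reads_a"] >= t.min_reads
        and row["reads_b"] >= t.min_reads
    )
```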

Usage:

uv run modalysis dmr format \
  --input-path /path/to/raw_dmr.bed \
  --output-path /path/to/output_dir \
  --output-name dmr_formatted \
  --allowed-chromosomes /path/to/allowed_chromosomes.txt \
  [--min-score 5] \
  [--max-p-value 0.05] \
  [--min-pct-a-samples 50.0] \
  [--min-pct-b-samples 50.0] \
  [--min-reads 5] \
  [--port 8000]

Parameters:

Flag	Required	Default	Description
`--input-path`	Yes	-	Raw DMR file path.
`--output-path`	Yes	-	Output directory.
`--output-name`	Yes	-	Output basename.
`--allowed-chromosomes`	Yes	-	File with one valid chromosome per line.
`--min-score`	No	`5`	Keep rows with score >= this value.
`--max-p-value`	No	`0.05`	Keep rows with p-value <= this value.
`--min-pct-a-samples`	No	`50.0`	Minimum `%` A-group samples.
`--min-pct-b-samples`	No	`50.0`	Minimum `%` B-group samples.
`--min-reads`	No	`5`	Minimum read count in both groups.
`--port`	No	`8000`	Server port.

Output:

/path/to/output_dir/dmr_formatted.modalysis

`modalysis dmr annotate`

Purpose: Annotate each formatted DMR interval with overlapping gene regions.

Algorithm:

Parses formatted GFF genes by chromosome.
Builds promoter/body/enhancer regions per gene.
For each DMR interval, finds overlapping genes in each region.
Appends columns PROMOTER, BODY, ENHANCER (comma-separated gene IDs).
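
To make the overlap step concrete, here is a simplified sketch. It assumes the promoter is a fixed window before the gene start and the enhancer a fixed window after the gene end, ignores strand, and uses a hypothetical 2000 bp window; the source does not state the actual region definitions.

```python
def build_regions(start, end, promoter_bp=2000, enhancer_bp=2000):
    """Return half-open (start, end) intervals per region (window sizes assumed)."""
    return {
        "PROMOTER": (max(0, start - promoter_bp), start),
        "BODY": (start, end),
        "ENHANCER": (end, end + enhancer_bp),
    }


def overlaps(a_start, a_end, b_start, b_end):
    """True if two half-open intervals share at least one base."""
    return a_start < b_end and b_start < a_end


def annotate_dmr(dmr, genes):
    """Map each region name to a comma-separated list of overlapping gene IDs.

    dmr: (chromosome, start, end); genes: iterable of (chromosome, start, end, gene_id).
    """
    chrom, d_start, d_end = dmr
    hits = {"PROMOTER": [], "BODY": [], "ENHANCER": []}
    for g_chrom, g_start, g_end, gene_id in genes:
        if g_chrom != chrom:
            continue
        for region, (r_start, r_end) in build_regions(g_start, g_end).items():
            if overlaps(d_start, d_end, r_start, r_end):
                hits[region].append(gene_id)
    return {region: ",".join(ids) for region, ids in hits.items()}
```

A single DMR can land in several regions at once, for example the promoter of one gene and the enhancer of its neighbor.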

Usage:

uv run modalysis dmr annotate \
  --dmr-path /path/to/dmr_formatted.modalysis \
  --gff-path /path/to/formatted_gff.modalysis \
  --output-path /path/to/output_dir \
  --output-name dmr_annotated \
  [--port 8000]

Parameters:

Flag	Required	Default	Description
`--dmr-path`	Yes	-	Formatted DMR `.modalysis` path.
`--gff-path`	Yes	-	Formatted GFF `.modalysis` path.
`--output-path`	Yes	-	Output directory.
`--output-name`	Yes	-	Output basename.
`--port`	No	`8000`	Server port.

Output:

/path/to/output_dir/dmr_annotated.modalysis

`modalysis dmr gene-counts`

Purpose: Count unique genes by manifestation, expression profile, effect sign, modification, and region.

Algorithm:

Validates that paired list arguments have compatible lengths.
Loads expression mapping from annotated GFF EXPRESSION field.
Reads annotated DMR files and groups genes by: (manifestation, expression_profile, effect_sign, modification, region).
Uses unique gene sets to avoid duplicate counts.
Writes TSV summary rows.
Optional: writes grouped-header Excel workbook (--output-excel).
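
The grouping idea can be sketched as follows. The record shape, the `NA` fallback for genes without an expression status, and the function names are assumptions made for illustration; the `NEGATIVE` / `NON_NEGATIVE` labels match the values accepted by `--effect-signs` elsewhere in this document.

```python
from collections import defaultdict


def effect_sign(effect_size):
    """Classify an effect size using the labels the CLI exposes."""
    return "NEGATIVE" if effect_size < 0 else "NON_NEGATIVE"


def count_genes(records, expression):
    """Count unique genes per group.

    records: iterable of (manifestation, modification, effect_size, region, gene_ids).
    expression: {(manifestation, gene_id): status}; missing entries become "NA" (assumed).
    """
    groups = defaultdict(set)
    for manifestation, modification, effect_size, region, gene_ids in records:
        sign = effect_sign(effect_size)
        for gene_id in gene_ids:
            profile = expression.get((manifestation, gene_id), "NA")
            # Sets deduplicate genes hit by more than one DMR.
            groups[(manifestation, profile, sign, modification, region)].add(gene_id)
    return {key: len(genes) for key, genes in groups.items()}
```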

Usage:

uv run modalysis dmr gene-counts \
  --annotated-dmr-paths /path/to/a.modalysis /path/to/b.modalysis \
  --manifestations M1 M1 \
  --modifications 5MC 5MC_5HMC \
  --manifestation-labels M1 \
  --expression-labels tissue_1 \
  --annotated-gff-path /path/to/gff_annotated.modalysis \
  --output-path /path/to/output_dir \
  --output-name gene_counts \
  [--output-excel] \
  [--port 8000]

Parameters:

Flag	Required	Default	Description
`--annotated-dmr-paths`	Yes	-	Annotated DMR inputs.
`--manifestations`	Yes	-	Manifestation label per DMR input.
`--modifications`	Yes	-	Modification label per DMR input.
`--manifestation-labels`	Yes	-	Canonical manifestation labels used for expression matching.
`--expression-labels`	Yes	-	Expression labels mapped to `--manifestation-labels` in order.
`--annotated-gff-path`	Yes	-	Annotated GFF with `EXPRESSION` column.
`--output-path`	Yes	-	Output directory.
`--output-name`	Yes	-	Output basename.
`--output-excel`	No	`False`	Also write `.xlsx` workbook.
`--port`	No	`8000`	Server port.

Output:

/path/to/output_dir/gene_counts.modalysis
Optional: /path/to/output_dir/gene_counts.xlsx

`modalysis dmr common-genes`

Purpose: Find genes shared between two modifications for each manifestation and region.

Algorithm:

Validates list lengths and that modification A/B differ.
Loads gene expression status from annotated GFF.
From annotated DMRs, collects genes from negative effect-size rows only.
For each manifestation and region, computes set intersection between modification A and B.
Writes summary rows and per-gene rows including expression status.
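
A compact sketch of the intersection step is given below. The record shape and the name `common_genes` are hypothetical; the error message matches the one listed under Troubleshooting.

```python
from collections import defaultdict


def common_genes(records, modification_a, modification_b):
    """Intersect gene sets of two modifications per (manifestation, region).

    records: iterable of (manifestation, modification, effect_size, region, gene_ids).
    Only rows with a negative effect size contribute genes.
    """
    if modification_a == modification_b:
        raise ValueError("Modification A and B must be different")

    genes = defaultdict(set)
    for manifestation, modification, effect_size, region, gene_ids in records:
        if effect_size < 0:
            genes[(manifestation, modification, region)].update(gene_ids)

    pairs = {(manifestation, region) for manifestation, _mod, region in genes}
    return {
        (manifestation, region): (
            genes.get((manifestation, modification_a, region), set())
            & genes.get((manifestation, modification_b, region), set())
        )
        for manifestation, region in sorted(pairs)
    }
```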

Usage:

uv run modalysis dmr common-genes \
  --annotated-dmr-paths /path/to/a.modalysis /path/to/b.modalysis \
  --manifestations M1 M1 \
  --modifications 5MC 5MC_5HMC \
  --manifestation-labels M1 \
  --expression-labels tissue_1 \
  --modification-a 5MC \
  --modification-b 5MC_5HMC \
  --annotated-gff-path /path/to/gff_annotated.modalysis \
  --output-path /path/to/output_dir \
  --output-name common_genes \
  [--port 8000]

Parameters:

Flag	Required	Default	Description
`--annotated-dmr-paths`	Yes	-	Annotated DMR inputs.
`--manifestations`	Yes	-	Manifestation label per DMR input.
`--modifications`	Yes	-	Modification label per DMR input.
`--manifestation-labels`	Yes	-	Canonical manifestation labels used for expression matching.
`--expression-labels`	Yes	-	Expression labels mapped to manifestations by order.
`--modification-a`	Yes	-	First modification for intersection.
`--modification-b`	Yes	-	Second modification for intersection.
`--annotated-gff-path`	Yes	-	Annotated GFF with expression data.
`--output-path`	Yes	-	Output directory.
`--output-name`	Yes	-	Output basename.
`--port`	No	`8000`	Server port.

Output:

/path/to/output_dir/common_genes.modalysis

`modalysis plot mean-methylation`

Purpose: Plot mean methylation by chromosome, grouped by region (promoter/body/enhancer).

Algorithm:

Builds gene regions from formatted GFF.
For each merged pileup file, accumulates N_MOD / N_VALID_COV by chromosome and region.
Draws line plots across region-partitioned x-axis.
Supports optional chromosome ordering and custom title.

Usage:

uv run modalysis plot mean-methylation \
  --gff-path /path/to/formatted_gff.modalysis \
  --merged-pileup-paths /path/to/m1.modalysis /path/to/m2.modalysis \
  --labels 5MC 5MC_5HMC \
  --output-path /path/to/output_dir \
  --output-name mean_methylation \
  [--y-min 0.0] \
  [--y-max 0.1] \
  [--chromosome-order-path /path/to/order.txt] \
  [--plot-title "Custom Title"] \
  [--port 8000]

Parameters:

Flag	Required	Default	Description
`--gff-path`	Yes	-	Formatted GFF `.modalysis`.
`--merged-pileup-paths`	Yes	-	One or more merged pileup `.modalysis` files.
`--labels`	Yes	-	Display label per merged pileup path.
`--output-path`	Yes	-	Output directory.
`--output-name`	Yes	-	Output basename (`.png`).
`--y-min`	No	`0.0`	Y-axis lower bound.
`--y-max`	No	`0.1`	Y-axis upper bound.
`--chromosome-order-path`	No	`None`	Optional chromosome ordering file.
`--plot-title`	No	`None`	Optional plot title override.
`--port`	No	`8000`	Server port.

Output:

/path/to/output_dir/mean_methylation.png

`modalysis plot gene-heatmap`

Purpose: Generate gene-level heatmaps for manifestation/expression/effect-sign/modification combinations.

Algorithm:

Builds manifestation->expression label mapping.
Loads per-gene expression from annotated GFF.
Collects genes per combination from annotated DMRs.
Accumulates per-gene methylation means from merged pileups.
Renders one heatmap per non-empty combination with shared color scale.

Usage:

uv run modalysis plot gene-heatmap \
  --annotated-dmr-paths /path/to/dmr1.modalysis /path/to/dmr2.modalysis \
  --manifestations M1 M1 \
  --modifications 5MC 5MC_5HMC \
  --manifestation-labels M1 \
  --expression-labels tissue_1 \
  --annotated-gff-path /path/to/gff_annotated.modalysis \
  --gff-path /path/to/formatted_gff.modalysis \
  --merged-pileup-paths /path/to/p1.modalysis /path/to/p2.modalysis \
  --pileup-manifestations M1 M1 \
  --pileup-modifications 5MC 5MC_5HMC \
  --output-path /path/to/output_dir \
  --output-name heatmap \
  [--show-gene-labels] \
  [--effect-signs NEGATIVE NON_NEGATIVE] \
  [--port 8000]

Parameters:

Flag	Required	Default	Description
`--annotated-dmr-paths`	Yes	-	Annotated DMR inputs.
`--manifestations`	Yes	-	Manifestation label per DMR input.
`--modifications`	Yes	-	Modification label per DMR input.
`--manifestation-labels`	Yes	-	Canonical manifestation labels.
`--expression-labels`	Yes	-	Expression labels aligned to manifestation labels.
`--annotated-gff-path`	Yes	-	Annotated GFF with `EXPRESSION`.
`--gff-path`	Yes	-	Formatted GFF for gene coordinates.
`--merged-pileup-paths`	Yes	-	Merged pileup inputs.
`--pileup-manifestations`	Yes	-	Manifestation label per merged pileup path.
`--pileup-modifications`	Yes	-	Modification label per merged pileup path.
`--output-path`	Yes	-	Output directory.
`--output-name`	Yes	-	Output prefix (`<prefix>_<combo>.png`).
`--show-gene-labels`	No	`False`	Show gene IDs on y-axis.
`--effect-signs`	No	both	Restrict to `NEGATIVE` and/or `NON_NEGATIVE`.
`--port`	No	`8000`	Server port.

Output:

Multiple PNG files like /path/to/output_dir/heatmap_<...>.png

`modalysis plot dmr-dotplot`

Purpose: Plot DMR positions within promoter/body/enhancer panels for each gene.

Algorithm:

Loads expression states and gene coordinates.
Converts each DMR to region-relative position:
- promoter: distance to gene start
- body: percent through gene body
- enhancer: distance from gene end
Groups positions by manifestation/expression/effect-sign/modification/gene.
Renders one 3-panel dotplot per non-empty combination.
Draws consensus windows containing many distinct genes.
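
The region-relative conversion can be sketched as a single function. Using the DMR midpoint as its position, ignoring strand, and the name `relative_position` are all assumptions for illustration only.

```python
def relative_position(region, dmr_start, dmr_end, gene_start, gene_end):
    """Convert a DMR interval to a region-relative coordinate.

    PROMOTER: distance (bp) from the DMR midpoint to the gene start.
    BODY: percent of the way through the gene body.
    ENHANCER: distance (bp) from the gene end to the DMR midpoint.
    """
    midpoint = (dmr_start + dmr_end) / 2
    if region == "PROMOTER":
        return gene_start - midpoint
    if region == "BODY":
        length = gene_end - gene_start
        # Guard against zero-length genes.
        return 100.0 * (midpoint - gene_start) / length if length > 0 else 0.0
    if region == "ENHANCER":
        return midpoint - gene_end
    raise ValueError(f"unknown region: {region}")
```

Expressing body positions as a percentage lets genes of very different lengths share one x-axis.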

Usage:

uv run modalysis plot dmr-dotplot \
  --annotated-dmr-paths /path/to/dmr1.modalysis /path/to/dmr2.modalysis \
  --manifestations M1 M1 \
  --modifications 5MC 5MC_5HMC \
  --manifestation-labels M1 \
  --expression-labels tissue_1 \
  --annotated-gff-path /path/to/gff_annotated.modalysis \
  --gff-path /path/to/formatted_gff.modalysis \
  --output-path /path/to/output_dir \
  --output-name dotplot \
  [--show-gene-labels] \
  [--effect-signs NEGATIVE NON_NEGATIVE] \
  [--port 8000]

Parameters:

Flag	Required	Default	Description
`--annotated-dmr-paths`	Yes	-	Annotated DMR inputs.
`--manifestations`	Yes	-	Manifestation label per DMR input.
`--modifications`	Yes	-	Modification label per DMR input.
`--manifestation-labels`	Yes	-	Canonical manifestation labels.
`--expression-labels`	Yes	-	Expression labels aligned to manifestation labels.
`--annotated-gff-path`	Yes	-	Annotated GFF with `EXPRESSION`.
`--gff-path`	Yes	-	Formatted GFF for coordinates.
`--output-path`	Yes	-	Output directory.
`--output-name`	Yes	-	Output prefix (`<prefix>_<combo>.png`).
`--show-gene-labels`	No	`False`	Show gene IDs.
`--effect-signs`	No	both	Restrict to `NEGATIVE` and/or `NON_NEGATIVE`.
`--port`	No	`8000`	Server port.

Output:

Multiple PNG files like /path/to/output_dir/dotplot_<...>.png

`modalysis plot common-genes-venn`

Purpose: Plot Venn diagrams of common negative-DMR genes for two modifications.

Algorithm:

From annotated DMR inputs, keeps only rows with negative effect size.
Collects gene sets by (manifestation, modification, region).
For each manifestation and each region, draws set overlap panel for modification A vs B.

Usage:

uv run modalysis plot common-genes-venn \
  --annotated-dmr-paths /path/to/dmr1.modalysis /path/to/dmr2.modalysis \
  --manifestations M1 M1 \
  --modifications 5MC 5MC_5HMC \
  --modification-a 5MC \
  --modification-b 5MC_5HMC \
  --output-path /path/to/output_dir \
  --output-name common_venn \
  [--port 8000]

Parameters:

Flag	Required	Default	Description
`--annotated-dmr-paths`	Yes	-	Annotated DMR inputs.
`--manifestations`	Yes	-	Manifestation label per DMR input.
`--modifications`	Yes	-	Modification label per DMR input.
`--modification-a`	Yes	-	First modification to compare.
`--modification-b`	Yes	-	Second modification to compare.
`--output-path`	Yes	-	Output directory.
`--output-name`	Yes	-	Output basename (`.png`).
`--port`	No	`8000`	Server port.

Output:

/path/to/output_dir/common_venn.png

Troubleshooting

ConnectionError / request failures:
- Ensure uv run modalysis server is running, and that it listens on the same port you pass to the command with --port.
Validation errors about list lengths:
- In DMR/plot aggregation commands, ensure paired list arguments have matching lengths and consistent ordering.
Empty/near-empty outputs:
- Relax thresholds such as --min-score, --max-p-value, --min-file-coverage, --min-reads.
- Verify chromosome naming in input files matches your allowed chromosome list.
ValueError: Modification A and B must be different:
- Use distinct values for --modification-a and --modification-b.
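
For connection failures, a quick way to check whether anything is listening on the expected port is a plain TCP probe. This sketch uses only the standard library; the host `127.0.0.1` is an assumption (the source does not state which address the server binds to), and `server_is_listening` is a hypothetical helper, not part of modalysis.

```python
import socket


def server_is_listening(host="127.0.0.1", port=8000, timeout=1.0):
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False
```

A `True` result only shows that some process accepts connections on that port, not that it is the modalysis server.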

Testing

Run the full suite:

uv run pytest -q

Run with coverage:

uv run pytest --cov=modalysis --cov-report=term-missing

Run focused suites:

uv run pytest tests/core -q
uv run pytest tests/server -q
uv run pytest tests/client -q
uv run pytest tests/cli -q
uv run pytest tests/e2e -q

Build Docs

Build the Sphinx site:

uv run sphinx-build -b html docs docs/_build/html

Open docs/_build/html/index.html in a browser.

Deploy the built site to Cloudflare Pages:

pnpm wrangler pages deploy docs/_build/html --project-name modalysis
