Running the Workflow

The full NSBI pipeline can be executed step by step or orchestrated via HTCondor DAGMan on a cluster. Snakemake support is planned as an additional orchestration option; because it is agnostic to the computing infrastructure, it would allow runs on HPC, HTC, or a personal laptop.

Below is an example workflow using the FAIR Universe \(H\to \tau\tau\) dataset.

All pipeline scripts are driven by a single configuration file, config.pipeline.yaml, located at the root of each example directory (e.g. examples/FAIR_universe_Higgs_tautau/config.pipeline.yaml). This file defines dataset paths, training hyperparameters, ensemble sizes, systematic variations, and fit settings. Inspect the example config to understand the available options.

Pipeline overview

Local (sequential) execution

From the example directory (examples/FAIR_universe_Higgs_tautau/):

# 1. Load and preprocess data
python scripts/data_loader.py --config config.pipeline.yaml
python scripts/data_preprocessing.py --config config.pipeline.yaml

# 2. Train preselection network (region classifier)
python scripts/preselection_network.py --config config.pipeline.yaml

# 3. Train nominal density-ratio ensembles (per process)
python scripts/neural_likelihood_ratio_estimation.py \
    --config config.pipeline.yaml --process htautau --ensemble_index 0

# 4. Train systematic variation networks
python scripts/systematic_uncertainty_training.py \
    --config config.pipeline.yaml --process htautau --systematic JES --direction Up

# 5. Evaluate all trained models on the Asimov dataset
python scripts/data_nn_eval.py --config config.pipeline.yaml

# 6. Build workspace and fit
python scripts/parameter_fitting.py --config config.pipeline.yaml

Steps 3 and 4 are embarrassingly parallel across processes, ensemble members, and systematic variations.
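Because each training job depends only on its own (process, ensemble index) or (process, systematic, direction) tuple, the job list can be enumerated up front. The sketch below is illustrative rather than part of the repository: it builds one command per job using the script names and flags shown above. The process name "ztautau", the "Down" direction, and the helper name training_commands are assumptions made for the example.

```python
from itertools import product
import shlex

CONFIG = "config.pipeline.yaml"


def training_commands(processes, n_ensemble, systematics, directions=("Up", "Down")):
    """Return one argv list per independent training job (steps 3 and 4)."""
    cmds = []
    # Step 3: one nominal density-ratio job per (process, ensemble_index).
    for proc, idx in product(processes, range(n_ensemble)):
        cmds.append([
            "python", "scripts/neural_likelihood_ratio_estimation.py",
            "--config", CONFIG, "--process", proc, "--ensemble_index", str(idx),
        ])
    # Step 4: one job per (process, systematic, direction).
    for proc, syst, direction in product(processes, systematics, directions):
        cmds.append([
            "python", "scripts/systematic_uncertainty_training.py",
            "--config", CONFIG, "--process", proc,
            "--systematic", syst, "--direction", direction,
        ])
    return cmds


if __name__ == "__main__":
    # "ztautau" is a hypothetical second process, used only for illustration.
    for cmd in training_commands(["htautau", "ztautau"], n_ensemble=2, systematics=["JES"]):
        print(shlex.join(cmd))
```

Each returned list can be passed to subprocess.run, or printed and fed to whatever scheduler is available; no job reads another job's output, so the order does not matter.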

Cluster execution (HTCondor / DAGMan)

The htcondor/ directory contains submit descriptions and DAG files that orchestrate the pipeline on CHTC, driven by the configuration file config.pipeline.yaml:

htcondor/
  workflow_full.dag                  # top-level DAG submitting the full end-to-end workflow
  stage_data_processing.dag          # data loading and processing DAG
  stage_preselection_network.dag     # train signal- and control-region selection neural network
  stage_density_ratio_training.dag   # top-level density ratio estimation and evaluation DAG
      generate_training_dag.py           # generates train_ensemble.dag dynamically
      generate_systematics_dag.py        # generates train_systematics.dag dynamically
      train_ensemble.dag                 # one job per (process, ensemble_index)
      train_systematics.dag              # one job per (process, systematic, direction)
  stage_parameter_fitting.dag        # build model and fit parameters for statistical inference

Submit the full pipeline:

condor_submit_dag examples/FAIR_universe_Higgs_tautau/htcondor/workflow_full.dag

The full DAG structure, including ensemble parallelism:

Submit only the training and evaluation pipeline (a targeted submission, e.g. for optimization runs):

condor_submit_dag examples/FAIR_universe_Higgs_tautau/htcondor/stage_density_ratio_training.dag

Stage 3 (density ratio training) in detail:

DAGMan handles:

SCRIPT PRE — dynamically generates the training DAGs by reading the pipeline config (number of ensemble members, systematic variations, etc.).
SUBDAG EXTERNAL — submits the generated DAGs as nested sub-workflows.
PARENT/CHILD — ensures evaluation runs only after all training completes.
RETRY — automatically retries failed jobs (transient GPU errors, etc.).
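To make the four DAGMan keywords above concrete, here is a minimal sketch of how a generator script might emit the DAG text. It is not the repository's generate_training_dag.py: the submit file names, job names, retry count, and the argument-free SCRIPT PRE calls are all assumptions. Only the DAG file names and the PROCESS_TYPE / ENSEMBLE_INDEX variable names are taken from this page.

```python
def build_training_dag(processes, n_ensemble, submit_file="train_ensemble.sub", retries=2):
    """Emit one JOB/VARS/RETRY triple per (process, ensemble_index)."""
    lines = []
    for proc in processes:
        for idx in range(n_ensemble):
            job = f"train_{proc}_{idx}"
            lines.append(f"JOB {job} {submit_file}")  # submit_file is hypothetical
            lines.append(f'VARS {job} PROCESS_TYPE="{proc}" ENSEMBLE_INDEX="{idx}"')
            lines.append(f"RETRY {job} {retries}")
    return "\n".join(lines) + "\n"


def build_stage_dag(eval_submit_file="evaluate.sub"):
    """Emit a parent DAG: generate sub-DAGs, run them, then evaluate."""
    return "\n".join([
        # Nested sub-workflows, submitted as their own DAGs.
        "SUBDAG EXTERNAL TRAIN_ENSEMBLE train_ensemble.dag",
        "SUBDAG EXTERNAL TRAIN_SYSTEMATICS train_systematics.dag",
        # PRE scripts write the sub-DAG files just before they are submitted.
        "SCRIPT PRE TRAIN_ENSEMBLE generate_training_dag.py",
        "SCRIPT PRE TRAIN_SYSTEMATICS generate_systematics_dag.py",
        # Evaluation job (hypothetical submit file name).
        f"JOB EVALUATE {eval_submit_file}",
        # Evaluation waits for both training sub-DAGs to finish.
        "PARENT TRAIN_ENSEMBLE TRAIN_SYSTEMATICS CHILD EVALUATE",
        "",
    ])


if __name__ == "__main__":
    print(build_training_dag(["htautau"], n_ensemble=2))
    print(build_stage_dag())
```

Generating the inner DAG from a PRE script means the number of jobs follows the pipeline config at submit time, so changing the ensemble size does not require editing any DAG file by hand.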

File transfer

Each job transfers the source code and example directory to the execute point via transfer_input_files. Trained model outputs are transferred back per-job to unique directories (keyed by process and ensemble index) to avoid overwrites from concurrent jobs:

transfer_output_files = .../output_model_params_$(PROCESS_TYPE)$(ENSEMBLE_INDEX),
                        .../output_figures_$(PROCESS_TYPE)$(ENSEMBLE_INDEX),
                        .../output

The evaluation job (data_nn_eval) transfers the full saved_datasets/ directory back, since it is a single job with no concurrency risk.
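The per-job naming scheme can be checked before submission. The sketch below assumes directory names are formed by concatenating a prefix, the process name, and the ensemble index with no separator, as the transfer_output_files pattern above suggests. The helper names and the example process names are hypothetical.

```python
from collections import Counter


def output_dir_names(jobs, prefix="output_model_params_"):
    """Map each (process, ensemble_index) job to its output directory name."""
    return {job: f"{prefix}{job[0]}{job[1]}" for job in jobs}


def find_collisions(jobs, prefix="output_model_params_"):
    """Return directory names that more than one job would write to."""
    counts = Counter(output_dir_names(jobs, prefix).values())
    return sorted(name for name, n in counts.items() if n > 1)


if __name__ == "__main__":
    jobs = [("htautau", 0), ("htautau", 1), ("ztautau", 0)]
    print(output_dir_names(jobs))
    print(find_collisions(jobs))
```

With separator-free concatenation, two jobs can only collide if one process name ends in a digit (for example, a process "bkg1" with index 0 and a process "bkg" with index 10), so a check like this is cheap insurance when adding new processes.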

Adapting to your cluster

The HTCondor setup under htcondor/ is written for the CHTC pool at UW-Madison and will not work out of the box on other clusters. To adapt it you will need to modify at minimum:

Submit descriptions (*.sub files): resource requests (request_gpus, request_memory), container image or requirements classad, and transfer_input_files / transfer_output_files paths to match your storage layout.
config.pipeline.yaml: update all dataset and output paths to reflect your directory structure.
Environment setup: the submit files assume a specific container or software stack; replace with your site’s equivalent (conda/pixi env, Apptainer image, module loads, etc.).

If your site uses a different batch system (SLURM, PBS, etc.) you can still use the local sequential commands above and wrap them in the appropriate job scripts. Snakemake support (infrastructure-agnostic) is planned.
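As one possible way to wrap the sequential commands for SLURM, the sketch below generates a job-array script in which each array task trains one ensemble member. It is an assumption-laden example, not a tested recipe: the function name, job name, and GPU request line are hypothetical, and the resource flags will differ between sites.

```python
def slurm_array_script(process, n_ensemble, config="config.pipeline.yaml"):
    """Return an sbatch script with one array task per ensemble member."""
    return "\n".join([
        "#!/bin/bash",
        f"#SBATCH --job-name=nsbi_{process}",
        # Array indices 0..n_ensemble-1 map directly onto --ensemble_index.
        f"#SBATCH --array=0-{n_ensemble - 1}",
        # GPU request syntax is site-dependent; adjust or remove as needed.
        "#SBATCH --gres=gpu:1",
        "",
        "python scripts/neural_likelihood_ratio_estimation.py \\",
        f"    --config {config} --process {process} \\",
        "    --ensemble_index ${SLURM_ARRAY_TASK_ID}",
        "",
    ])


if __name__ == "__main__":
    print(slurm_array_script("htautau", n_ensemble=4))
```

The printed script can be saved to a file and submitted with sbatch from the example directory; evaluation and fitting would then be submitted as separate jobs that depend on the array finishing.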