# LBatch
© 2021 Salo Sciences
`lbatch` is a system for creating Slurm pipelines to process airborne LiDAR data into the following raster outputs:
- canopy-base-height
- canopy-cover
- canopy-height
- flame-gap
- ladder-fuel-density
- vertical-layer-count
- vertical-profile
## Getting Started
### Prerequisites
The following software must be installed to run `lbatch`:
- wine
- LAStools
- lazer (from the conda environment)
The `lbatch-generator` and `lbatch-manager` scripts are installed during the `lazer` install.
### Testing
- Enable the `lazer` environment:

  ```
  conda activate lazer
  ```

- Generate a jobs dictionary with `lbatch-generator`. This uses the `example/las.example.yaml` site configuration file, which contains settings to process the `WA/fluidnumerics` site. You can run this command from the base `salo-lidar` repo directory.

  ```
  lbatch-generator lbatch/example/las.example.yaml
  ```

- Review the jobs dictionary, which contains Slurm submission instructions, such as compute resource allocations and dependency tracking.

  ```
  cat ./jobs.json
  ```

- Dry-run the jobs using `lbatch-manager` to ensure the Slurm submission is valid.

  ```
  lbatch-manager --dry-run ./jobs.json
  ```

- Submit the full pipeline.

  ```
  nohup lbatch-manager ./jobs.json &
  ```
Jobs can be monitored with the `squeue` Slurm command, and logs for the run can be checked in `./nohup.out` or `./batch-manager.log`.
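For example, assuming the default log locations above, a minimal monitoring sketch using standard Slurm and shell tools is:

```bash
# Show your queued and running Slurm jobs
squeue -u $USER

# Follow the manager output for this run
tail -f ./nohup.out
```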
Many of the default file paths can be configured via command-line arguments, but sensible defaults are provided to reduce mental overhead.
## Introduction
The `lbatch` system is a framework for creating pipelines to process LiDAR point-cloud data into raster products using the Slurm job scheduler. It combines a template engine (`lbatch-generator`) with a job submission and monitoring tool (`lbatch-manager`).
By pipeline, we mean an ordered list of commands that need to be run in sequence to transform input into the desired output. For `lbatch`, the sequence consists of commands supported by `lazer`, in addition to commands that move data between Google Compute Engine resources and Google Cloud Storage.
Each `lazer` command takes input files and returns output files. The locations for the input and output are determined at runtime. Additionally, each `lazer` command has optional arguments that are set depending on the attributes of the raw LiDAR data provided at the start of the pipeline. Finally, the sequence of `lazer` commands to run also depends on the raw LiDAR data attributes.
For these reasons, `lbatch` is designed as a "pipeline concretizer" that is capable of creating concrete pipelines that depend on the input provided to it.
## Directory Structure
The `lbatch` repository contains the following directories:

- `bin/` - Contains the `lbatch-generator` and `lbatch-manager` scripts
- `etc/` - Contains configurations for `lbatch-generator` and template pipelines (aka "workflows")
- `example/` - Contains an example sites specification file and sample results produced by the `lbatch-generator` reporting capabilities
## How does lbatch work?
`lbatch` runs in two stages:

1. `lbatch-generator`
2. `lbatch-manager`
In essence, the `lbatch-generator` generates a jobs dictionary and the `lbatch-manager` schedules jobs for execution (via Slurm) on the appropriate compute partitions with the necessary dependencies.
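A typical end-to-end invocation, using the example sites specification from the Testing section and the default `./jobs.json` output path, looks roughly like this:

```bash
# Stage 1: concretize the pipeline into a jobs dictionary
lbatch-generator lbatch/example/las.example.yaml ./jobs.json

# Stage 2: validate the submission, then submit and monitor the pipeline via Slurm
lbatch-manager --dry-run ./jobs.json
nohup lbatch-manager ./jobs.json &
```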
### Batch Generator
Overall, the `lbatch-generator` stage parses a provided sites specification file and creates a single JSON output that describes all of the jobs that need to be run. After interpreting the sites specification file, `lbatch-generator` downloads sample data from the appropriate Google Cloud Storage location for each site and chooses the template workflow using one of the `etc/workflow-*.yaml` definitions.

The logic for choosing the template workflow is defined in the `setWorkflow` routine in `bin/lbatch-generator`. Once the default workflow is chosen, `lbatch-generator` defines jobs for the workflow by concretizing the working directory, compute partition, memory required, CPUs per task, job array sizes, job scripts, and job dependencies (by job name). Once the jobs are concretized, `lbatch-generator` finishes by writing the jobs dictionary to file.
#### lbatch-generator CLI
The generator can be invoked using the `lbatch-generator` Python script.
```
usage: lbatch-generator [-h] [--array-size ARRAY_SIZE]
                        [--task-count TASK_COUNT] [--schema SCHEMA]
                        [--conf CONF] [--workflow-path WORKFLOW_PATH]
                        [--log-directory LOG_DIRECTORY]
                        sites_spec [job_dictionary]

Generate a job dictionary for carrying out laz file processing

positional arguments:
  sites_spec            Lazer pipeline config.
  job_dictionary        Workflow dictionary. Defaults to ./jobs.json

optional arguments:
  -h, --help            show this help message and exit
  --array-size ARRAY_SIZE
                        Maximum number of array jobs to submit
  --task-count TASK_COUNT
                        Maximum number of simultaneous jobs to submit in the
                        job array
  --schema SCHEMA       Full path to a job dictionary schema to use.
  --conf CONF           Full path to the lbatch-manager configuration file.
  --workflow-path WORKFLOW_PATH
                        Full path to the workflow definition files.
  --log-directory LOG_DIRECTORY
                        Full path to directory to write logs to.
```
### Batch Manager
The Batch Manager is responsible for reading and validating a jobs dictionary produced by the Batch Generator and scheduling work for execution. While submitting jobs, the batch manager maps the job names defined in the jobs dictionary to Slurm job IDs that are used to resolve dependencies between jobs; this feature allows the manager to leverage the `--dependency` flag for `sbatch` so that jobs run in the preferred order.
Once jobs are submitted, the batch manager regularly queries Slurm to obtain job status. If any job fails or is canceled, the whole pipeline is assumed to fail and the batch manager cancels all remaining jobs. When a job completes, the batch manager queries Slurm for the exit code and runtime for the job. This information is aligned with the job dictionary information (such as the step name, command executed, and compute partition) so that runtime, cpu-hours, and node-hours (proportional to cost on Google Cloud) can be calculated. This information is a necessary ingredient for performance and cost optimization.
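As an illustration of the underlying Slurm mechanism (not the exact commands `lbatch-manager` issues; the script names here are hypothetical), name-to-job-ID dependency resolution with `sbatch` works like this:

```bash
# Submit the upstream job; --parsable prints only the numeric Slurm job ID
tile_id=$(sbatch --parsable tile.sbatch)

# Submit the downstream job so it starts only after the upstream job exits successfully
sbatch --dependency=afterok:${tile_id} canopy_height.sbatch
```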
#### lbatch-manager CLI
The manager can be invoked using the `lbatch-manager` Python script.
```
usage: lbatch-manager [-h] [--schema SCHEMA] [--log-directory LOG_DIRECTORY]
                      [--dry-run]
                      job_dictionary

Manage Slurm batch job submission

positional arguments:
  job_dictionary        A defined job dictionary specifying the batch scripts
                        to run, the partitions to submit them to, job
                        dependencies, and any additional batch options.

optional arguments:
  -h, --help            show this help message and exit
  --schema SCHEMA       Full path to a job dictionary schema to use.
  --log-directory LOG_DIRECTORY
                        Full path to directory to write logs to.
  --dry-run             Enable dry runs. When set, the planned commands are
                        surfaced to stdout and no jobs are submitted.
```
## Reference
### Sites Specification File (Batch Generator Input)
A sites specification file defines global settings alongside an array of per-site settings for processing. An example is given below:
```yaml
settings:
  workspace: '/apps/workspace/'
  gcs_bucket: 'nfo-lidar'
  ltbase: '/apps/LASTools/bin'

sites:
  - name: 'fluidnumerics'
    state: 'WA'
    sub: 'laz'
    espg: '32610'
    overrides:
      - step: 'tile'
        opts:
          - key: 'cpus_per_task'
            value: 8
```
#### Settings
- `settings.workspace`: The root-level directory for creating working directories for each step in the build process. For each execution of `lbatch-generator`, a directory is created at `{settings.workspace}/(hex8(utime))`, where `(hex8(utime))` is a hexadecimal representation of the UNIX time when the `lbatch-generator` script was executed.
- `settings.gcs_bucket`: The name of the Google Cloud Storage bucket that hosts the raw unprocessed data for all sites defined in this sites specification file and where all processed data will be posted.
- `settings.ltbase`: The path to the LASTools installation. Each generated job script sets the environment variable `LTBASE` equal to the value of `settings.ltbase`.
#### Sites
- `sites[].name`: The name of the site. The `name` must match the site name identified in the directory structure of the raw unprocessed data, `{STATE}/raw-unprocessed/{SITENAME}/{SUB}`.
- `sites[].state`: The state ID for the site. The `state` must match the state identified in the directory structure of the raw unprocessed data, `{STATE}/raw-unprocessed/{SITENAME}/{SUB}`.
- `sites[].sub`: The extension, either `las` or `laz`, of the files in the directory. The `sub` must match the subdirectory identified in the directory structure of the raw unprocessed data, `{STATE}/raw-unprocessed/{SITENAME}/{SUB}`.
- `sites[].espg`: Spatial projection to project data to when calling `lazer reproject`. This defines the `{ESPG}` template variable in the template workflows.
- `sites[].overrides[]`: A list object that is used to override the `cpus_per_task`, `partition`, `memory`, and `job_array_size` for each step. The structure is documented below.
**Site Overrides**

- `sites[].overrides[].step`: The name of the step, as defined in one of the `etc/workflow-*.yaml` files, to override default settings on. Note that `cpus_per_task`, `partition`, `memory`, and `job_array_size` are inherited from the `cli` defined in `etc/batch-generator.conf.yaml`.
- `sites[].overrides[].opts[].key`: The setting that you want to override. One of `cpus_per_task`, `partition`, `memory`, or `job_array_size`.
- `sites[].overrides[].opts[].value`: The value for the setting you want to override.
### Workflow Templates (Batch Generator Input)
Because the steps that need to be run depend on the nature of the input for each site, different workflow templates are provided. In general, the workflow templates address cases that depend on whether or not the raw data is in UTM coordinates and whether or not ground points are classified.

The workflow template files are JSON files that specify a list of objects, each defining a step in the workflow and its dependencies on other steps.
- `workflow[].name`: The name of the step in the workflow. This name must be unique within the given workflow.
- `workflow[].cli`: The command to run for this step. The command is defined in the Batch Generator configuration file.
- `workflow[].depends`: A list of workflow step names that this step depends on. If no dependencies exist, set to `[]`.
- `workflow[].opts`: Command-line options to substitute in for the `{OPTS}` template variable. Note that the `{OPTS}` variable is used in the `command` specifications in the Batch Generator configuration file.
- `workflow[].odir`: The output directory for this step. This value is swapped in for the `{OUTPUT_DIRECTORY}` template variable and, in most cases, is passed to the `-odir` flag for `lazer` commands.
- `workflow[].inputs` (Optional): If provided, the workflow `inputs` overrides the default `inputs` provided in the Batch Generator configuration file for the given `cli`.
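As a purely illustrative sketch (the step names are borrowed from the example sites specification and product list; the `cli`, `opts`, and `odir` values are hypothetical), a two-step workflow template might look like:

```json
[
  {
    "name": "tile",
    "cli": "tile",
    "depends": [],
    "opts": "",
    "odir": "tile"
  },
  {
    "name": "canopy-height",
    "cli": "canopy_height",
    "depends": ["tile"],
    "opts": "--hypothetical-option 1.0",
    "odir": "canopy-height"
  }
]
```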
There are currently four workflow templates defined in `etc/workflow-*.json`. The path to these files is controlled by the `--workflow-path` command-line argument for `lbatch-generator`.
### Batch Generator Configuration File (Batch Generator Input)
The Batch Generator Configuration File is used to define template commands that can be used to construct build pipelines. Each "pipeline command" consists of a command, a default number of CPUs per task, a default partition, a job type, and a regex for finding input files.

Each entry in the dictionary follows the schema below:
```yaml
cli:
  pipeline_command:
    command: 'echo "hello world"'
    cpus_per_task: 32
    partition: 'e2-standard-32'
    jobType: 'singleton'
    inputs: '*.laz'
```
The attributes of a pipeline command are defined below:
- `command` (string): The templated command to run for this pipeline command.
- `cpus_per_task` (number): The default number of CPUs to use for running this command.
- `partition` (string): The default Slurm partition to use for running this command.
- `jobType` (string): One of either `singleton` or `array`. This is used to determine whether the batch template is for a single-task batch job (`singleton`) or a job array (`array`).
- `inputs`: The files (under the input directory) to search for to generate the input file list dynamically.
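As a purely hypothetical illustration of how template variables appear inside a `command` string (the tool name and its flags are placeholders; only the template variables themselves are documented, under Template Variables below):

```yaml
cli:
  canopy_height:
    # Placeholder command string: {LTBASE}, {INPUT_LIST}, {OPTS}, and
    # {OUTPUT_DIRECTORY} are resolved by lbatch-generator at concretization time.
    command: '{LTBASE}/some-lidar-tool -i {INPUT_LIST} {OPTS} -odir {OUTPUT_DIRECTORY}'
    cpus_per_task: 16
    partition: 'e2-standard-16'
    jobType: 'array'
    inputs: '*.laz'
```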
### Jobs Dictionary (Batch Generator Output / Batch Manager Input)
The jobs dictionary file contains the concretized list of jobs that need to be scheduled. The schema for the jobs dictionary is defined in `etc/batch-dictionary.schema.json`.
### Working Directory Structure
When an `lbatch` pipeline is executed, a unique workspace is set up to avoid potential conflicts with other pipelines that may be running simultaneously. The root of the directory is set in the sites specification file, under `settings.workspace`. This variable allows you to control which file system is used to handle file IO; this location must be a mount point that is visible to all compute instances in your cluster.
Under the root directory, the `lbatch-generator` script creates a directory named as the hexadecimal representation of the UNIX time plus a random integer. This is the top-level working directory for a specific pipeline execution.
Under the top-level working directory, the following directories are created:

- `{STATE}/raw-unprocessed/{SITENAME}/{SUB}` - The directory used for storing data downloaded from Google Cloud Storage.
- `{STATE}/processing/{SITENAME}/{STEP}` - The directory used for storing temporary output for each step.
- `{STATE}/scripts/{SITENAME}/{STEP}` - The directory used for storing temporary batch scripts, stdout, and stderr for each step in the pipeline.
- `{STATE}/processed/{SITENAME}` - The directory used for storing the output from the complete pipeline execution. This directory is loaded back into Google Cloud Storage at the end of the pipeline execution.
In the directory hierarchy, the template variables are defined as follows:

- `{STATE}` - The state where the LiDAR data comes from. Resolves to `{site_spec.sites.state}`.
- `{SITENAME}` - The name of the LiDAR site. Resolves to `{site_spec.sites.name}`.
- `{STEP}` - The name of the step in the workflow. Resolves to `{workflow[i].name}`.
- `{SUB}` - The extension of the source files, either `las` or `laz`. Resolves to `{site_spec.sites.sub}`.
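As an illustration, for the `WA`/`fluidnumerics` example site (with a made-up hexadecimal run directory), the working directory layout would resemble:

```
/apps/workspace/6553f100/                  <- {WORKSPACE}
└── WA/
    ├── raw-unprocessed/fluidnumerics/laz/ <- downloaded source data
    ├── processing/fluidnumerics/{STEP}/   <- temporary per-step output
    ├── scripts/fluidnumerics/{STEP}/      <- batch scripts, stdout, and stderr
    └── processed/fluidnumerics/           <- final products pushed back to GCS
```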
### Template Variables
When specifying commands, you can use the following template variables:

- `{WORKSPACE}`: The path to a unique workspace created for processing a site. At runtime, `lbatch-generator` sets this variable equal to `{sites_spec.settings.workspace}/(hex(utime+random))`; the directory root is specified in your sites specification file and the subdirectory is the UNIX time plus a random integer, converted to hexadecimal notation. This ensures that each execution of the pipeline for a site works in an isolated directory set.
- `{SITE_PATH}`: The relative path to the unprocessed LiDAR data. At runtime this is set to `{sites_spec.sites.state}/raw-unprocessed/{sites_spec.sites.name}/{sites_spec.sites.sub}`. This is used to download data from Google Cloud Storage (`gs://{GCS_BUCKET}/{SITE_PATH}`) and to keep the directory structure consistent on the cluster (`{WORKSPACE}/{SITE_PATH}`).
- `{GCS_BUCKET}`: The name of the Google Cloud Storage bucket hosting the raw LiDAR data and where processed data will be pushed. At runtime this is set to `sites_spec.settings.gcs_bucket`.
- `{CPUS_PER_TASK}`: The number of vCPUs to make available for the command. The default value is set in the Batch Generator configuration file, but can be overridden via `sites_spec.sites.overrides.opts` by setting `sites_spec.sites.overrides.step` to the pipeline step name, `sites_spec.sites.overrides.opts.key` to `"cpus_per_task"`, and `sites_spec.sites.overrides.opts.value` to the number of vCPUs per task.
- `{INPUTS}`: The input file regex used to generate the input file list. Resolves to `{conf.{cli}.inputs}`.
- `{INPUT_DIRECTORY}`:
- `{INPUT_LIST}`: The input file list that is generated at runtime for the given step. When the step has a dependency, the input file list is resolved as the list of outputs generated by the first listed dependency.
- `{OPTS}`: Resolves to the options specified in one of the workflow template files (under `etc/`), in addition to other options that are determined based on the attributes of the input raw laz files.
- `{OUTPUT_DIRECTORY}`: The directory to post output to. For jobs that have `stage` set to `intermediate`, this resolves to `{WORKSPACE}/{sites_spec.site.state}/processing/{sites_spec.site.name}/{step_name}`; when `stage` is set to `processed`, this resolves to `{WORKSPACE}/{sites_spec.site.state}/processed/{sites_spec.site.name}/{ODIX}`.
- `{SITE}`:
- `{ESPG}`: Spatial projection to project data to when calling `lazer reproject`. This resolves to `{sites_spec.site.espg}`.
- `{LTBASE}`: Path to the LASTools bin directory. Resolves to `{sites_spec.settings.ltbase}`.
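For example, with the `example/las.example.yaml` sites specification shown earlier, the site-level template variables would resolve roughly as follows (the hexadecimal run directory is illustrative):

```
{GCS_BUCKET} -> nfo-lidar
{LTBASE}     -> /apps/LASTools/bin
{ESPG}       -> 32610
{SITE_PATH}  -> WA/raw-unprocessed/fluidnumerics/laz
{WORKSPACE}  -> /apps/workspace/6553f100
```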
### How Pipelines are Defined
A concretized pipeline begins its life as one of the template pipelines specified in the `etc/` directory:

- `etc/workflow-default.json`: A template pipeline that is used when data is in UTM coordinates and ground points are already classified.
- `etc/workflow-noground.json`: A template pipeline that is used when data is in UTM coordinates but ground points are not classified.
- `etc/workflow-notutm.json`: A template pipeline that is used when data is not in UTM coordinates.
The decision about which pipeline to use is made by the `setWorkflow` routine in `lbatch-generator`; this decision is based on information returned by `lazer.pdal.info(path)`.

Once the template pipeline is chosen, `lbatch-generator` proceeds by concretizing the template variables and resolving dependencies for each step. The result of running `lbatch-generator` is a concretized job dictionary.
### The Job Dictionary File
The job dictionary file is a set of concretized batch jobs to submit to Slurm. Each job is specified by the following attributes:

- `site`: The name of the LiDAR site being processed.
- `name`: A unique name for the job (within the scope of the dictionary).
- `script`: The path to the batch script that will be executed.
- `dependencies`: A list of job names (also defined in the same dictionary) that the job depends on.
- `partition`: The compute partition to submit the job to.
- `batch_options`: The options to send to `sbatch` at scheduling.
- `workspace`: The working directory to submit the job from. This is the same directory where stderr and stdout will be saved.
- `run_id`: The unique identifier for a specific pipeline execution.
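A single job entry might look something like the sketch below; the values are hypothetical and the exact field shapes are governed by `etc/batch-dictionary.schema.json`:

```json
{
  "site": "fluidnumerics",
  "name": "fluidnumerics-canopy-height",
  "script": "/apps/workspace/6553f100/WA/scripts/fluidnumerics/canopy-height/canopy-height.sh",
  "dependencies": ["fluidnumerics-tile"],
  "partition": "e2-standard-32",
  "batch_options": "--cpus-per-task=32",
  "workspace": "/apps/workspace/6553f100/WA/scripts/fluidnumerics/canopy-height",
  "run_id": "6553f100"
}
```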
The job dictionary is ingested by `lbatch-manager`, which schedules the complete pipeline using `sbatch` commands. The `lbatch-manager` resolves dependencies to Slurm job IDs so that execution of each step in the pipeline is carried out in the intended order.