# LBatch
© 2021 Salo Sciences
`lbatch` is a system for creating Slurm pipelines to process airborne LiDAR data into the following raster outputs:
- canopy-base-height
- canopy-cover
- canopy-height
- flame-gap
- ladder-fuel-density
- vertical-layer-count
- vertical-profile
## Getting Started
### Prerequisites
The following software must be installed to run `lbatch`:
- wine
- LAStools
- lazer (from the conda environment)
The `lbatch-generator` and `lbatch-manager` scripts are installed during the `lazer` install.
### Testing
- Enable the `lazer` environment:

  ```
  conda activate lazer
  ```

- Generate a jobs dictionary with `lbatch-generator`. This uses the `example/las.example.yaml` site configuration file, which contains settings to process the `WA/fluidnumerics` site. You can run this command from the base `salo-lidar` repo directory.

  ```
  lbatch-generator lbatch/example/las.example.yaml
  ```

- Review the jobs dictionary, which contains Slurm submission instructions, such as compute resource allocations and dependency tracking.

  ```
  cat ./jobs.json
  ```

- Dry-run the jobs using `lbatch-manager` to ensure the Slurm submission is valid.

  ```
  lbatch-manager --dry-run ./jobs.json
  ```

- Submit the full pipeline.

  ```
  nohup lbatch-manager ./jobs.json &
  ```
Jobs can be monitored with the `squeue` Slurm command, and logs for the run can be checked in `./nohup.out` or `./batch-manager.log`.
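For example, assuming the default log locations above, a minimal monitoring sketch using standard Slurm and shell tools is:

```bash
# Show your queued and running Slurm jobs
squeue -u $USER

# Follow the manager output for this run
tail -f ./nohup.out
```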
Many of the default file paths can be configured via command-line arguments, but sensible defaults are provided to reduce mental overhead.
## Introduction
The `lbatch` system is a framework for creating pipelines to process LiDAR point-cloud data into raster products using the Slurm job scheduler. It combines a template engine (`lbatch-generator`) with a job submission and monitoring tool (`lbatch-manager`).
By pipeline, we mean an ordered list of commands that need to be run in sequence to transform input into the desired output. For `lbatch`, the sequence consists of commands supported by `lazer`, in addition to commands that move data between Google Compute Engine resources and Google Cloud Storage.
Each `lazer` command takes input files and returns output files. The locations for the input and output are determined at runtime. Additionally, each `lazer` command has optional arguments that are set depending on the attributes of the raw LiDAR data provided at the start of the pipeline. Finally, the sequence of `lazer` commands to run also depends on the raw LiDAR data attributes.
For these reasons, `lbatch` is designed as a "pipeline concretizer" that is capable of creating concrete pipelines that depend on the input provided to it.
## Directory Structure
The `lbatch` repository contains the following directories:

- `bin/` - Contains the `lbatch-generator` and `lbatch-manager` scripts
- `etc/` - Contains configurations for `lbatch-generator` and template pipelines (aka "workflows")
- `example/` - Contains an example sites specification file and sample results produced by the `lbatch-generator` reporting capabilities
## How does lbatch work?
`lbatch` runs in two stages:

1. `lbatch-generator`
2. `lbatch-manager`
In essence, the `lbatch-generator` generates a jobs dictionary and the `lbatch-manager` schedules jobs for execution (via Slurm) on the appropriate compute partitions with the necessary dependencies.
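A typical end-to-end invocation, using the example sites specification from the Testing section and the default `./jobs.json` output path, looks roughly like this:

```bash
# Stage 1: concretize the pipeline into a jobs dictionary
lbatch-generator lbatch/example/las.example.yaml ./jobs.json

# Stage 2: validate the submission, then submit and monitor the pipeline via Slurm
lbatch-manager --dry-run ./jobs.json
nohup lbatch-manager ./jobs.json &
```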
### Batch Generator
Overall, the `lbatch-generator` stage parses a provided sites specification file and creates a single JSON output that describes all of the jobs that need to be run. After interpreting the sites specification file, `lbatch-generator` downloads sample data from the appropriate Google Cloud Storage location for each site and chooses the template workflow using one of the `etc/workflow-*.yaml` definitions.

The logic for choosing the template workflow is defined in the `setWorkflow` routine in `bin/lbatch-generator`. Once the default workflow is chosen, `lbatch-generator` defines jobs for the workflow by concretizing the working directory, compute partition, memory required, CPUs per task, job array sizes, job scripts, and job dependencies (by job name). Once the jobs are concretized, `lbatch-generator` finishes by writing the jobs dictionary to file.
#### lbatch-generator CLI
The generator can be invoked using the `lbatch-generator` Python script.
```
usage: lbatch-generator [-h] [--array-size ARRAY_SIZE]
                        [--task-count TASK_COUNT] [--schema SCHEMA]
                        [--conf CONF] [--workflow-path WORKFLOW_PATH]
                        [--log-directory LOG_DIRECTORY]
                        sites_spec [job_dictionary]

Generate a job dictionary for carrying out laz file processing

positional arguments:
  sites_spec            Lazer pipeline config.
  job_dictionary        Workflow dictionary. Defaults to ./jobs.json

optional arguments:
  -h, --help            show this help message and exit
  --array-size ARRAY_SIZE
                        Maximum number of array jobs to submit
  --task-count TASK_COUNT
                        Maximum number of simultaneous jobs to submit in the
                        job array
  --schema SCHEMA       Full path to a job dictionary schema to use.
  --conf CONF           Full path to the lbatch-manager configuration file.
  --workflow-path WORKFLOW_PATH
                        Full path to the workflow definition files.
  --log-directory LOG_DIRECTORY
                        Full path to directory to write logs to.
```
### Batch Manager
The Batch Manager is responsible for reading and validating a jobs dictionary produced by the Batch Generator and scheduling work for execution. While submitting jobs, the batch manager maps the job names defined in the jobs dictionary to Slurm job IDs that are used to resolve dependencies between jobs; this feature allows the manager to leverage the `--dependency` flag for `sbatch` so that jobs run in the preferred order.
Once jobs are submitted, the batch manager regularly queries Slurm to obtain job status. If any job fails or is canceled, the whole pipeline is assumed to fail and the batch manager cancels all remaining jobs. When a job completes, the batch manager queries Slurm for the exit code and runtime for the job. This information is aligned with the job dictionary information (such as the step name, command executed, and compute partition) so that runtime, cpu-hours, and node-hours (proportional to cost on Google Cloud) can be calculated. This information is a necessary ingredient for performance and cost optimization.
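As an illustration of the underlying Slurm mechanism (not the exact commands `lbatch-manager` issues; the script names here are hypothetical), name-to-job-ID dependency resolution with `sbatch` works like this:

```bash
# Submit the upstream job; --parsable prints only the numeric Slurm job ID
tile_id=$(sbatch --parsable tile.sbatch)

# Submit the downstream job so it starts only after the upstream job exits successfully
sbatch --dependency=afterok:${tile_id} canopy_height.sbatch
```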
#### lbatch-manager CLI
The manager can be invoked using the `lbatch-manager` Python script.
```
usage: lbatch-manager [-h] [--schema SCHEMA] [--log-directory LOG_DIRECTORY]
                      [--dry-run]
                      job_dictionary

Manage Slurm batch job submission

positional arguments:
  job_dictionary        A defined job dictionary specifying the batch scripts
                        to run, the partitions to submit them to, job
                        dependencies, and any additional batch options.

optional arguments:
  -h, --help            show this help message and exit
  --schema SCHEMA       Full path to a job dictionary schema to use.
  --log-directory LOG_DIRECTORY
                        Full path to directory to write logs to.
  --dry-run             Enable dry runs. When set, the planned commands are
                        surfaced to stdout and no jobs are submitted.
```
## Reference
### Sites Specification File (Batch Generator Input)
A sites specification file defines global settings alongside an array of per-site settings for processing. An example is given below:
```yaml
settings:
  workspace: '/apps/workspace/'
  gcs_bucket: 'nfo-lidar'
  ltbase: '/apps/LASTools/bin'

sites:
  - name: 'fluidnumerics'
    state: 'WA'
    sub: 'laz'
    espg: '32610'
    overrides:
      - step: 'tile'
        opts:
          - key: 'cpus_per_task'
            value: 8
```
#### Settings
- `settings.workspace`: The root-level directory for creating working directories for each step in the build process. For each execution of `lbatch-generator`, a directory is created at `{settings.workspace}/(hex8(utime))`, where `(hex8(utime))` is a hexadecimal representation of the UNIX time when the `lbatch-generator` script was executed.
- `settings.gcs_bucket`: The name of the Google Cloud Storage bucket that hosts the raw unprocessed data for all sites defined in this sites specification file and where all processed data will be posted.
- `settings.ltbase`: The path to the LASTools installation. Each generated job script sets the environment variable `LTBASE` equal to the value of `settings.ltbase`.
#### Sites
- `sites[].name`: The name of the site. The `name` must match the site name identified in the directory structure of the raw unprocessed data, `{STATE}/raw-unprocessed/{SITENAME}/{SUB}`.
- `sites[].state`: The state ID for the site. The `state` must match the state identified in the directory structure of the raw unprocessed data, `{STATE}/raw-unprocessed/{SITENAME}/{SUB}`.
- `sites[].sub`: The extension, either `las` or `laz`, of the files in the directory. The `sub` must match the subdirectory identified in the directory structure of the raw unprocessed data, `{STATE}/raw-unprocessed/{SITENAME}/{SUB}`.
- `sites[].espg`: Spatial projection to project data to when calling `lazer reproject`. This defines the `{ESPG}` template variable in the template workflows.
- `sites[].overrides[]`: A list object that is used to override the `cpus_per_task`, `partition`, `memory`, and `job_array_size` for each step. The structure is documented below.
**Site Overrides**

- `sites[].overrides[].step`: The name of the step, as defined in one of the `etc/workflow-*.yaml` files, to override default settings on. Note that `cpus_per_task`, `partition`, `memory`, and `job_array_size` are inherited from the `cli` defined in `etc/batch-generator.conf.yaml`.
- `sites[].overrides[].opts[].key`: The setting that you want to override. One of `cpus_per_task`, `partition`, `memory`, or `job_array_size`.
- `sites[].overrides[].opts[].value`: The value for the setting you want to override.
### Workflow Templates (Batch Generator Input)
Because the steps that need to be run depend on the nature of the input for each site, different workflow templates are provided. In general, the workflow templates address cases that depend on whether or not the raw data is in UTM coordinates and whether or not ground points are classified.

The workflow template files are JSON files that specify a list of objects, each defining a step in the workflow and its dependencies on other steps.
- `workflow[].name`: The name of the step in the workflow. This name must be unique within the given workflow.
- `workflow[].cli`: The command to run for this step. The command is defined in the Batch Generator configuration file.
- `workflow[].depends`: A list of workflow step names that this step depends on. If no dependencies exist, set to `[]`.
- `workflow[].opts`: Command-line options to substitute in for the `{OPTS}` template variable. Note that the `{OPTS}` variable is used in the `command` specifications in the Batch Generator configuration file.
- `workflow[].odir`: The output directory for this step. This value is swapped in for the `{OUTPUT_DIRECTORY}` template variable and, in most cases, is passed to the `-odir` flag for `lazer` commands.
- `workflow[].inputs` (Optional): If provided, the workflow `inputs` overrides the default `inputs` provided in the Batch Generator configuration file for the given `cli`.
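As a purely illustrative sketch (the step names are borrowed from the example sites specification and product list; the `cli`, `opts`, and `odir` values are hypothetical), a two-step workflow template might look like:

```json
[
  {
    "name": "tile",
    "cli": "tile",
    "depends": [],
    "opts": "",
    "odir": "tile"
  },
  {
    "name": "canopy-height",
    "cli": "canopy_height",
    "depends": ["tile"],
    "opts": "--hypothetical-option 1.0",
    "odir": "canopy-height"
  }
]
```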
There are currently four workflow templates defined in `etc/workflow-*.json`. The path to these files is controlled by the `--workflow-path` command-line argument for `lbatch-generator`.
### Batch Generator Configuration File (Batch Generator Input)
The Batch Generator Configuration File is used to define template commands that can be used to construct build pipelines. Each "pipeline command" consists of a command, a default number of CPUs per task, a default partition, a job type, and a regex for finding input files.

Each entry in the dictionary follows the schema below:
```yaml
cli:
  pipeline_command:
    command: 'echo "hello world"'
    cpus_per_task: 32
    partition: 'e2-standard-32'
    jobType: 'singleton'
    inputs: '*.laz'
```
The attributes of a pipeline command are defined below:
- `command` (string): The templated command to run for this pipeline command.
- `cpus_per_task` (number): The default number of CPUs to use for running this command.
- `partition` (string): The default Slurm partition to use for running this command.
- `jobType` (string): One of either `singleton` or `array`. This is used to determine whether the batch template is for a single-task batch job (`singleton`) or a job array (`array`).
- `inputs`: The files (under the input directory) to search for to generate the input file list dynamically.
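As a purely hypothetical illustration of how template variables appear inside a `command` string (the tool name and its flags are placeholders; only the template variables themselves are documented, under Template Variables below):

```yaml
cli:
  canopy_height:
    # Placeholder command string: {LTBASE}, {INPUT_LIST}, {OPTS}, and
    # {OUTPUT_DIRECTORY} are resolved by lbatch-generator at concretization time.
    command: '{LTBASE}/some-lidar-tool -i {INPUT_LIST} {OPTS} -odir {OUTPUT_DIRECTORY}'
    cpus_per_task: 16
    partition: 'e2-standard-16'
    jobType: 'array'
    inputs: '*.laz'
```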
### Jobs Dictionary (Batch Generator Output / Batch Manager Input)
The jobs dictionary file contains the concretized list of jobs that need to be scheduled. The schema for the jobs dictionary is defined in `etc/batch-dictionary.schema.json`.
### Working Directory Structure
When an `lbatch` pipeline is executed, a unique workspace is set up to avoid potential conflicts with other pipelines that may be running simultaneously. The root of the directory is set in the sites specification file, under `settings.workspace`. This variable allows you to control which file system is used to handle file IO; this location must be a mount point that is visible to all compute instances in your cluster.
Under the root directory, the `lbatch-generator` script creates a directory named as the hexadecimal representation of the UNIX time plus a random integer. This is the top-level working directory for a specific pipeline execution.
Under the top-level working directory, the following directories are created:

- `{STATE}/raw-unprocessed/{SITENAME}/{SUB}` - The directory used for storing data downloaded from Google Cloud Storage.
- `{STATE}/processing/{SITENAME}/{STEP}` - The directory used for storing temporary output for each step.
- `{STATE}/scripts/{SITENAME}/{STEP}` - The directory used for storing temporary batch scripts, stdout, and stderr for each step in the pipeline.
- `{STATE}/processed/{SITENAME}` - The directory used for storing the output from the complete pipeline execution. This directory is loaded back into Google Cloud Storage at the end of the pipeline execution.
In the directory hierarchy, the template variables are defined as follows:

- `{STATE}` - The state where the LiDAR data comes from. Resolves to `{site_spec.sites.state}`.
- `{SITENAME}` - The name of the LiDAR site. Resolves to `{site_spec.sites.name}`.
- `{STEP}` - The name of the step in the workflow. Resolves to `{workflow[i].name}`.
- `{SUB}` - The extension of the source files, either `las` or `laz`. Resolves to `{site_spec.sites.sub}`.
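As an illustration, for the `WA`/`fluidnumerics` example site (with a made-up hexadecimal run directory), the working directory layout would resemble:

```
/apps/workspace/6553f100/                  <- {WORKSPACE}
└── WA/
    ├── raw-unprocessed/fluidnumerics/laz/ <- downloaded source data
    ├── processing/fluidnumerics/{STEP}/   <- temporary per-step output
    ├── scripts/fluidnumerics/{STEP}/      <- batch scripts, stdout, and stderr
    └── processed/fluidnumerics/           <- final products pushed back to GCS
```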
### Template Variables
When specifying commands, you can use the following template variables:

- `{WORKSPACE}`: The path to a unique workspace created for processing a site. At runtime, `lbatch-generator` sets this variable equal to `{sites_spec.settings.workspace}/(hex(utime+random))`; the directory root is specified in your sites specification file and the subdirectory is the UNIX time plus a random integer, converted to hexadecimal notation. This ensures that each execution of the pipeline for a site works in an isolated directory set.
- `{SITE_PATH}`: The relative path to the unprocessed LiDAR data. At runtime this is set to `{sites_spec.sites.state}/raw-unprocessed/{sites_spec.sites.name}/{sites_spec.sites.sub}`. This is used to download data from Google Cloud Storage (`gs://{GCS_BUCKET}/{SITE_PATH}`) and to keep the directory structure consistent on the cluster (`{WORKSPACE}/{SITE_PATH}`).
- `{GCS_BUCKET}`: The name of the Google Cloud Storage bucket hosting the raw LiDAR data and where processed data will be pushed. At runtime this is set to `sites_spec.settings.gcs_bucket`.
- `{CPUS_PER_TASK}`: The number of vCPUs to make available for the command. The default value is set in the Batch Generator configuration file, but can be overridden via `sites_spec.sites.overrides.opts` by setting `sites_spec.sites.overrides.step` to the pipeline step name, `sites_spec.sites.overrides.opts.key` to `"cpus_per_task"`, and `sites_spec.sites.overrides.opts.value` to the number of vCPUs per task.
- `{INPUTS}`: The input file regex used to generate the input file list. Resolves to `{conf.{cli}.inputs}`.
- `{INPUT_DIRECTORY}`:
- `{INPUT_LIST}`: The input file list that is generated at runtime for the given step. When the step has a dependency, the input file list is resolved as the list of outputs generated by the first listed dependency.
- `{OPTS}`: Resolves to the options specified in one of the workflow template files (under `etc/`), in addition to other options that are determined based on the attributes of the input raw laz files.
- `{OUTPUT_DIRECTORY}`: The directory to post output to. For jobs that have `stage` set to `intermediate`, this resolves to `{WORKSPACE}/{sites_spec.site.state}/processing/{sites_spec.site.name}/{step_name}`; when `stage` is set to `processed`, this resolves to `{WORKSPACE}/{sites_spec.site.state}/processed/{sites_spec.site.name}/{ODIX}`.
- `{SITE}`:
- `{ESPG}`: Spatial projection to project data to when calling `lazer reproject`. This resolves to `{sites_spec.site.espg}`.
- `{LTBASE}`: Path to the LASTools bin directory. Resolves to `{sites_spec.settings.ltbase}`.
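For example, with the `example/las.example.yaml` sites specification shown earlier, the site-level template variables would resolve roughly as follows (the hexadecimal run directory is illustrative):

```
{GCS_BUCKET} -> nfo-lidar
{LTBASE}     -> /apps/LASTools/bin
{ESPG}       -> 32610
{SITE_PATH}  -> WA/raw-unprocessed/fluidnumerics/laz
{WORKSPACE}  -> /apps/workspace/6553f100
```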
### How Pipelines are Defined
A concretized pipeline begins its life as one of the template pipelines specified in the `etc/` directory:

- `etc/workflow-default.json`: A template pipeline that is used when data is in UTM coordinates and ground points are already classified.
- `etc/workflow-noground.json`: A template pipeline that is used when data is in UTM coordinates but ground points are not classified.
- `etc/workflow-notutm.json`: A template pipeline that is used when data is not in UTM coordinates.
The decision about which pipeline to use is made by the `setWorkflow` routine in `lbatch-generator`; this decision is based on information returned by `lazer.pdal.info(path)`.

Once the template pipeline is chosen, `lbatch-generator` proceeds by concretizing the template variables and resolving dependencies for each step. The result of running `lbatch-generator` is a concretized job dictionary.
### The Job Dictionary File
The job dictionary file is a set of concretized batch jobs to submit to Slurm. Each job is specified by the following attributes:

- `site`: The name of the LiDAR site being processed.
- `name`: A unique name for the job (within the scope of the dictionary).
- `script`: The path to the batch script that will be executed.
- `dependencies`: A list of job names (also defined in the same dictionary) that the job depends on.
- `partition`: The compute partition to submit the job to.
- `batch_options`: The options to send to `sbatch` at scheduling.
- `workspace`: The working directory to submit the job from. This is the same directory where stderr and stdout will be saved.
- `run_id`: The unique identifier for a specific pipeline execution.
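A single job entry might look something like the sketch below; the values are hypothetical and the exact field shapes are governed by `etc/batch-dictionary.schema.json`:

```json
{
  "site": "fluidnumerics",
  "name": "fluidnumerics-canopy-height",
  "script": "/apps/workspace/6553f100/WA/scripts/fluidnumerics/canopy-height/canopy-height.sh",
  "dependencies": ["fluidnumerics-tile"],
  "partition": "e2-standard-32",
  "batch_options": "--cpus-per-task=32",
  "workspace": "/apps/workspace/6553f100/WA/scripts/fluidnumerics/canopy-height",
  "run_id": "6553f100"
}
```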
The job dictionary is ingested by `lbatch-manager`, which schedules the complete pipeline using `sbatch` commands. The `lbatch-manager` resolves dependencies to Slurm job IDs so that execution of each step in the pipeline is carried out in the intended order.