# Geo Data Naming Conventions
## Background
We're developing stricter file naming conventions to simplify searching for and working with internal geospatial data. We want to transition to API/database-driven workflows in the future, but for the next few months it will be simpler and faster to standardize file paths.
For now, the primary workflow we're aiming to support is:
- Process airborne LiDAR data from multiple sites (i.e., small regional AOIs) into multiple metrics of vegetation structure
- Spatially and temporally co-register these airborne data with multi-scale, multi-sensor satellite observations
- Extract `(n_samples, height, width, n_bands)` archives from the LiDAR and satellite data for each site
- Merge these site archives into collections for modeling
There are many ways to mix and match these data, and many different processing settings to track. We often use the same satellite datasets to model multiple vegetation metrics (e.g., modeling canopy height & cover with Sentinel1/Sentinel2). And we also often use different satellite datasets to model vegetation metrics at different resolutions (e.g., using Planet for 3m models, Sentinel1/Sentinel2 for 10m models).
This process is more of an art than a science. We often experiment with which datasets to combine, how many sample locations to draw, what resolution and window size to use - which places a lot of burden on the end user to coordinate and track data linkages.
These new filename conventions are designed to make it easy for humans and computers alike to trace data between sources, especially for model training and experimentation. And when we build out API/database-centric workflows, I expect we'll continue to store data paths with these conventions in addition to making file metadata easily searchable.
## What we want
To create unique identifiers for lidar, satellite, and deep-learning datasets that semantically track the connections between them in a human-readable format.
## What we don't want
To use filenames as a substitute for metadata.
- A number of important geospatial features are encoded in metadata: resolution, geographic bounds, xsize/ysize.
- Some of these features may be included in filenames (e.g. resolution), but should not be referenced analytically.
- The reason to include metadata in the filename is to make it easy for users to semantically query filenames based on metadata attributes.
## Guidelines
- Limit the use of subdirectories wherever possible: prefer flat file structures.
- Use hyphens (`-`) to separate terms.
- Use underscores (`_`) to join compound terms/numbers (e.g., `au_nsw` for a country/state geography, `0801_0930` for a date range).
- Use lowercase for all terms, even abbreviations or proper nouns (e.g., in `geography`, `site`, or `constellation` values).
- Avoid using terms that can be easily modified in the data itself (e.g., don't specify the spatial projection, which could easily be changed by reprojection, requiring the file to be renamed).
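As a quick sketch of the separator rules above, a shell one-liner can assemble a conforming basename (all values here are hypothetical):

```shell
# Hyphens separate terms; underscores join compound terms (hypothetical values)
geography="au_nsw"       # country code + subregion, joined by an underscore
site="tumbarumba"        # hypothetical site name
year="2020"
date_range="0801_0930"   # compound date range, joined by an underscore
basename="${geography}-${site}-${year}-${date_range}"
echo "$basename"
```

Because hyphens only ever separate terms, splitting a basename on `-` always recovers the original fields.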
## Glossary
`archive`: we extract data of shape `(n_samples, height, width, n_bands)` to train deep learning models from sites around the world, and each site's samples are referred to as an "archive".
`constellation`: common name for a satellite constellation/mission (e.g., `sentinel2`, `planetscope`).
`geography`: place-based name for the region where the data were sourced from. Public lidar data are typically collected by state or federal agencies, and there is often redundancy in site names between regions (e.g., there is a `klamath` site in both California and Oregon). For these data, name the geography as `{country_code}_{subregion}` (for example, use `au_nsw` for New South Wales in Australia, `us_wa` for Washington state) to ensure geography names will scale globally. For satellite data, this refers to the full extent of the spatial coverage (e.g., `west_coast`). Don't use coordinates or bounding box information.
`library_name`: we merge archives from multiple sites into a "library", combining samples from different sites during model training. There are many ways to mix and match archives, which we can't prescribe a priori (merging by geography, year, feature combinations, etc.). `library_name` should then be a semantically descriptive reference for the archives merged.
`measurement_type`: the processing level or data type of satellite imagery. Planet provides data as `visual` (uint8) or `analytic` (uint16) products, for example. Could also distinguish between `thermal` and `vswir` for Sentinel2.
`metric`: the ecological/environmental variable represented by the dataset. Each metric has associated `units` and, often, a known min/max range. Should generally follow Essential Biodiversity Variable conventions: metrics are typically a) biological, b) sensitive to change, and c) ecosystem agnostic (as in, consistent everywhere). Example lidar data metrics can be found here.
`site`: place-based name describing a local/regional spatial extent where the data were collected. The `{geography}-{site}-{year}` combination should be sufficiently unique to avoid conflicts between datasets.
`time_of_acquisition`: unformatted, semantic reference to an image collection time or a collection period. Could be a month (`jan`), a date for a single satellite image (`0825`), or a date range for an image mosaic (`0101_0331`). `datetime` should be recorded in metadata, but not in filenames.
`year`: data acquisition year.
## LiDAR paths
Processed data paths:
gs://forestobs-lidar-processed/{geography}-{site}-{year}-{metric}.{extension}
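For instance, a hypothetical Washington site processed to a canopy height raster would resolve the template like this (the site and metric values are illustrative, not real datasets):

```shell
# Fill the processed-lidar template with hypothetical values
geography="us_wa"
site="olympic"
year="2021"
metric="canopyheight"
path="gs://forestobs-lidar-processed/${geography}-${site}-${year}-${metric}.tif"
echo "$path"
```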
Raw data paths:
gs://forestobs-lidar-raw/{geography}-{site}-{year}/*.{laz|las}
gs://forestobs-lidar-raw/{geography}-{site}-{year}/*.json
gs://forestobs-lidar-raw/{geography}-{site}-{year}/collection.json
Where each raw `.laz` file has an accompanying `.json` STAC item record, and `collection.json` is a STAC collection organizing all items in this directory. This collection should include additional information on the data provider (e.g., USGS, USFS) and the path to the raw data (e.g., `s3://path/to/3dep/data`).
Raw paths include a subdirectory because las file archives contain many files with non-standard naming structures. Mixing these raw paths in the same directory structure would be too difficult to handle because we still rely on filename-based input file queries with `lbatch`.
## Satellite paths
gs://forestobs-satellite/{constellation}-{measurement_type}-{geography}-{year}-{time_of_acquisition}.{extension}
Example: gs://forestobs-satellite/planetscope-analytic-west_coast-2020-spring.tif
Satellite datasets often have additional metadata and mask files that match the dimensions/extent of the satellite measurements. These additional datasets should be specified within the `{measurement_type}` name, extending it with an `_`:
gs://forestobs-satellite/planetscope-analytic_udm-west_coast-2020-spring.tif
gs://forestobs-satellite/planetscope-analytic_hot_pixel_mask-west_coast-2020-spring.tif
This lets you distinguish between searching exclusively for measurement data and searching for all associated data:
gsutil ls gs://forestobs-satellite/planetscope-analytic-* # search for all Planet analytic measurements
gsutil ls gs://forestobs-satellite/planetscope-analytic_* # search for all analytic and metadata files
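The only difference between the two queries is the character after `analytic`, which shell pattern matching makes explicit. A small sketch (filenames hypothetical, and this helper is not part of any existing tooling):

```shell
# Classify a basename by the separator following the measurement_type stem
classify() {
  case "$1" in
    planetscope-analytic-*) echo "measurement" ;;  # hyphen: measurement data only
    planetscope-analytic_*) echo "ancillary" ;;    # underscore: masks/metadata
    *) echo "other" ;;
  esac
}
classify "planetscope-analytic-west_coast-2020-spring.tif"      # measurement
classify "planetscope-analytic_udm-west_coast-2020-spring.tif"  # ancillary
```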
## Deep learning archives
Point sample locations:
gs://forestobs-samples/{site}-{year}-{time_of_acquisition}.gpkg
These sample locations are typically created on a site-by-site basis and can be used multiple times: to draw samples from multiple metrics, and from multiple feature datasets. These don't need to include many metadata-specific terms.
Full sample archives, however, have many distinct properties to track, which we partially nest in subdirectories:
gs://forestobs-archives/{metric}-{resolution}-response/{site}-{year}-{time_of_acquisition}-{window_size}.response.npy
gs://forestobs-archives/{metric}-{resolution}-{features}/{site}-{year}-{time_of_acquisition}-{window_size}.features.npy
In this case, `response` is hardcoded, and `{features}` is a variable. Archives are broken up into subdirectories because we often experiment with multiple feature data combinations that co-align with the same response data archive. Example:
gs://forestobs-archives/canopyheight-00010m-response/san_juan-2019-spring-64_64.response.npy
gs://forestobs-archives/canopyheight-00010m-sentinel1_sentinel2/san_juan-2019-spring-64_64.features.npy
gs://forestobs-archives/canopyheight-00010m-sentinel1_sentinel2_ned/san_juan-2019-spring-64_64.features.npy
Several `myco` routines produce and expect matching response/feature dataset names, which is why it's important to maintain consistent basenames between feature and response data while also providing space for multiple feature data combinations.
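Because the basenames match, a feature path can be derived from its response path by swapping only the subdirectory suffix and file extension. A sketch using the example paths above (this is plain string substitution, not a `myco` function):

```shell
# Derive a feature-archive path from a response-archive path by substitution
response="gs://forestobs-archives/canopyheight-00010m-response/san_juan-2019-spring-64_64.response.npy"
features=$(printf '%s' "$response" \
  | sed -e 's|-response/|-sentinel1_sentinel2/|' \
        -e 's|\.response\.npy$|.features.npy|')
echo "$features"
```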
These site-based archives are then merged into a "library" of archives that serve as the input for training deep learning models:
gs://forestobs-libraries/{library_name}-{metric}-{resolution}.response.npy
gs://forestobs-libraries/{library_name}-{metric}-{resolution}.features.npy
These libraries often merge data from multiple years and multiple acquisition times, and for now it's easiest to remain non-prescriptive about filename conventions here.
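For example, a hypothetical library merging several west-coast canopy height archives might resolve the templates like this (the `library_name` value is user-chosen, not prescribed):

```shell
# Hypothetical library paths; only the template structure is fixed
library_name="west_coast_sites"
metric="canopyheight"
resolution="00010m"
response_path="gs://forestobs-libraries/${library_name}-${metric}-${resolution}.response.npy"
features_path="gs://forestobs-libraries/${library_name}-${metric}-${resolution}.features.npy"
echo "$response_path"
echo "$features_path"
```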