Notes from Nick Fabina¶
Editor's note: Nick was previously the Principal Data Scientist at Salo Sciences; he left abruptly after about 14 months with the team. His software never reached the point of maturity where it could be integrated into operational modeling, but he left behind a series of notes from his experiments. This document consolidates them and has been minimally edited.
Architecture notes from CFO testing¶
Convolutions:
- use bottleneck 1x1 layers to reduce feature maps before 3x3 or nxn convolutions, increases efficiency and performance
- use more feature maps rather than more convolutions, e.g., use one 3x3 convolution @128 filters rather than two 3x3 convolutions @96 filters, increases efficiency and performance
- instead of 3x3 convolutions, use 3x1 -> 1x3 convolutions; instead of 5x5 convolutions, use 3x1 -> 1x3 -> 3x1 -> 1x3 convolutions, increases efficiency
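The bottleneck and factorization ideas above can be sketched with the tf.keras functional API; this is a minimal illustration, and the filter counts and input shape are arbitrary, not values from the experiments:

```python
import tensorflow as tf
from tensorflow.keras import layers

inputs = tf.keras.Input(shape=(64, 64, 256))

# Bottleneck: a 1x1 conv reduces 256 feature maps to 64 before the 3x3,
# which cuts the parameter count of the expensive spatial convolution.
x = layers.Conv2D(64, 1)(inputs)
x = layers.Conv2D(128, 3, padding="same")(x)

# Factorized alternative: replace the 3x3 with a 3x1 followed by a 1x3,
# covering the same receptive field with fewer parameters.
y = layers.Conv2D(64, 1)(inputs)
y = layers.Conv2D(128, (3, 1), padding="same")(y)
y = layers.Conv2D(128, (1, 3), padding="same")(y)

model = tf.keras.Model(inputs, [x, y])
```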
Activations:
- too many activations reduce network performance with remote sensing data
- do not use activations on 1x1 layers prior to full convolutions
- do not use activations on multiple branches before concatenating and applying 1x1s
- use softmax when you're transforming one-hot encoded data to combine with continuous data
- e.g., in inception blocks, activations should only be applied on the output layer
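As a sketch of the "activation only on the output layer" rule, here is a hypothetical inception-style block where the 1x1 reductions and parallel branches stay linear and a single ReLU follows the final 1x1; branch widths are illustrative:

```python
import tensorflow as tf
from tensorflow.keras import layers

def inception_block(x, filters):
    # No activations on the 1x1 reductions or the parallel branches;
    # one ReLU is applied after the final 1x1 on the concatenated output.
    b1 = layers.Conv2D(filters, 1)(x)
    b3 = layers.Conv2D(filters, 3, padding="same")(layers.Conv2D(filters, 1)(x))
    b5 = layers.Conv2D(filters, 5, padding="same")(layers.Conv2D(filters, 1)(x))
    merged = layers.Concatenate()([b1, b3, b5])
    out = layers.Conv2D(filters, 1)(merged)
    return layers.Activation("relu")(out)

inputs = tf.keras.Input(shape=(64, 64, 32))
model = tf.keras.Model(inputs, inception_block(inputs, 64))
```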
Pooling:
- try to use min pooling with remote sensing data via -tf.keras.layers.MaxPooling2D(...)(-layer), i.e., negate the inputs, max pool, and negate the result; did not give clear performance boosts in tests but seems like it should really help
- larger pool sizes, e.g., 5, did not seem to do as well as multiple pools @3 with convolutions between
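Since Keras has no built-in min pooling layer, the trick above relies on the identity min(x) = -max(-x); a small helper might look like this:

```python
import tensorflow as tf
from tensorflow.keras import layers

def min_pool_2d(x, pool_size=3, strides=2, padding="same"):
    # Min pooling via min(x) = -max(-x): negate, max pool, negate back.
    return -layers.MaxPooling2D(pool_size, strides=strides, padding=padding)(-x)
```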
Colorspace transformations, i.e., transforming inputs using 1x1 convs:
- very helpful at improving performance when used before initial convolutions
- found success with using initial, expanded colorspace transformation on inputs, and a second, simplified colorspace transformation to be used as a passthrough layer
- can improve performance if inputs are transformed prior to being passed through to output layers
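A minimal sketch of the two-transformation setup described above, assuming a hypothetical 4-band input; the filter counts are arbitrary:

```python
import tensorflow as tf
from tensorflow.keras import layers

inputs = tf.keras.Input(shape=(64, 64, 4))  # hypothetical 4-band imagery

# Expanded colorspace transformation: 1x1 convs remix the bands
# before the initial full convolutions.
expanded = layers.Conv2D(32, 1, activation="relu")(inputs)

# Simplified transformation, kept aside to use as a passthrough layer
# near the outputs.
passthrough = layers.Conv2D(8, 1)(inputs)

model = tf.keras.Model(inputs, [expanded, passthrough])
```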
Passthrough layers, i.e., passing earlier layers to later layers:
- key to localization later in the network, avoiding blur
- too much passthrough can hurt performance
- very effective when used in output layer, i.e., final convolutions
Output layers:
- does your output look blurry? it could be because you have too many convolutions and not enough passthrough layers to add localization
- convolving the architecture output with inputs or transformed inputs can help (see colorspace transformations)
- having limited or no 3x3 convolutions in the output layer can help if you need fine detail
- don't skimp on filters here, having too few will really hurt you quickly, having too many will hurt you but much more slowly
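The passthrough and output-layer notes above can be combined into one sketch: an early 1x1 transform of the inputs is concatenated onto the deep features before the final convolutions to restore localization. Shapes and filter counts are illustrative, not the tested configuration:

```python
import tensorflow as tf
from tensorflow.keras import layers

inputs = tf.keras.Input(shape=(64, 64, 4))
early = layers.Conv2D(16, 1)(inputs)  # simplified transform kept for passthrough

x = layers.Conv2D(64, 3, padding="same", activation="relu")(inputs)
x = layers.Conv2D(64, 3, padding="same", activation="relu")(x)

# Passthrough into the output layer: concatenating the early layer keeps
# fine spatial detail that repeated 3x3 convolutions tend to blur away.
x = layers.Concatenate()([x, early])
outputs = layers.Conv2D(1, 1, activation="sigmoid")(x)

model = tf.keras.Model(inputs, outputs)
```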
Architectures:
- additional passes through the network, e.g., w-net, showed only minimal benefits in limited testing; seems like it should be useful a priori, so this could be a matter of finding the right implementation or use case
- separate branches for data with the same spatial window at different resolutions seemed successful
- encoder/decoder structure, e.g., u-net, did not seem to perform better than inception blocks, surprisingly
Multiple responses:
- needs a lot of work to figure out what's best, might not even be possible to get good performance here relative to individual models, the key would be finding an architecture that improves performance or generalizes better
- simply using the same structure with one filter per output class: not good enough unless all responses use identical features
- so the key is figuring out how to 1) convolve features together but split the network into different components at a certain point to get better individual predictions, or 2) make predictions for each response and convolve those predictions together to improve final predictions somehow, etc
- side-note: we can report metrics for different responses by having one output per response from the network, even if they use the same features, as the metrics will be applied to each output individually
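The side-note about per-response metrics can be sketched as a multi-output model: Keras applies the loss and metrics to each named output individually, even when the outputs share features. Names and shapes here are hypothetical:

```python
import tensorflow as tf
from tensorflow.keras import layers

inputs = tf.keras.Input(shape=(64, 64, 8))
shared = layers.Conv2D(32, 3, padding="same", activation="relu")(inputs)

# One named output per response; metrics are then reported per output
# even though the upstream features are shared.
out_a = layers.Conv2D(1, 1, activation="sigmoid", name="response_a")(shared)
out_b = layers.Conv2D(1, 1, activation="sigmoid", name="response_b")(shared)

model = tf.keras.Model(inputs, [out_a, out_b])
model.compile(
    optimizer="adam",
    loss={"response_a": "binary_crossentropy", "response_b": "binary_crossentropy"},
    metrics={"response_a": ["accuracy"], "response_b": ["accuracy"]},
)
```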
TODOs for testing¶
- alternate pooling blocks with convolving blocks, as an alternative to inception blocks with both included
- use larger data windows, rather than 64px at 10m, use 96px or 128px to see if spatial patterns become more important/helpful
- test what happens if you have different branches through the network for 1) raw inputs and inputs that have been 2) min pooled and 3) max pooled; the extremes are hardest to get so I wonder what would happen if the different branches could focus on the extremes, perhaps even create a response / loss function that gives bonus points for the max pooled branch hitting the zeros or the min pooled branch hitting the highs, or use the different branches to show the uncertainty of the values
- test separable convs -> fewer parameters, like conv2d but you're doing convolutions across each band, individually, and then convolving between bands, so perhaps you're getting better information? this might be most useful near the beginning of the network before the band information has been spatially mixed too greatly, tried as a replacement for Conv2D and didn't work with a simple substitution, but still potential there
- test depthwise convs -> one half of a separable conv, keeping bands separate for longer to maybe find different features across different bands, tried as a replacement for Conv2D and didn't work with a simple substitution, but still potential there
- test 7x1 -> 1x7 convs, see if getting more global information helps to make local predictions, this would be an alternative to multiple 3x3s, seeing if we can do more with shallower and wider networks
- experiment with both data dropout (needed for features with nodata values so that model learns how to handle) and network dropout (could help make more robust networks)
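For the separable and depthwise TODOs, the relevant Keras layers already exist; this is only a shape sketch of the two variants, with arbitrary band and filter counts:

```python
import tensorflow as tf
from tensorflow.keras import layers

inputs = tf.keras.Input(shape=(64, 64, 8))

# SeparableConv2D: a spatial convolution applied to each band individually,
# followed by a 1x1 convolution that mixes across bands.
sep = layers.SeparableConv2D(32, 3, padding="same")(inputs)

# DepthwiseConv2D: the per-band half alone; bands stay separate, so with
# depth_multiplier=1 the 8 input bands yield 8 output channels.
dw = layers.DepthwiseConv2D(3, padding="same")(inputs)

model = tf.keras.Model(inputs, [sep, dw])
```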
Thoughts on module structure¶
After experimenting locally, I've found that it may be easiest to have three levels of complexity:
- basic components: extremely simple and easy to write, but some benefit to having them standardized and tested; e.g., convolutional layers with batch normalization and dropout
- intermediate components: slightly more complex, but still relatively simple, building blocks for full architectures; e.g., colorspace transformations, dense blocks, inception blocks, output layers
- full architectures: always accept a list of inputs and return a list of outputs, constructed from basic and intermediate components; expect there to be a lot of variety here and customizations for individual projects so make it as easy as possible to build new architectures with basic and intermediate components
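A basic component of the kind described above might look like the following; the name and defaults are hypothetical, just to show the level of granularity intended:

```python
import tensorflow as tf
from tensorflow.keras import layers

def conv_bn_dropout(x, filters, kernel_size=3, dropout_rate=0.1):
    # "Basic component" sketch: convolution with batch normalization and
    # dropout, standardized so intermediate blocks and full architectures
    # can reuse it instead of rewriting the same three lines.
    x = layers.Conv2D(filters, kernel_size, padding="same")(x)
    x = layers.BatchNormalization()(x)
    x = layers.Activation("relu")(x)
    return layers.Dropout(dropout_rate)(x)
```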