Exploring form and subform classifiers¶

This notebook contains scripts for training pixel-based classifiers to try and predict vegetation forms and sub-forms with global climate and vegetation data.

I recommend running this notebook on an instance with lots of CPUs. Decision tree models are CPU hungry, we have many sammples to predict with, and many hyperparameter combinations to evaluate.

In [1]:

            
                Copied!
                
                    
                    
                
                

        
# data io
import os
import pandas as pd
import geopandas as gpd
import elapid as ela

# modeling
import giants
import numpy as np
from sklearn import metrics
from sklearn.model_selection import train_test_split
from sklearn.decomposition import PCA
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import RobustScaler

# plotting
import artlo
from artlo import plt
# data io
import os
import pandas as pd
import geopandas as gpd
import elapid as ela

# modeling
import giants
import numpy as np
from sklearn import metrics
from sklearn.model_selection import train_test_split
from sklearn.decomposition import PCA
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import RobustScaler

# plotting
import artlo
from artlo import plt

In [2]:

            
                Copied!
                
# paths
base = '/home/cba/forestobs-nsw'
plots = os.path.join(base, 'docs', 'img')
sample_path = 'gs://nsw-scratch/vector/nsw-class-samples-200k-annotated.gpkg'
meta_path = 'gs://nsw-scratch/raster/nsw-veg-raw.csv'
# paths
base = '/home/cba/forestobs-nsw'
plots = os.path.join(base, 'docs', 'img')
sample_path = 'gs://nsw-scratch/vector/nsw-class-samples-200k-annotated.gpkg'
meta_path = 'gs://nsw-scratch/raster/nsw-veg-raw.csv'

Reading and formatting the data¶

The metadata contains linkages between numeric form/sub-forms and their string names, which we'll use for plotting and interpretation.

We'll also be dropping class 0, which is the 'Cleared' formation. This refers to urban and agricultural areas, which are typically considered 'non-burnable' in bushfire models. We'll map these with a separate process.

In [3]:

            
                Copied!
                
# read the data into memory
meta = pd.read_csv(meta_path)
samples = gpd.read_file(sample_path)
# read the data into memory
meta = pd.read_csv(meta_path)
samples = gpd.read_file(sample_path)

In [4]:

            
                Copied!
                
# link up the metadata class names to their numerical specs
class_nums = list(meta['ClassNumbe'].unique())
form_nums = list(meta['Formatio_1'].unique())

class_nums.sort()
form_nums.sort()

class_names = [meta['ClassName'][meta['ClassNumbe'] == class_num].iloc[0] for class_num in class_nums]
form_names = [meta['FormationN'][meta['Formatio_1'] == form_num].iloc[0] for form_num in form_nums]
# link up the metadata class names to their numerical specs
class_nums = list(meta['ClassNumbe'].unique())
form_nums = list(meta['Formatio_1'].unique())

class_nums.sort()
form_nums.sort()

class_names = [meta['ClassName'][meta['ClassNumbe'] == class_num].iloc[0] for class_num in class_nums]
form_names = [meta['FormationN'][meta['Formatio_1'] == form_num].iloc[0] for form_num in form_nums]

In [5]:

            
                Copied!
                
# remove all 'cleared' samples
veg = samples[samples['formations'] != class_nums[class_names.index('Cleared')]].reset_index(drop=True)

# drop na values
veg.dropna(inplace=True)
# remove all 'cleared' samples
veg = samples[samples['formations'] != class_nums[class_names.index('Cleared')]].reset_index(drop=True)

# drop na values
veg.dropna(inplace=True)

Model training¶

For our model training, we'll evaluate a series of models:

Predicting vegetation formations
- one-against-all approach
- random forest, gradient boosting classifiers
- raw covariates, pca-transformed covariates
Predicting vegetation sub-forms
- one-against-some: creating an n-class model where n is the number of subforms in a formation
- random forest, gradient boosting classifiers
- raw covariates, pca-transformed covariates

We'll create one train/test split - the training data will first be used in k-fold cross val to run a hyperparameter grid search. This is done with the giants package, which handles the optimization search.

In [6]:

            
                Copied!
                
                    
                    
                
                

        
# data prep
x = veg.drop(columns=['formations', 'classes', 'geometry']).to_numpy()
y = veg['formations'].astype('int').to_numpy()

# run the pca transform
pca_scaler = Pipeline([("normalize", RobustScaler()), ("pca", PCA(whiten=True))])
xpca = pca_scaler.fit_transform(x)

# set weights so each class is scaled inversely proportional to it's frequency in the data
unique_vals = np.unique(y)
counts = [(y == val).sum() for val in unique_vals]
form_weights = [1 - (count / len(y)) for count in counts]
label_weights = np.zeros((len(y),))
for val, weight in zip(unique_vals, form_weights):
    label_weights[y == val] = weight

# create the train test splits
test_size = 0.33
xtrain, xtest, ytrain, ytest, wtrain, wtest = train_test_split(x, y, label_weights, test_size=test_size)
xptrain, xptest, yptrain, yptest, wptrain, wptest = train_test_split(xpca, y, label_weights, test_size=test_size)

# create the  model tuners to explore different model types
tuner = giants.model.Tuner(xtrain, ytrain)
pc_tuner = giants.model.Tuner(xptrain, yptrain)
# data prep
x = veg.drop(columns=['formations', 'classes', 'geometry']).to_numpy()
y = veg['formations'].astype('int').to_numpy()

# run the pca transform
pca_scaler = Pipeline([("normalize", RobustScaler()), ("pca", PCA(whiten=True))])
xpca = pca_scaler.fit_transform(x)

# set weights so each class is scaled inversely proportional to it's frequency in the data
unique_vals = np.unique(y)
counts = [(y == val).sum() for val in unique_vals]
form_weights = [1 - (count / len(y)) for count in counts]
label_weights = np.zeros((len(y),))
for val, weight in zip(unique_vals, form_weights):
    label_weights[y == val] = weight

# create the train test splits
test_size = 0.33
xtrain, xtest, ytrain, ytest, wtrain, wtest = train_test_split(x, y, label_weights, test_size=test_size)
xptrain, xptest, yptrain, yptest, wptrain, wptest = train_test_split(xpca, y, label_weights, test_size=test_size)

# create the  model tuners to explore different model types
tuner = giants.model.Tuner(xtrain, ytrain)
pc_tuner = giants.model.Tuner(xptrain, yptrain)

Exploratory plotting¶

Let's look at the data!

In [7]:

            
                Copied!
                
artlo.plot.set_style('salo-light')
artlo.plot.set_style('salo-light')

In [8]:

            
                Copied!
                
                    
                    
                
                

        
# plot a histogram of formation frequency
bins = np.unique(y)
n_bins = len(bins)
counts = [(y == val).sum() for val in bins]
density = [100 * count / len(y) for count in counts]
colors = [artlo.color.cmaps.gummy(b/n_bins) for b in bins]
np.random.shuffle(colors)

plt.figure(figsize=(5.5, 6.5))
plt.bar(bins, density, color=colors)

# add text labels to each bin
for xloc, yloc in zip(bins, density):
    plt.text(xloc, yloc+0.5, f"{yloc:0.1f}%", fontsize='x-small', horizontalalignment='center')

# labeling
plt.xticks(bins, form_names[1:], rotation=90)
plt.ylabel('Relative frequency (%)')
plt.title("NSW vegetation form abundance")
plt.tight_layout()
plt.savefig(os.path.join(plots, f'nsw-veg-form-abundance.png'), dpi=200)
# plot a histogram of formation frequency
bins = np.unique(y)
n_bins = len(bins)
counts = [(y == val).sum() for val in bins]
density = [100 * count / len(y) for count in counts]
colors = [artlo.color.cmaps.gummy(b/n_bins) for b in bins]
np.random.shuffle(colors)

plt.figure(figsize=(5.5, 6.5))
plt.bar(bins, density, color=colors)

# add text labels to each bin
for xloc, yloc in zip(bins, density):
    plt.text(xloc, yloc+0.5, f"{yloc:0.1f}%", fontsize='x-small', horizontalalignment='center')

# labeling
plt.xticks(bins, form_names[1:], rotation=90)
plt.ylabel('Relative frequency (%)')
plt.title("NSW vegetation form abundance")
plt.tight_layout()
plt.savefig(os.path.join(plots, f'nsw-veg-form-abundance.png'), dpi=200)

In [9]:

            
                Copied!
                
                    
                    
                
                

        
# let's 2d plot the PCA data and color by class
n_samples = 10000
sample_idxs = np.random.randint(0, len(y), n_samples)

pc1 = 7
pc2 = 8
ax1 = xpca[sample_idxs, pc1]
ax2 = xpca[sample_idxs, pc2]
samples = y[sample_idxs]

for form_val, form_name, form_color in zip(bins, form_names[1:], colors):
    form_idx = samples == form_val
    plt.scatter(ax1[form_idx], ax2[form_idx], color=form_color, label=form_name, alpha=0.5)
    
plt.legend(loc='upper right', ncol=1, fontsize='xx-small')
plt.xlabel(f"PC {pc1}")
plt.ylabel(f"PC {pc2}")
plt.tight_layout()
# let's 2d plot the PCA data and color by class
n_samples = 10000
sample_idxs = np.random.randint(0, len(y), n_samples)

pc1 = 7
pc2 = 8
ax1 = xpca[sample_idxs, pc1]
ax2 = xpca[sample_idxs, pc2]
samples = y[sample_idxs]

for form_val, form_name, form_color in zip(bins, form_names[1:], colors):
    form_idx = samples == form_val
    plt.scatter(ax1[form_idx], ax2[form_idx], color=form_color, label=form_name, alpha=0.5)
    
plt.legend(loc='upper right', ncol=1, fontsize='xx-small')
plt.xlabel(f"PC {pc1}")
plt.ylabel(f"PC {pc2}")
plt.tight_layout()

Random forest modeling with original covariates¶

In [10]:

            
                Copied!
                
# review the param grid
rfc_grid = giants.config.ParamGrids.RandomForestClassifier
rfc_grid
# review the param grid
rfc_grid = giants.config.ParamGrids.RandomForestClassifier
rfc_grid

Out[10]:

{'criterion': ('gini', 'entropy'),
 'n_estimators': (10, 100, 200),
 'max_features': ('sqrt', 'log2', None),
 'max_depth': (1, 10, None),
 'min_samples_split': (2, 0.1, 0.01)}

In [11]:

            
                Copied!
                
# update grid search params
rfc_grid.update(n_estimators=(10, 100), min_samples_split=(2, 0.1))

# tune and evaluate the random forest model
tuner.RandomForestClassifier(scorer='f1_weighted', param_grid=rfc_grid, fit_params={'sample_weight': wtrain})
# update grid search params
rfc_grid.update(n_estimators=(10, 100), min_samples_split=(2, 0.1))

# tune and evaluate the random forest model
tuner.RandomForestClassifier(scorer='f1_weighted', param_grid=rfc_grid, fit_params={'sample_weight': wtrain})

Fitting 4 folds for each of 72 candidates, totalling 288 fits

In [12]:

            
                Copied!
                
# train on the full dataset (previous model just runs k-fold x-val)
tuner.best_estimator.fit(xtrain, ytrain, sample_weight=wtrain)
ypred = tuner.best_estimator.predict(xtest)

print(tuner.best_params)
print(metrics.classification_report(ytest, ypred, target_names=form_names[1:]))

# save it for later
model_path = os.path.join(base, 'model-rfc-origdata.ela')
ela.save_object(tuner.best_estimator, model_path)
# train on the full dataset (previous model just runs k-fold x-val)
tuner.best_estimator.fit(xtrain, ytrain, sample_weight=wtrain)
ypred = tuner.best_estimator.predict(xtest)

print(tuner.best_params)
print(metrics.classification_report(ytest, ypred, target_names=form_names[1:]))

# save it for later
model_path = os.path.join(base, 'model-rfc-origdata.ela')
ela.save_object(tuner.best_estimator, model_path)

{'criterion': 'entropy', 'max_depth': None, 'max_features': None, 'min_samples_split': 2, 'n_estimators': 100}
                                                    precision    recall  f1-score   support

                                       Rainforests       0.56      0.34      0.42       492
    Wet sclerophyll forests (Shrubby subformation)       0.49      0.45      0.47      1081
     Wet sclerophyll forests (Grassy subformation)       0.53      0.54      0.54      1458
                                  Grassy woodlands       0.58      0.45      0.51      1494
                                        Grasslands       0.67      0.63      0.65      1064
Dry sclerophyll forests (Shrub/grass subformation)       0.59      0.60      0.59      2324
    Dry sclerophyll forests (Shrubby subformation)       0.66      0.79      0.72      3991
                                        Heathlands       0.44      0.15      0.23       123
                                    Alpine complex       0.79      0.74      0.76       136
                               Freshwater wetlands       0.68      0.35      0.46      1181
                                 Forested wetlands       0.59      0.43      0.50       837
                                  Saline  wetlands       0.41      0.25      0.31        55
         Semi-arid woodlands (Grassy subformation)       0.64      0.47      0.54      3900
        Semi-arid woodlands (Shrubby subformation)       0.79      0.84      0.81      9542
           Arid shrublands (Chenopod subformation)       0.73      0.82      0.77      5691
             Arid shrublands (Acacia subformation)       0.75      0.81      0.78      7284

                                          accuracy                           0.70     40653
                                         macro avg       0.62      0.54      0.57     40653
                                      weighted avg       0.70      0.70      0.69     40653

Gradient boosting model with original covariates¶

In [13]:

            
                Copied!
                
# review the param grid
gbc_grid = giants.config.ParamGrids.GradientBoostingClassifier
gbc_grid
# review the param grid
gbc_grid = giants.config.ParamGrids.GradientBoostingClassifier
gbc_grid

Out[13]:

{'n_estimators': (10, 100, 200),
 'learning_rate': (0.01, 0.1, 0.5),
 'max_features': ('sqrt', 'log2', None),
 'max_depth': (1, 10, None),
 'min_samples_split': (2, 0.1, 0.01)}

In [14]:

            
                Copied!
                
# update the grid search params
gbc_grid.update(n_estimators=(10, 100), learning_rate=(0.1, 0.5), min_samples_split=(2, 0.1))

# tune the gradient boosting model
tuner.GradientBoostingClassifier(scorer='f1_weighted', param_grid=gbc_grid, fit_params={'sample_weight': wtrain})
# update the grid search params
gbc_grid.update(n_estimators=(10, 100), learning_rate=(0.1, 0.5), min_samples_split=(2, 0.1))

# tune the gradient boosting model
tuner.GradientBoostingClassifier(scorer='f1_weighted', param_grid=gbc_grid, fit_params={'sample_weight': wtrain})

Fitting 4 folds for each of 72 candidates, totalling 288 fits

In [15]:

            
                Copied!
                
                    
                    
                
                

        
# train on the full dataset
tuner.best_estimator.fit(xtrain, ytrain, sample_weight=wtrain)
ypred = tuner.best_estimator.predict(xtest)

# evaluate
print(tuner.best_params)
print(metrics.classification_report(ytest, ypred, target_names=form_names[1:]))

# save
model_path = os.path.join(base, 'model-gbc-origdata.ela')
ela.save_object(tuner.best_estimator, model_path)
# train on the full dataset
tuner.best_estimator.fit(xtrain, ytrain, sample_weight=wtrain)
ypred = tuner.best_estimator.predict(xtest)

# evaluate
print(tuner.best_params)
print(metrics.classification_report(ytest, ypred, target_names=form_names[1:]))

# save
model_path = os.path.join(base, 'model-gbc-origdata.ela')
ela.save_object(tuner.best_estimator, model_path)

{'learning_rate': 0.1, 'max_depth': None, 'max_features': 'sqrt', 'min_samples_split': 2, 'n_estimators': 100}
                                                    precision    recall  f1-score   support

                                       Rainforests       0.44      0.33      0.38       492
    Wet sclerophyll forests (Shrubby subformation)       0.47      0.43      0.45      1081
     Wet sclerophyll forests (Grassy subformation)       0.53      0.52      0.52      1458
                                  Grassy woodlands       0.58      0.45      0.51      1494
                                        Grasslands       0.70      0.58      0.63      1064
Dry sclerophyll forests (Shrub/grass subformation)       0.60      0.57      0.58      2324
    Dry sclerophyll forests (Shrubby subformation)       0.65      0.79      0.71      3991
                                        Heathlands       0.17      0.18      0.18       123
                                    Alpine complex       0.70      0.72      0.71       136
                               Freshwater wetlands       0.73      0.32      0.45      1181
                                 Forested wetlands       0.59      0.42      0.49       837
                                  Saline  wetlands       0.18      0.22      0.20        55
         Semi-arid woodlands (Grassy subformation)       0.66      0.45      0.54      3900
        Semi-arid woodlands (Shrubby subformation)       0.78      0.85      0.81      9542
           Arid shrublands (Chenopod subformation)       0.73      0.83      0.77      5691
             Arid shrublands (Acacia subformation)       0.75      0.81      0.78      7284

                                          accuracy                           0.70     40653
                                         macro avg       0.58      0.53      0.54     40653
                                      weighted avg       0.69      0.70      0.69     40653

Random forest model with PCA covariates¶

In [20]:

            
                Copied!
                
pc_tuner.RandomForestClassifier(scorer='f1_weighted', param_grid=rfc_grid, fit_params={"sample_weight": wptrain})
pc_tuner.RandomForestClassifier(scorer='f1_weighted', param_grid=rfc_grid, fit_params={"sample_weight": wptrain})

Fitting 4 folds for each of 72 candidates, totalling 288 fits

In [21]:

            
                Copied!
                
# re-fit
pc_tuner.best_estimator.fit(xptrain, yptrain, sample_weight=wptrain)
yppred = pc_tuner.best_estimator.predict(xptest)

# evaluate
print(metrics.classification_report(yptest, yppred, target_names=form_names[1:]))

# save
model_path = os.path.join(base, 'model-rfc-pcadata.ela')
ela.save_object(pc_tuner.best_estimator, model_path)
# re-fit
pc_tuner.best_estimator.fit(xptrain, yptrain, sample_weight=wptrain)
yppred = pc_tuner.best_estimator.predict(xptest)

# evaluate
print(metrics.classification_report(yptest, yppred, target_names=form_names[1:]))

# save
model_path = os.path.join(base, 'model-rfc-pcadata.ela')
ela.save_object(pc_tuner.best_estimator, model_path)

                                                    precision    recall  f1-score   support

                                       Rainforests       0.56      0.35      0.43       473
    Wet sclerophyll forests (Shrubby subformation)       0.50      0.43      0.46      1097
     Wet sclerophyll forests (Grassy subformation)       0.52      0.52      0.52      1446
                                  Grassy woodlands       0.60      0.44      0.51      1583
                                        Grasslands       0.66      0.50      0.57      1100
Dry sclerophyll forests (Shrub/grass subformation)       0.56      0.53      0.54      2189
    Dry sclerophyll forests (Shrubby subformation)       0.62      0.80      0.70      4000
                                        Heathlands       0.29      0.04      0.07       122
                                    Alpine complex       0.74      0.72      0.73       121
                               Freshwater wetlands       0.68      0.31      0.42      1105
                                 Forested wetlands       0.60      0.38      0.47       808
                                  Saline  wetlands       0.39      0.13      0.20        52
         Semi-arid woodlands (Grassy subformation)       0.61      0.42      0.50      3848
        Semi-arid woodlands (Shrubby subformation)       0.74      0.82      0.78      9612
           Arid shrublands (Chenopod subformation)       0.70      0.76      0.73      5817
             Arid shrublands (Acacia subformation)       0.69      0.76      0.72      7280

                                          accuracy                           0.67     40653
                                         macro avg       0.59      0.49      0.52     40653
                                      weighted avg       0.66      0.67      0.66     40653

Gradient boosting model with PCA covariates¶

In [ ]:

            
                Copied!
                
pc_tuner.GradientBoostingClassifier(scorer='f1_weighted', param_grid=gbc_grid, fit_params={"sample_weight": wptrain})
pc_tuner.GradientBoostingClassifier(scorer='f1_weighted', param_grid=gbc_grid, fit_params={"sample_weight": wptrain})

Fitting 4 folds for each of 72 candidates, totalling 288 fits

re-fit¶

pc_tuner.best_estimator.fit(xptrain, yptrain, sample_weight=wptrain) yppred = pc_tuner.best_estimator.predict(xptest)

evaluate¶

print(metrics.classification_report(yptest, yppred, target_names=form_names[1:]))

save¶

model_path = os.path.join(base, 'model-gbc-pcadata.ela') ela.save_object(pc_tuner.best_estimator, model_path)