Exploring form and subform classifiers¶
This notebook contains scripts for training pixel-based classifiers to try and predict vegetation forms and sub-forms with global climate and vegetation data.
I recommend running this notebook on an instance with lots of CPUs. Decision tree models are CPU hungry, we have many sammples to predict with, and many hyperparameter combinations to evaluate.
# data io
import os
import pandas as pd
import geopandas as gpd
import elapid as ela
# modeling
import giants
import numpy as np
from sklearn import metrics
from sklearn.model_selection import train_test_split
from sklearn.decomposition import PCA
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import RobustScaler
# plotting
import artlo
from artlo import plt
# paths
base = '/home/cba/forestobs-nsw'
plots = os.path.join(base, 'docs', 'img')
sample_path = 'gs://nsw-scratch/vector/nsw-class-samples-200k-annotated.gpkg'
meta_path = 'gs://nsw-scratch/raster/nsw-veg-raw.csv'
Reading and formatting the data¶
The metadata contains linkages between numeric form/sub-forms and their string names, which we'll use for plotting and interpretation.
We'll also be dropping class 0, which is the 'Cleared' formation. This refers to urban and agricultural areas, which are typically considered 'non-burnable' in bushfire models. We'll map these with a separate process.
# read the data into memory
meta = pd.read_csv(meta_path)
samples = gpd.read_file(sample_path)
# link up the metadata class names to their numerical specs
class_nums = list(meta['ClassNumbe'].unique())
form_nums = list(meta['Formatio_1'].unique())
class_nums.sort()
form_nums.sort()
class_names = [meta['ClassName'][meta['ClassNumbe'] == class_num].iloc[0] for class_num in class_nums]
form_names = [meta['FormationN'][meta['Formatio_1'] == form_num].iloc[0] for form_num in form_nums]
# remove all 'cleared' samples
veg = samples[samples['formations'] != class_nums[class_names.index('Cleared')]].reset_index(drop=True)
# drop na values
veg.dropna(inplace=True)
Model training¶
For our model training, we'll evaluate a series of models:
- Predicting vegetation formations
- one-against-all approach
- random forest, gradient boosting classifiers
- raw covariates, pca-transformed covariates
- Predicting vegetation sub-forms
- one-against-some: creating an n-class model where n is the number of subforms in a formation
- random forest, gradient boosting classifiers
- raw covariates, pca-transformed covariates
We'll create one train/test split - the training data will first be used in k-fold cross val to run a hyperparameter grid search. This is done with the giants package, which handles the optimization search.
# data prep
x = veg.drop(columns=['formations', 'classes', 'geometry']).to_numpy()
y = veg['formations'].astype('int').to_numpy()
# run the pca transform
pca_scaler = Pipeline([("normalize", RobustScaler()), ("pca", PCA(whiten=True))])
xpca = pca_scaler.fit_transform(x)
# set weights so each class is scaled inversely proportional to it's frequency in the data
unique_vals = np.unique(y)
counts = [(y == val).sum() for val in unique_vals]
form_weights = [1 - (count / len(y)) for count in counts]
label_weights = np.zeros((len(y),))
for val, weight in zip(unique_vals, form_weights):
label_weights[y == val] = weight
# create the train test splits
test_size = 0.33
xtrain, xtest, ytrain, ytest, wtrain, wtest = train_test_split(x, y, label_weights, test_size=test_size)
xptrain, xptest, yptrain, yptest, wptrain, wptest = train_test_split(xpca, y, label_weights, test_size=test_size)
# create the model tuners to explore different model types
tuner = giants.model.Tuner(xtrain, ytrain)
pc_tuner = giants.model.Tuner(xptrain, yptrain)
Exploratory plotting¶
Let's look at the data!
artlo.plot.set_style('salo-light')
# plot a histogram of formation frequency
bins = np.unique(y)
n_bins = len(bins)
counts = [(y == val).sum() for val in bins]
density = [100 * count / len(y) for count in counts]
colors = [artlo.color.cmaps.gummy(b/n_bins) for b in bins]
np.random.shuffle(colors)
plt.figure(figsize=(5.5, 6.5))
plt.bar(bins, density, color=colors)
# add text labels to each bin
for xloc, yloc in zip(bins, density):
plt.text(xloc, yloc+0.5, f"{yloc:0.1f}%", fontsize='x-small', horizontalalignment='center')
# labeling
plt.xticks(bins, form_names[1:], rotation=90)
plt.ylabel('Relative frequency (%)')
plt.title("NSW vegetation form abundance")
plt.tight_layout()
plt.savefig(os.path.join(plots, f'nsw-veg-form-abundance.png'), dpi=200)
# let's 2d plot the PCA data and color by class
n_samples = 10000
sample_idxs = np.random.randint(0, len(y), n_samples)
pc1 = 7
pc2 = 8
ax1 = xpca[sample_idxs, pc1]
ax2 = xpca[sample_idxs, pc2]
samples = y[sample_idxs]
for form_val, form_name, form_color in zip(bins, form_names[1:], colors):
form_idx = samples == form_val
plt.scatter(ax1[form_idx], ax2[form_idx], color=form_color, label=form_name, alpha=0.5)
plt.legend(loc='upper right', ncol=1, fontsize='xx-small')
plt.xlabel(f"PC {pc1}")
plt.ylabel(f"PC {pc2}")
plt.tight_layout()
Random forest modeling with original covariates¶
# review the param grid
rfc_grid = giants.config.ParamGrids.RandomForestClassifier
rfc_grid
{'criterion': ('gini', 'entropy'), 'n_estimators': (10, 100, 200), 'max_features': ('sqrt', 'log2', None), 'max_depth': (1, 10, None), 'min_samples_split': (2, 0.1, 0.01)}
# update grid search params
rfc_grid.update(n_estimators=(10, 100), min_samples_split=(2, 0.1))
# tune and evaluate the random forest model
tuner.RandomForestClassifier(scorer='f1_weighted', param_grid=rfc_grid, fit_params={'sample_weight': wtrain})
Fitting 4 folds for each of 72 candidates, totalling 288 fits
# train on the full dataset (previous model just runs k-fold x-val)
tuner.best_estimator.fit(xtrain, ytrain, sample_weight=wtrain)
ypred = tuner.best_estimator.predict(xtest)
print(tuner.best_params)
print(metrics.classification_report(ytest, ypred, target_names=form_names[1:]))
# save it for later
model_path = os.path.join(base, 'model-rfc-origdata.ela')
ela.save_object(tuner.best_estimator, model_path)
{'criterion': 'entropy', 'max_depth': None, 'max_features': None, 'min_samples_split': 2, 'n_estimators': 100} precision recall f1-score support Rainforests 0.56 0.34 0.42 492 Wet sclerophyll forests (Shrubby subformation) 0.49 0.45 0.47 1081 Wet sclerophyll forests (Grassy subformation) 0.53 0.54 0.54 1458 Grassy woodlands 0.58 0.45 0.51 1494 Grasslands 0.67 0.63 0.65 1064 Dry sclerophyll forests (Shrub/grass subformation) 0.59 0.60 0.59 2324 Dry sclerophyll forests (Shrubby subformation) 0.66 0.79 0.72 3991 Heathlands 0.44 0.15 0.23 123 Alpine complex 0.79 0.74 0.76 136 Freshwater wetlands 0.68 0.35 0.46 1181 Forested wetlands 0.59 0.43 0.50 837 Saline wetlands 0.41 0.25 0.31 55 Semi-arid woodlands (Grassy subformation) 0.64 0.47 0.54 3900 Semi-arid woodlands (Shrubby subformation) 0.79 0.84 0.81 9542 Arid shrublands (Chenopod subformation) 0.73 0.82 0.77 5691 Arid shrublands (Acacia subformation) 0.75 0.81 0.78 7284 accuracy 0.70 40653 macro avg 0.62 0.54 0.57 40653 weighted avg 0.70 0.70 0.69 40653
Gradient boosting model with original covariates¶
# review the param grid
gbc_grid = giants.config.ParamGrids.GradientBoostingClassifier
gbc_grid
{'n_estimators': (10, 100, 200), 'learning_rate': (0.01, 0.1, 0.5), 'max_features': ('sqrt', 'log2', None), 'max_depth': (1, 10, None), 'min_samples_split': (2, 0.1, 0.01)}
# update the grid search params
gbc_grid.update(n_estimators=(10, 100), learning_rate=(0.1, 0.5), min_samples_split=(2, 0.1))
# tune the gradient boosting model
tuner.GradientBoostingClassifier(scorer='f1_weighted', param_grid=gbc_grid, fit_params={'sample_weight': wtrain})
Fitting 4 folds for each of 72 candidates, totalling 288 fits
# train on the full dataset
tuner.best_estimator.fit(xtrain, ytrain, sample_weight=wtrain)
ypred = tuner.best_estimator.predict(xtest)
# evaluate
print(tuner.best_params)
print(metrics.classification_report(ytest, ypred, target_names=form_names[1:]))
# save
model_path = os.path.join(base, 'model-gbc-origdata.ela')
ela.save_object(tuner.best_estimator, model_path)
{'learning_rate': 0.1, 'max_depth': None, 'max_features': 'sqrt', 'min_samples_split': 2, 'n_estimators': 100} precision recall f1-score support Rainforests 0.44 0.33 0.38 492 Wet sclerophyll forests (Shrubby subformation) 0.47 0.43 0.45 1081 Wet sclerophyll forests (Grassy subformation) 0.53 0.52 0.52 1458 Grassy woodlands 0.58 0.45 0.51 1494 Grasslands 0.70 0.58 0.63 1064 Dry sclerophyll forests (Shrub/grass subformation) 0.60 0.57 0.58 2324 Dry sclerophyll forests (Shrubby subformation) 0.65 0.79 0.71 3991 Heathlands 0.17 0.18 0.18 123 Alpine complex 0.70 0.72 0.71 136 Freshwater wetlands 0.73 0.32 0.45 1181 Forested wetlands 0.59 0.42 0.49 837 Saline wetlands 0.18 0.22 0.20 55 Semi-arid woodlands (Grassy subformation) 0.66 0.45 0.54 3900 Semi-arid woodlands (Shrubby subformation) 0.78 0.85 0.81 9542 Arid shrublands (Chenopod subformation) 0.73 0.83 0.77 5691 Arid shrublands (Acacia subformation) 0.75 0.81 0.78 7284 accuracy 0.70 40653 macro avg 0.58 0.53 0.54 40653 weighted avg 0.69 0.70 0.69 40653
Random forest model with PCA covariates¶
pc_tuner.RandomForestClassifier(scorer='f1_weighted', param_grid=rfc_grid, fit_params={"sample_weight": wptrain})
Fitting 4 folds for each of 72 candidates, totalling 288 fits
# re-fit
pc_tuner.best_estimator.fit(xptrain, yptrain, sample_weight=wptrain)
yppred = pc_tuner.best_estimator.predict(xptest)
# evaluate
print(metrics.classification_report(yptest, yppred, target_names=form_names[1:]))
# save
model_path = os.path.join(base, 'model-rfc-pcadata.ela')
ela.save_object(pc_tuner.best_estimator, model_path)
precision recall f1-score support Rainforests 0.56 0.35 0.43 473 Wet sclerophyll forests (Shrubby subformation) 0.50 0.43 0.46 1097 Wet sclerophyll forests (Grassy subformation) 0.52 0.52 0.52 1446 Grassy woodlands 0.60 0.44 0.51 1583 Grasslands 0.66 0.50 0.57 1100 Dry sclerophyll forests (Shrub/grass subformation) 0.56 0.53 0.54 2189 Dry sclerophyll forests (Shrubby subformation) 0.62 0.80 0.70 4000 Heathlands 0.29 0.04 0.07 122 Alpine complex 0.74 0.72 0.73 121 Freshwater wetlands 0.68 0.31 0.42 1105 Forested wetlands 0.60 0.38 0.47 808 Saline wetlands 0.39 0.13 0.20 52 Semi-arid woodlands (Grassy subformation) 0.61 0.42 0.50 3848 Semi-arid woodlands (Shrubby subformation) 0.74 0.82 0.78 9612 Arid shrublands (Chenopod subformation) 0.70 0.76 0.73 5817 Arid shrublands (Acacia subformation) 0.69 0.76 0.72 7280 accuracy 0.67 40653 macro avg 0.59 0.49 0.52 40653 weighted avg 0.66 0.67 0.66 40653
Gradient boosting model with PCA covariates¶
pc_tuner.GradientBoostingClassifier(scorer='f1_weighted', param_grid=gbc_grid, fit_params={"sample_weight": wptrain})
Fitting 4 folds for each of 72 candidates, totalling 288 fits
re-fit¶
pc_tuner.best_estimator.fit(xptrain, yptrain, sample_weight=wptrain) yppred = pc_tuner.best_estimator.predict(xptest)
evaluate¶
print(metrics.classification_report(yptest, yppred, target_names=form_names[1:]))
save¶
model_path = os.path.join(base, 'model-gbc-pcadata.ela') ela.save_object(pc_tuner.best_estimator, model_path)