Exploring form and subform classifiers¶
This notebook contains scripts for training pixel-based classifiers to try to predict vegetation forms and sub-forms with global climate and vegetation data.
I recommend running this notebook on an instance with many CPUs: decision tree models are CPU-hungry, we have many samples to work with, and many hyperparameter combinations to evaluate.
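Most of the estimators below can parallelize across cores. A quick check of what's available (a minimal sketch; whether the giants tuners parallelize automatically is an assumption worth verifying):
import os
# report the available CPU cores; sklearn estimators can use them all via n_jobs=-1
print(f"CPUs available: {os.cpu_count()}")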
# data io
import os
import pandas as pd
import geopandas as gpd
import elapid as ela
# modeling
import giants
import numpy as np
from sklearn import metrics
from sklearn.model_selection import train_test_split
from sklearn.decomposition import PCA
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import RobustScaler
# plotting
import artlo
from artlo import plt
# paths
base = '/home/cba/forestobs-nsw'
plots = os.path.join(base, 'docs', 'img')
sample_path = 'gs://nsw-scratch/vector/nsw-class-samples-200k-annotated.gpkg'
meta_path = 'gs://nsw-scratch/raster/nsw-veg-raw.csv'
Reading and formatting the data¶
The metadata maps the numeric form/sub-form codes to their string names, which we'll use for plotting and interpretation.
We'll also be dropping class 0, which is the 'Cleared' formation. This refers to urban and agricultural areas, which are typically considered 'non-burnable' in bushfire models. We'll map these with a separate process.
# read the data into memory
meta = pd.read_csv(meta_path)
samples = gpd.read_file(sample_path)
# link up the metadata class names to their numerical specs
class_nums = list(meta['ClassNumbe'].unique())
form_nums = list(meta['Formatio_1'].unique())
class_nums.sort()
form_nums.sort()
class_names = [meta['ClassName'][meta['ClassNumbe'] == class_num].iloc[0] for class_num in class_nums]
form_names = [meta['FormationN'][meta['Formatio_1'] == form_num].iloc[0] for form_num in form_nums]
# remove all 'cleared' samples
veg = samples[samples['formations'] != class_nums[class_names.index('Cleared')]].reset_index(drop=True)
# drop na values
veg.dropna(inplace=True)
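Before modeling, it's worth a quick sanity check that the 'Cleared' class is gone and a look at how many samples remain per formation (a minimal sketch using the columns above):
# count the remaining samples per formation class
print(veg['formations'].value_counts().sort_index())
print(f"{len(veg)} samples remaining")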
Model training¶
For our model training, we'll evaluate a series of models:
- Predicting vegetation formations
    - one-against-all approach
    - random forest and gradient boosting classifiers
    - raw covariates and PCA-transformed covariates
- Predicting vegetation sub-forms
    - one-against-some: an n-class model, where n is the number of sub-forms in a formation
    - random forest and gradient boosting classifiers
    - raw covariates and PCA-transformed covariates
We'll create a single train/test split. The training data is first used in a k-fold cross-validation grid search over hyperparameters, which is handled by the giants package.
# data prep
x = veg.drop(columns=['formations', 'classes', 'geometry']).to_numpy()
y = veg['formations'].astype('int').to_numpy()
# run the pca transform
pca_scaler = Pipeline([("normalize", RobustScaler()), ("pca", PCA(whiten=True))])
xpca = pca_scaler.fit_transform(x)
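# optional: check how much variance the leading components capture
# (the "pca" step name matches the Pipeline defined above)
evr = pca_scaler.named_steps['pca'].explained_variance_ratio_
print(np.cumsum(evr)[:10])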
# set weights so each class is scaled inversely proportional to its frequency in the data
unique_vals = np.unique(y)
counts = [(y == val).sum() for val in unique_vals]
form_weights = [1 - (count / len(y)) for count in counts]
label_weights = np.zeros((len(y),))
for val, weight in zip(unique_vals, form_weights):
    label_weights[y == val] = weight
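# aside: scikit-learn ships a related 'balanced' scheme that weights each class
# by n_samples / (n_classes * class_count) rather than 1 - frequency;
# computed here only for comparison
from sklearn.utils.class_weight import compute_sample_weight
balanced_weights = compute_sample_weight('balanced', y)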
# create the train test splits
# note: these two calls draw independent random partitions, so the raw and
# PCA test sets contain different samples (the report supports below differ)
test_size = 0.33
xtrain, xtest, ytrain, ytest, wtrain, wtest = train_test_split(x, y, label_weights, test_size=test_size)
xptrain, xptest, yptrain, yptest, wptrain, wptest = train_test_split(xpca, y, label_weights, test_size=test_size)
# create the model tuners to explore different model types
tuner = giants.model.Tuner(xtrain, ytrain)
pc_tuner = giants.model.Tuner(xptrain, yptrain)
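The giants Tuner handles the grid search for us. For reference, a roughly equivalent search built directly on scikit-learn would look like the sketch below (4 folds are assumed to match the tuner output later in this notebook; the Tuner's internals may differ).
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV
# a hand-rolled analog of tuner.RandomForestClassifier(...)
param_grid = {'n_estimators': [10, 100], 'min_samples_split': [2, 0.1]}
search = GridSearchCV(RandomForestClassifier(), param_grid, scoring='f1_weighted', cv=4, n_jobs=-1)
search.fit(xtrain, ytrain, sample_weight=wtrain)
print(search.best_params_, search.best_score_)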
Exploratory plotting¶
Let's look at the data!
artlo.plot.set_style('salo-light')
# plot a histogram of formation frequency
bins = np.unique(y)
n_bins = len(bins)
counts = [(y == val).sum() for val in bins]
density = [100 * count / len(y) for count in counts]
colors = [artlo.color.cmaps.gummy(b/n_bins) for b in bins]
np.random.shuffle(colors)
plt.figure(figsize=(5.5, 6.5))
plt.bar(bins, density, color=colors)
# add text labels to each bin
for xloc, yloc in zip(bins, density):
    plt.text(xloc, yloc + 0.5, f"{yloc:0.1f}%", fontsize='x-small', horizontalalignment='center')
# labeling
plt.xticks(bins, form_names[1:], rotation=90)
plt.ylabel('Relative frequency (%)')
plt.title("NSW vegetation form abundance")
plt.tight_layout()
plt.savefig(os.path.join(plots, 'nsw-veg-form-abundance.png'), dpi=200)
# let's 2d plot the PCA data and color by class
n_samples = 10000
sample_idxs = np.random.randint(0, len(y), n_samples)
pc1 = 7
pc2 = 8
ax1 = xpca[sample_idxs, pc1]
ax2 = xpca[sample_idxs, pc2]
ysample = y[sample_idxs]  # renamed from `samples` to avoid shadowing the GeoDataFrame
for form_val, form_name, form_color in zip(bins, form_names[1:], colors):
    form_idx = ysample == form_val
    plt.scatter(ax1[form_idx], ax2[form_idx], color=form_color, label=form_name, alpha=0.5)
plt.legend(loc='upper right', ncol=1, fontsize='xx-small')
plt.xlabel(f"PC {pc1}")
plt.ylabel(f"PC {pc2}")
plt.tight_layout()
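To keep this figure alongside the histogram, we can save it the same way (the filename here is illustrative):
plt.savefig(os.path.join(plots, 'nsw-veg-form-pca-scatter.png'), dpi=200)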
Random forest modeling with original covariates¶
# review the param grid
rfc_grid = giants.config.ParamGrids.RandomForestClassifier
rfc_grid
{'criterion': ('gini', 'entropy'),
'n_estimators': (10, 100, 200),
'max_features': ('sqrt', 'log2', None),
'max_depth': (1, 10, None),
'min_samples_split': (2, 0.1, 0.01)}
# update grid search params
rfc_grid.update(n_estimators=(10, 100), min_samples_split=(2, 0.1))
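After the update, we can verify the size of the search space, which should match the candidate count the tuner reports below (a quick sketch):
# 2 criteria x 2 n_estimators x 3 max_features x 3 max_depths x 2 min_samples_splits = 72
n_candidates = int(np.prod([len(v) for v in rfc_grid.values()]))
print(n_candidates)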
# tune and evaluate the random forest model
tuner.RandomForestClassifier(scorer='f1_weighted', param_grid=rfc_grid, fit_params={'sample_weight': wtrain})
Fitting 4 folds for each of 72 candidates, totalling 288 fits
# re-fit the best model on the full training set (the tuner only fits k-fold subsets)
tuner.best_estimator.fit(xtrain, ytrain, sample_weight=wtrain)
ypred = tuner.best_estimator.predict(xtest)
print(tuner.best_params)
print(metrics.classification_report(ytest, ypred, target_names=form_names[1:]))
# save it for later
model_path = os.path.join(base, 'model-rfc-origdata.ela')
ela.save_object(tuner.best_estimator, model_path)
{'criterion': 'entropy', 'max_depth': None, 'max_features': None, 'min_samples_split': 2, 'n_estimators': 100}
precision recall f1-score support
Rainforests 0.56 0.34 0.42 492
Wet sclerophyll forests (Shrubby subformation) 0.49 0.45 0.47 1081
Wet sclerophyll forests (Grassy subformation) 0.53 0.54 0.54 1458
Grassy woodlands 0.58 0.45 0.51 1494
Grasslands 0.67 0.63 0.65 1064
Dry sclerophyll forests (Shrub/grass subformation) 0.59 0.60 0.59 2324
Dry sclerophyll forests (Shrubby subformation) 0.66 0.79 0.72 3991
Heathlands 0.44 0.15 0.23 123
Alpine complex 0.79 0.74 0.76 136
Freshwater wetlands 0.68 0.35 0.46 1181
Forested wetlands 0.59 0.43 0.50 837
Saline wetlands 0.41 0.25 0.31 55
Semi-arid woodlands (Grassy subformation) 0.64 0.47 0.54 3900
Semi-arid woodlands (Shrubby subformation) 0.79 0.84 0.81 9542
Arid shrublands (Chenopod subformation) 0.73 0.82 0.77 5691
Arid shrublands (Acacia subformation) 0.75 0.81 0.78 7284
accuracy 0.70 40653
macro avg 0.62 0.54 0.57 40653
weighted avg 0.70 0.70 0.69 40653
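The report summarizes per-class precision and recall; to see which formations get mistaken for one another, a row-normalized confusion matrix helps (a minimal sketch using the test predictions above):
# rows are true classes, normalized so each row sums to 1
cm = metrics.confusion_matrix(ytest, ypred, normalize='true')
disp = metrics.ConfusionMatrixDisplay(cm, display_labels=form_names[1:])
disp.plot(xticks_rotation=90, colorbar=False)
plt.tight_layout()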
Gradient boosting model with original covariates¶
# review the param grid
gbc_grid = giants.config.ParamGrids.GradientBoostingClassifier
gbc_grid
{'n_estimators': (10, 100, 200),
'learning_rate': (0.01, 0.1, 0.5),
'max_features': ('sqrt', 'log2', None),
'max_depth': (1, 10, None),
'min_samples_split': (2, 0.1, 0.01)}
# update the grid search params
gbc_grid.update(n_estimators=(10, 100), learning_rate=(0.1, 0.5), min_samples_split=(2, 0.1))
# tune the gradient boosting model
tuner.GradientBoostingClassifier(scorer='f1_weighted', param_grid=gbc_grid, fit_params={'sample_weight': wtrain})
Fitting 4 folds for each of 72 candidates, totalling 288 fits
# train on the full dataset
tuner.best_estimator.fit(xtrain, ytrain, sample_weight=wtrain)
ypred = tuner.best_estimator.predict(xtest)
# evaluate
print(tuner.best_params)
print(metrics.classification_report(ytest, ypred, target_names=form_names[1:]))
# save
model_path = os.path.join(base, 'model-gbc-origdata.ela')
ela.save_object(tuner.best_estimator, model_path)
{'learning_rate': 0.1, 'max_depth': None, 'max_features': 'sqrt', 'min_samples_split': 2, 'n_estimators': 100}
precision recall f1-score support
Rainforests 0.44 0.33 0.38 492
Wet sclerophyll forests (Shrubby subformation) 0.47 0.43 0.45 1081
Wet sclerophyll forests (Grassy subformation) 0.53 0.52 0.52 1458
Grassy woodlands 0.58 0.45 0.51 1494
Grasslands 0.70 0.58 0.63 1064
Dry sclerophyll forests (Shrub/grass subformation) 0.60 0.57 0.58 2324
Dry sclerophyll forests (Shrubby subformation) 0.65 0.79 0.71 3991
Heathlands 0.17 0.18 0.18 123
Alpine complex 0.70 0.72 0.71 136
Freshwater wetlands 0.73 0.32 0.45 1181
Forested wetlands 0.59 0.42 0.49 837
Saline wetlands 0.18 0.22 0.20 55
Semi-arid woodlands (Grassy subformation) 0.66 0.45 0.54 3900
Semi-arid woodlands (Shrubby subformation) 0.78 0.85 0.81 9542
Arid shrublands (Chenopod subformation) 0.73 0.83 0.77 5691
Arid shrublands (Acacia subformation) 0.75 0.81 0.78 7284
accuracy 0.70 40653
macro avg 0.58 0.53 0.54 40653
weighted avg 0.69 0.70 0.69 40653
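Both tree ensembles expose impurity-based feature importances, which hint at which covariates drive the predictions. A sketch, with column names recovered from the veg dataframe used to build x above:
# rank the covariates by the fitted model's impurity-based importance
feature_names = veg.drop(columns=['formations', 'classes', 'geometry']).columns
importances = tuner.best_estimator.feature_importances_
for idx in np.argsort(importances)[::-1][:10]:
    print(f"{feature_names[idx]}: {importances[idx]:0.3f}")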
Random forest model with PCA covariates¶
pc_tuner.RandomForestClassifier(scorer='f1_weighted', param_grid=rfc_grid, fit_params={"sample_weight": wptrain})
Fitting 4 folds for each of 72 candidates, totalling 288 fits
# re-fit
pc_tuner.best_estimator.fit(xptrain, yptrain, sample_weight=wptrain)
yppred = pc_tuner.best_estimator.predict(xptest)
# evaluate
print(metrics.classification_report(yptest, yppred, target_names=form_names[1:]))
# save
model_path = os.path.join(base, 'model-rfc-pcadata.ela')
ela.save_object(pc_tuner.best_estimator, model_path)
precision recall f1-score support
Rainforests 0.56 0.35 0.43 473
Wet sclerophyll forests (Shrubby subformation) 0.50 0.43 0.46 1097
Wet sclerophyll forests (Grassy subformation) 0.52 0.52 0.52 1446
Grassy woodlands 0.60 0.44 0.51 1583
Grasslands 0.66 0.50 0.57 1100
Dry sclerophyll forests (Shrub/grass subformation) 0.56 0.53 0.54 2189
Dry sclerophyll forests (Shrubby subformation) 0.62 0.80 0.70 4000
Heathlands 0.29 0.04 0.07 122
Alpine complex 0.74 0.72 0.73 121
Freshwater wetlands 0.68 0.31 0.42 1105
Forested wetlands 0.60 0.38 0.47 808
Saline wetlands 0.39 0.13 0.20 52
Semi-arid woodlands (Grassy subformation) 0.61 0.42 0.50 3848
Semi-arid woodlands (Shrubby subformation) 0.74 0.82 0.78 9612
Arid shrublands (Chenopod subformation) 0.70 0.76 0.73 5817
Arid shrublands (Acacia subformation) 0.69 0.76 0.72 7280
accuracy 0.67 40653
macro avg 0.59 0.49 0.52 40653
weighted avg 0.66 0.67 0.66 40653
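One caveat with the PCA models: predicting on new data requires the same fitted transform, so it's worth persisting the pca_scaler pipeline alongside the estimator (the filename here is illustrative):
# save the fitted normalize+PCA pipeline so new data can be transformed identically
scaler_path = os.path.join(base, 'pca-scaler.ela')
ela.save_object(pca_scaler, scaler_path)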
Gradient boosting model with PCA covariates¶
pc_tuner.GradientBoostingClassifier(scorer='f1_weighted', param_grid=gbc_grid, fit_params={"sample_weight": wptrain})
Fitting 4 folds for each of 72 candidates, totalling 288 fits
# re-fit
pc_tuner.best_estimator.fit(xptrain, yptrain, sample_weight=wptrain)
yppred = pc_tuner.best_estimator.predict(xptest)
# evaluate
print(metrics.classification_report(yptest, yppred, target_names=form_names[1:]))
# save
model_path = os.path.join(base, 'model-gbc-pcadata.ela')
ela.save_object(pc_tuner.best_estimator, model_path)
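With all four models saved, a quick way to compare them is to reload each one and score its own held-out split (a sketch; this assumes elapid's load_object, the counterpart to the save_object calls above, and remember that the raw and PCA test sets contain different samples):
# compare weighted F1 scores across the saved models
candidates = [
    ('rfc-origdata', xtest, ytest),
    ('gbc-origdata', xtest, ytest),
    ('rfc-pcadata', xptest, yptest),
    ('gbc-pcadata', xptest, yptest),
]
for name, xt, yt in candidates:
    model = ela.load_object(os.path.join(base, f'model-{name}.ela'))
    score = metrics.f1_score(yt, model.predict(xt), average='weighted')
    print(f"{name}: {score:0.3f}")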