Introduction

In this post we explore feature importance for the Arabica vs Robusta coffee arbitrage. As input features we use Arabica and Robusta consumer stocks broken down by:

GCA - Green Coffee Association
EU - European Union
Japan
Total
Importing Consumption
Importing S/C - Total/Importing Consumption

To add a measure of possible seasonality we also include the number of days until expiry of the first contract in the arbitrage.

We model the spread and the ratio separately, as they can behave quite differently. Feature importance is assessed with classification models: we bin the prices into deciles and train the model to predict the price decile. We use six techniques to compare the feature importance of the trained classifiers:

MDI - Mean Decrease Impurity
MDA - Mean Decrease Accuracy
SFI - Single Feature Importance
CFI - Clustered Feature Importance
SHAP - Shapley Feature Importance
PCA - Principal Component Analysis
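
As a rough illustration of the decile classification and of two of the techniques listed above (MDI and MDA), here is a minimal sketch using scikit-learn. The data is synthetic and the column names (`gca_stocks`, `eu_stocks`, `days_to_expiry`) are hypothetical stand-ins, not the actual inputs of this post. MDA is approximated here with scikit-learn's permutation importance on a simple hold-out split, which may differ from the exact procedure used for the results below.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Synthetic stand-in for the stock features; column names are hypothetical.
n = 600
X = pd.DataFrame({
    "gca_stocks": rng.normal(size=n),
    "eu_stocks": rng.normal(size=n),
    "days_to_expiry": rng.integers(1, 60, size=n).astype(float),
})
# Synthetic "price" driven mostly by the first feature.
price = 2.0 * X["gca_stocks"] + 0.5 * X["eu_stocks"] + rng.normal(scale=0.5, size=n)

# Bin the price into deciles: label 0 is the lowest decile, 9 the highest.
decile = pd.qcut(price, q=10, labels=False)

X_train, X_test, y_train, y_test = train_test_split(
    X, decile, test_size=0.3, random_state=0
)
clf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_train, y_train)

# MDI: impurity-based importance, computed in sample by the forest itself.
mdi = pd.Series(clf.feature_importances_, index=X.columns)

# MDA (approximation): drop in held-out accuracy when one feature is shuffled.
perm = permutation_importance(clf, X_test, y_test, n_repeats=10, random_state=0)
mda = pd.Series(perm.importances_mean, index=X.columns)

print(pd.DataFrame({"MDI": mdi, "MDA": mda}))
```

MDI is computed on the training data and sums to one across features, whereas the permutation-based score is measured on held-out data and can be negative for uninformative features.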

The PCA method is used to calculate the weighted tau statistic. The idea is to measure how strongly the principal components and the chosen features are correlated. The higher the weighted tau, the better.
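
One way to compute such a statistic is `scipy.stats.weightedtau`, applied to a vector of feature importances and the inverse rank of the principal components by explained variance. The sketch below assumes this pairing; the helper name `weighted_tau` and all input values are hypothetical and only illustrate the mechanics.

```python
import numpy as np
from scipy.stats import rankdata, weightedtau


def weighted_tau(importance, explained_variance):
    """Weighted Kendall tau between importances and inverse PCA rank."""
    importance = np.asarray(importance, dtype=float)
    # Rank 1 goes to the component that explains the most variance.
    pc_rank = rankdata(-np.asarray(explained_variance, dtype=float))
    # Index 0 of the result is the correlation statistic.
    return weightedtau(importance, 1.0 / pc_rank)[0]


# Hypothetical importances and explained variances of four components.
importance = [0.55, 0.33, 0.07, 0.05]
explained_variance = [0.50, 0.30, 0.08, 0.12]
tau = weighted_tau(importance, explained_variance)

# When both rankings agree exactly, the statistic equals 1.
agree = weighted_tau([0.4, 0.3, 0.2, 0.1], [4.0, 3.0, 2.0, 1.0])

print(f"tau = {tau:.3f}, perfect agreement = {agree:.3f}")
```

The weighting puts more emphasis on agreement among the top-ranked items, so a disagreement between the two least important components lowers the statistic only slightly.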

Next we train regression models on the reduced feature space. These models are then used within the fingerprint method of Li, Turkington and Yazdani to study the linear and non-linear effects present in the resulting models. This method helps us eliminate even more redundant features by allowing us to keep only those features responsible for the majority of the linear, non-linear and interaction effects.
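
To make the linear and non-linear effects more concrete, the sketch below computes them from a one-dimensional partial dependence function: the linear effect is the mean absolute deviation of a straight-line fit around the average partial dependence, and the non-linear effect is the mean absolute deviation of the partial dependence from that line. This is a simplified reading of the fingerprint idea on synthetic data. The function names, the quantile grid and the model settings are assumptions made for illustration, and pairwise interaction effects are not covered.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor


def partial_dependence_1d(model, X, k, grid):
    """Average prediction when feature k is forced to each grid value."""
    pd_values = np.empty(len(grid))
    for i, v in enumerate(grid):
        X_mod = X.copy()
        X_mod[:, k] = v
        pd_values[i] = model.predict(X_mod).mean()
    return pd_values


def fingerprint_effects(model, X, k, n_grid=20):
    """Return (linear effect, non-linear effect) of feature k."""
    # Evaluate on a quantile grid to avoid sparse extremes (an assumption).
    grid = np.quantile(X[:, k], np.linspace(0.05, 0.95, n_grid))
    pd_values = partial_dependence_1d(model, X, k, grid)
    # Least-squares line through the partial dependence curve.
    slope, intercept = np.polyfit(grid, pd_values, deg=1)
    linear_fit = intercept + slope * grid
    linear_effect = np.mean(np.abs(linear_fit - pd_values.mean()))
    nonlinear_effect = np.mean(np.abs(pd_values - linear_fit))
    return linear_effect, nonlinear_effect


rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3))
# Feature 0 acts linearly, feature 1 quadratically, feature 2 not at all.
y = 2.0 * X[:, 0] + X[:, 1] ** 2 + rng.normal(scale=0.1, size=500)
model = RandomForestRegressor(n_estimators=100, random_state=0).fit(X, y)

effects = {k: fingerprint_effects(model, X, k) for k in range(3)}
for k, (lin, nonlin) in effects.items():
    print(f"feature {k}: linear={lin:.3f}, non-linear={nonlin:.3f}")
```

A feature whose linear and non-linear effects are both small, such as feature 2 in this toy example, is a candidate for removal.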

Finally, after the main features have been extracted we train regression models on the chosen features and make these results available in Shiny.

F

F - Feature Importance

The tables below show the weighted tau values of the different feature importance techniques, first for the ratio and then for the spread.

code	method	weighted_tau	type
F	MDA	0.692	ratio
F	CFI	0.171	ratio
F	SHAP	0.036	ratio
F	SFI	-0.016	ratio
F	MDI	-0.507	ratio

code	method	weighted_tau	type
F	SFI	0.545	spread
F	MDA	0.352	spread
F	SHAP	0.070	spread
F	MDI	-0.042	spread
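
Selecting the top method with a positive weighted tau for each target is a simple filter and group-by. The sketch below uses pandas and the F values from the two tables above; the variable names are arbitrary.

```python
import pandas as pd

# Weighted tau values for F, copied from the two tables above.
tau = pd.DataFrame({
    "method": ["MDA", "CFI", "SHAP", "SFI", "MDI", "SFI", "MDA", "SHAP", "MDI"],
    "weighted_tau": [0.692, 0.171, 0.036, -0.016, -0.507,
                     0.545, 0.352, 0.070, -0.042],
    "type": ["ratio"] * 5 + ["spread"] * 4,
})

# Keep methods with a positive weighted tau, then take the best per type.
positive = tau[tau["weighted_tau"] > 0]
best = positive.loc[positive.groupby("type")["weighted_tau"].idxmax()]

print(best)
```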

Below we show the feature importances of the top positive weighted tau methods.

F - Fingerprint Method

The plot below shows the normalised linear and non-linear effects in a single graph for easy comparison. Notice the substantial contribution from the non-linear effects.

The plot below shows the linear and non-linear effects as captured by the fingerprint method. These effects can be measured and compared by the shaded areas in the facet plot below: the greater the shaded area, the greater the effect. The linear fit inside each facet captures the movement in the price when the feature values are changed. Elbows in the data are interesting: they show that the price varies greatly on either side of certain threshold values of the underlying feature.

F - Model Results

The table below shows the model performance. The Random Forest scores higher than both linear models in sample and, more importantly, out of sample, for the ratio as well as the spread. A possible reason is that the Random Forest is better able to capture the non-linear and interaction effects that might be present in the features.

type	sample	Lasso Regression	Linear Regression	Random Forest
ratio	in sample	0.57	0.59	0.74
ratio	oob	NA	NA	0.68
ratio	out of sample	0.52	0.55	0.61
spread	in sample	0.53	0.53	0.84
spread	oob	NA	NA	0.63
spread	out of sample	0.50	0.50	0.70
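
The three kinds of rows in the table above (in sample, oob, out of sample) can be produced along the following lines. This is a sketch on synthetic data and not the pipeline behind the reported numbers: it assumes the scores are R² values as returned by scikit-learn's `score` method, and the Lasso penalty, forest size and train/test split are arbitrary choices.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import Lasso, LinearRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(800, 4))
# Synthetic target with a linear, a non-linear and an interaction term.
y = (X[:, 0] + X[:, 1] ** 2 + 0.5 * X[:, 2] * X[:, 3]
     + rng.normal(scale=0.2, size=800))

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

models = {
    "Lasso Regression": Lasso(alpha=0.01),
    "Linear Regression": LinearRegression(),
    # oob_score=True makes the forest report an out-of-bag R^2.
    "Random Forest": RandomForestRegressor(
        n_estimators=200, oob_score=True, random_state=0
    ),
}

scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    scores[name] = {
        "in sample": model.score(X_train, y_train),
        # Only the Random Forest has an out-of-bag score.
        "oob": getattr(model, "oob_score_", float("nan")),
        "out of sample": model.score(X_test, y_test),
    }

for name, row in scores.items():
    print(name, {k: round(v, 2) for k, v in row.items()})
```

The out-of-bag score is available only for the forest, because each tree is fitted on a bootstrap sample and can be evaluated on the observations it did not see. This matches the NA entries for the linear models in the table.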

H

H - Feature Importance

The tables below show the weighted tau values of the different feature importance techniques, first for the ratio and then for the spread.

code	method	weighted_tau	type
H	MDA	0.519	ratio
H	SHAP	0.401	ratio
H	SFI	0.383	ratio
H	CFI	0.368	ratio
H	MDI	-0.228	ratio

code	method	weighted_tau	type
H	SFI	0.296	spread
H	SHAP	0.281	spread
H	MDA	-0.233	spread
H	MDI	-0.267	spread

Below we show the feature importances of the top positive weighted tau methods.

H - Fingerprint Method

The plot below shows the normalised linear and non-linear effects in a single graph for easy comparison. Notice the substantial contribution from the non-linear effects.

The plot below shows the linear and non-linear effects as captured by the fingerprint method. These effects can be measured and compared by the shaded areas in the facet plot below: the greater the shaded area, the greater the effect. The linear fit inside each facet captures the movement in the price when the feature values are changed. Elbows in the data are interesting: they show that the price varies greatly on either side of certain threshold values of the underlying feature.

H - Model Results

The table below shows the model performance. The Random Forest scores higher than both linear models in sample and out of sample, for the ratio as well as the spread, although for the ratio the out-of-sample margin is narrow (0.66 against 0.65) and the out-of-bag score is negative (-0.15). A possible reason for the better scores is that the Random Forest is better able to capture the non-linear and interaction effects that might be present in the features.

type	sample	Lasso Regression	Linear Regression	Random Forest
ratio	in sample	0.55	0.65	0.78
ratio	oob	NA	NA	-0.15
ratio	out of sample	0.55	0.65	0.66
spread	in sample	0.54	0.55	0.91
spread	oob	NA	NA	0.47
spread	out of sample	0.46	0.46	0.79

K

K - Feature Importance

The tables below show the weighted tau values of the different feature importance techniques, first for the ratio and then for the spread.

code	method	weighted_tau	type
K	CFI	0.171	ratio
K	SFI	0.147	ratio
K	MDA	0.018	ratio
K	SHAP	-0.121	ratio
K	MDI	-0.522	ratio

code	method	weighted_tau	type
K	SFI	0.350	spread
K	SHAP	0.181	spread
K	MDI	0.013	spread
K	MDA	-0.188	spread

Below we show the feature importances of the top positive weighted tau methods.

K - Fingerprint Method

The plot below shows the normalised linear and non-linear effects in a single graph for easy comparison. Notice the substantial contribution from the non-linear effects.

The plot below shows the linear and non-linear effects as captured by the fingerprint method. These effects can be measured and compared by the shaded areas in the facet plot below: the greater the shaded area, the greater the effect. The linear fit inside each facet captures the movement in the price when the feature values are changed. Elbows in the data are interesting: they show that the price varies greatly on either side of certain threshold values of the underlying feature.

K - Model Results

The table below shows the model performance. The Random Forest scores higher than both linear models in sample and, more importantly, out of sample, for the ratio as well as the spread. A possible reason is that the Random Forest is better able to capture the non-linear and interaction effects that might be present in the features.

type	sample	Lasso Regression	Linear Regression	Random Forest
ratio	in sample	0.59	0.62	0.92
ratio	oob	NA	NA	0.73
ratio	out of sample	0.50	0.52	0.63
spread	in sample	0.54	0.54	0.95
spread	oob	NA	NA	0.78
spread	out of sample	0.41	0.41	0.78

N

N - Feature Importance

The tables below show the weighted tau values of the different feature importance techniques, first for the ratio and then for the spread.

code	method	weighted_tau	type
N	CFI	0.336	ratio
N	MDA	0.242	ratio
N	SFI	0.118	ratio
N	MDI	0.117	ratio
N	SHAP	0.032	ratio

code	method	weighted_tau	type
N	SFI	0.198	spread
N	SHAP	0.142	spread
N	MDI	-0.119	spread
N	MDA	-0.156	spread

Below we show the feature importances of the top positive weighted tau methods.

N - Fingerprint Method

The plot below shows the normalised linear and non-linear effects in a single graph for easy comparison. Notice the substantial contribution from the non-linear effects.

The plot below shows the linear and non-linear effects as captured by the fingerprint method. These effects can be measured and compared by the shaded areas in the facet plot below: the greater the shaded area, the greater the effect. The linear fit inside each facet captures the movement in the price when the feature values are changed. Elbows in the data are interesting: they show that the price varies greatly on either side of certain threshold values of the underlying feature.

N - Model Results

The table below shows the model performance. The Random Forest scores higher than both linear models in sample and, more importantly, out of sample, for the ratio as well as the spread. A possible reason is that the Random Forest is better able to capture the non-linear and interaction effects that might be present in the features.

type	sample	Lasso Regression	Linear Regression	Random Forest
ratio	in sample	0.55	0.58	0.91
ratio	oob	NA	NA	0.71
ratio	out of sample	0.46	0.48	0.56
spread	in sample	0.63	0.63	0.93
spread	oob	NA	NA	0.67
spread	out of sample	0.39	0.39	0.71

U

U - Feature Importance

The tables below show the weighted tau values of the different feature importance techniques, first for the ratio and then for the spread.

code	method	weighted_tau	type
U	SHAP	0.461	ratio
U	CFI	0.255	ratio
U	MDA	0.131	ratio
U	SFI	0.004	ratio
U	MDI	-0.071	ratio

code	method	weighted_tau	type
U	MDI	0.214	spread
U	SFI	0.175	spread
U	MDA	0.136	spread
U	SHAP	0.123	spread

Below we show the feature importances of the top positive weighted tau methods.

U - Fingerprint Method

The plot below shows the normalised linear and non-linear effects in a single graph for easy comparison. Notice the substantial contribution from the non-linear effects.

The plot below shows the linear and non-linear effects as captured by the fingerprint method. These effects can be measured and compared by the shaded areas in the facet plot below: the greater the shaded area, the greater the effect. The linear fit inside each facet captures the movement in the price when the feature values are changed. Elbows in the data are interesting: they show that the price varies greatly on either side of certain threshold values of the underlying feature.

U - Model Results

The table below shows the model performance. The Random Forest scores higher than both linear models in sample and out of sample, for the ratio as well as the spread, although for the spread the out-of-sample margin is narrow (0.47 against 0.44). A possible reason for the better scores is that the Random Forest is better able to capture the non-linear and interaction effects that might be present in the features.

type	sample	Lasso Regression	Linear Regression	Random Forest
ratio	in sample	0.57	0.60	0.92
ratio	oob	NA	NA	0.74
ratio	out of sample	0.54	0.60	0.71
spread	in sample	0.59	0.59	0.89
spread	oob	NA	NA	0.80
spread	out of sample	0.44	0.44	0.47

X

X - Feature Importance

The tables below show the weighted tau values of the different feature importance techniques, first for the ratio and then for the spread.

code	method	weighted_tau	type
X	MDA	0.329	ratio
X	CFI	0.171	ratio
X	SHAP	0.035	ratio
X	SFI	-0.034	ratio
X	MDI	-0.263	ratio

code	method	weighted_tau	type
X	SFI	0.198	spread
X	SHAP	0.184	spread
X	MDI	0.033	spread
X	MDA	-0.006	spread

Below we show the feature importances of the top positive weighted tau methods.

X - Fingerprint Method

The plot below shows the normalised linear and non-linear effects in a single graph for easy comparison. Notice the substantial contribution from the non-linear effects.

The plot below shows the linear and non-linear effects as captured by the fingerprint method. These effects can be measured and compared by the shaded areas in the facet plot below: the greater the shaded area, the greater the effect. The linear fit inside each facet captures the movement in the price when the feature values are changed. Elbows in the data are interesting: they show that the price varies greatly on either side of certain threshold values of the underlying feature.

X - Model Results

The table below shows the model performance. The Random Forest scores higher than both linear models in sample and, more importantly, out of sample, for the ratio as well as the spread. A possible reason is that the Random Forest is better able to capture the non-linear and interaction effects that might be present in the features.

type	sample	Lasso Regression	Linear Regression	Random Forest
ratio	in sample	0.57	0.60	0.92
ratio	oob	NA	NA	0.76
ratio	out of sample	0.55	0.56	0.70
spread	in sample	0.52	0.52	0.84
spread	oob	NA	NA	0.59
spread	out of sample	0.51	0.52	0.56

Z

Z - Feature Importance

The tables below show the weighted tau values of the different feature importance techniques, first for the ratio and then for the spread.

code	method	weighted_tau	type
Z	CFI	0.255	ratio
Z	SHAP	0.208	ratio
Z	MDA	0.181	ratio
Z	SFI	0.173	ratio
Z	MDI	-0.083	ratio

code	method	weighted_tau	type
Z	MDA	0.428	spread
Z	SFI	0.347	spread
Z	SHAP	0.117	spread
Z	MDI	-0.006	spread

Below we show the feature importances of the top positive weighted tau methods.

Z - Fingerprint Method

The plot below shows the normalised linear and non-linear effects in a single graph for easy comparison. Notice the substantial contribution from the non-linear effects.

The plot below shows the linear and non-linear effects as captured by the fingerprint method. These effects can be measured and compared by the shaded areas in the facet plot below: the greater the shaded area, the greater the effect. The linear fit inside each facet captures the movement in the price when the feature values are changed. Elbows in the data are interesting: they show that the price varies greatly on either side of certain threshold values of the underlying feature.

Z - Model Results

The table below shows the model performance. The Random Forest scores higher than both linear models in sample and, more importantly, out of sample, for the ratio as well as the spread. A possible reason is that the Random Forest is better able to capture the non-linear and interaction effects that might be present in the features.

type	sample	Lasso Regression	Linear Regression	Random Forest
ratio	in sample	0.59	0.64	0.96
ratio	oob	NA	NA	0.84
ratio	out of sample	0.53	0.61	0.90
spread	in sample	0.63	0.63	0.92
spread	oob	NA	NA	0.60
spread	out of sample	0.59	0.58	0.82

Conclusion

Throughout, the Random Forest models perform better than their linear counterparts.
These models have been added to Shiny.

Coffee Arbitrage Feature Importance

Mauritz van den Worm

Portfolio Manager and Quantitative Researcher
