Introduction

This post is a rehash of a previous post with the same title except for the 2.0.

For each contract code we start with a random forest model that tries to forecast the price, with input features consisting of the stock-to-usage numbers of

Argentina
Brazil
China
Ukraine
United States
World
World without China

for both corn and soybeans as well as the number of days the contract has to expiry. Feature importance is assessed with classification models in which we bin the prices into deciles. The model is then trained to predict the price decile. We use six techniques to compare the feature importance of the trained classifier:

MDI - Mean Decrease Impurity
MDA - Mean Decrease Accuracy
SFI - Single Feature Importance
CFI - Clustered Feature Importance
SHAP - Shapley Feature Importance
PCA - Principal Component Analysis
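The decile binning and the first two techniques in the list can be sketched with scikit-learn. This is only an illustration under stated assumptions: the data are synthetic, the column names are borrowed from later sections, and MDA is approximated by permutation importance on a held-out set scored by accuracy, which may differ from the exact procedure used for the results in this post.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: column names mirror the post, values are made up.
rng = np.random.default_rng(0)
n = 500
X = pd.DataFrame({
    "C_unitedstates": rng.uniform(5, 20, n),
    "C_world": rng.uniform(10, 30, n),
    "daysdiff": rng.integers(1, 365, n).astype(float),
})
price = 600 - 12 * X["C_unitedstates"] - 3 * X["C_world"] + rng.normal(0, 10, n)

# Bin the price into deciles; the labels 0..9 are the classification targets.
decile = pd.qcut(price, q=10, labels=False)

X_train, X_test, y_train, y_test = train_test_split(
    X, decile, test_size=0.25, random_state=0
)
clf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_train, y_train)

# MDI: impurity-based importance, computed in sample from the fitted trees.
mdi = pd.Series(clf.feature_importances_, index=X.columns)

# MDA (approximation): drop in held-out accuracy when one feature is shuffled.
perm = permutation_importance(
    clf, X_test, y_test, scoring="accuracy", n_repeats=10, random_state=0
)
mda = pd.Series(perm.importances_mean, index=X.columns)

print(pd.DataFrame({"MDI": mdi, "MDA": mda}).sort_values("MDI", ascending=False))
```

MDI is computed from the training data only, so it can reward features the trees merely overfit to, whereas MDA is measured on unseen data. That difference is one reason the methods can rank features differently.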

The PCA method is used to calculate the weighted tau statistic. The idea is to see how correlated the principal components and the chosen features are. The higher the weighted tau number the better.
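As a rough illustration of the statistic itself, the sketch below uses scipy.stats.weightedtau to compare an importance ranking with a PCA ranking. All numbers are hypothetical, and pairing importances with explained variance is one common construction assumed here, not necessarily the exact one behind the tables in this post.

```python
import numpy as np
from scipy.stats import weightedtau

# Hypothetical share of variance explained by five principal components.
explained_variance = np.array([0.45, 0.25, 0.15, 0.10, 0.05])

# Hypothetical importances assigned to the same five components by two methods.
importance_a = np.array([0.40, 0.30, 0.15, 0.10, 0.05])  # agrees with the PCA ordering
importance_b = np.array([0.05, 0.10, 0.15, 0.30, 0.40])  # reverses the PCA ordering

# weightedtau returns (statistic, pvalue); index 0 is the correlation.
tau_a = weightedtau(importance_a, explained_variance)[0]
tau_b = weightedtau(importance_b, explained_variance)[0]

print(f"method a: {tau_a:.3f}, method b: {tau_b:.3f}")
```

The weighting gives more influence to agreement among the highest-ranked items, so a method that gets the top components right scores well even if it shuffles the tail.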

Next we train regression models on the reduced feature space. These models are then used within the fingerprint method of Li, Turkington and Yazdani to study the linear and non-linear effects present in the resulting models. This method helps us eliminate even more redundant features by allowing us to keep only those features responsible for the majority of the linear, non-linear and interaction effects present within the features.
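A minimal sketch of the linear and non-linear parts of the fingerprint idea follows. It assumes the usual decomposition: compute the partial dependence of the prediction on one feature, fit a straight line to it, and measure the mean absolute size of the line (linear effect) and of the remainder (non-linear effect). The helper names, the quantile grid and the synthetic data are illustrative choices, and the interaction effect is left out for brevity.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

def partial_dependence_1d(model, X, k, grid):
    """Average prediction when column k is forced to each grid value."""
    values = np.empty(len(grid))
    for i, v in enumerate(grid):
        X_mod = X.copy()
        X_mod[:, k] = v
        values[i] = model.predict(X_mod).mean()
    return values

def fingerprint(model, X, k, n_grid=25):
    """Return (linear effect, non-linear effect) of feature k."""
    # Quantile grid, so the grid points are spread like the observed data.
    grid = np.quantile(X[:, k], np.linspace(0.02, 0.98, n_grid))
    pdp = partial_dependence_1d(model, X, k, grid)
    slope, intercept = np.polyfit(grid, pdp, 1)
    linear_fit = slope * grid + intercept
    linear = np.mean(np.abs(linear_fit - linear_fit.mean()))
    nonlinear = np.mean(np.abs(pdp - linear_fit))
    return linear, nonlinear

# Synthetic check: feature 0 acts linearly, feature 1 acts quadratically.
rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(400, 2))
y = 2 * X[:, 0] + 3 * X[:, 1] ** 2 + rng.normal(0, 0.1, 400)
model = RandomForestRegressor(n_estimators=50, random_state=0).fit(X, y)

lin0, non0 = fingerprint(model, X, 0)
lin1, non1 = fingerprint(model, X, 1)
print(f"feature 0: linear={lin0:.2f}, non-linear={non0:.2f}")
print(f"feature 1: linear={lin1:.2f}, non-linear={non1:.2f}")
```

A feature whose linear and non-linear effects are both close to zero contributes little to the forecasts, which is the argument used in the sections below for dropping features.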

Finally, after the main features have been extracted, we train classification and regression models on the chosen features. We do this to give two different but related points of view. From the classification models we can determine the probability of the spread being in a particular decile, and then study how the probabilities change as the input features change. From the regression models we use all of the trees to produce regression statistics. Here we are particularly interested in the 25th to 75th percentile of the regression forecasts.
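The regression statistics can be illustrated by collecting one prediction per tree from a fitted scikit-learn forest. The data and the "latest" input row below are made up, so this is a sketch of the mechanics only.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Synthetic stand-in for stock-to-usage style inputs and a price target.
rng = np.random.default_rng(1)
X = rng.uniform(5, 30, size=(300, 3))
y = 600 - 8 * X[:, 0] - 2 * X[:, 1] + rng.normal(0, 15, 300)
reg = RandomForestRegressor(n_estimators=300, random_state=1).fit(X, y)

# Hypothetical latest fundamental data.
x_latest = np.array([[12.0, 18.0, 20.0]])

# One forecast per tree, then summary statistics across the trees.
tree_preds = np.array([tree.predict(x_latest)[0] for tree in reg.estimators_])
p25, med, p75 = np.percentile(tree_preds, [25, 50, 75])
avg = tree_preds.mean()

print(f"p25={p25:.2f} med={med:.2f} avg={avg:.2f} p75={p75:.2f}")
```

The average across the trees is what the forest returns from predict, while the spread between p25 and p75 gives a rough sense of how much the trees disagree for that input.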

In each of the sections that follow we consider a single Corn contract.

This is a technical piece, but it lets us look under the hood of the black box and gives a better understanding of how the model will perform under different circumstances.

H

H - Feature Importance

The table below shows the weighted tau values of the different feature importance techniques employed.

code	method	weighted_tau
H	SHAP	0.567
H	MDI	0.175
H	CFI	0.090
H	MDA	-0.153
H	SFI	-0.277

Below we show the feature importances of the top three positive weighted tau methods.

From the above the model features reduce to

C_unitedstates
C_world
C_worldnochina
S_world
S_unitedstates
daysdiff
crude
S_china

H - Fingerprint Method

The plot below shows the normalised linear and non-linear effects in a single graph for easy comparison. Notice the substantial contribution from the non-linear effects.

From the plot above notice that the linear and non-linear effects attributable to

S_brazil
S_worldnochina
S_world

are very small. In the following we remove these features from the analysis.

The plot below shows the linear and non-linear effects as captured by the fingerprint method. These effects can be measured and compared by the shaded areas in the facet plot below. The greater the shaded area the greater the effect. The linear fit inside each facet captures the movement in the price when the feature values are changed. Elbows in the data are interesting: they mark threshold values of the underlying feature with great variability on either side of the threshold. An example of this below is in the C_world feature at a stock-to-usage of around 13%.

H - Model Results

The tables below show the model performance data. Note the out of sample accuracy is improved upon by the Random Forest model.

sample	type	Logistic Regression	Random Forest
in sample	classification	0.44	1.00
oob	classification	NA	0.45
out of sample	classification	0.19	0.38

The out of sample regression results are also much better than those of the linear models. The Lasso and linear regression results are identical, while the Random Forest model outperforms both on unseen data (0.62 versus 0.46). A reason for this may be that the Random Forest model is better able to capture the non-linear and interaction effects that might be present in the features.

sample	type	Lasso Regression	Linear Regression	Random Forest
in sample	regression	0.78	0.78	0.97
oob	regression	NA	NA	0.83
out of sample	regression	0.46	0.46	0.62
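The three rows of the tables above (in sample, oob, out of sample) can be produced for any scikit-learn random forest along the following lines. This is a sketch on synthetic data with two assumptions that the post does not confirm: the regression score is taken to be R squared, and the hold-out set comes from a random split rather than a chronological one.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic target with a quadratic term and an interaction term.
rng = np.random.default_rng(2)
X = rng.uniform(0, 1, size=(600, 4))
y = 20 * (X[:, 0] - 0.5) ** 2 + 4 * X[:, 1] * X[:, 2] + rng.normal(0, 0.1, 600)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=2)

lin = LinearRegression().fit(X_tr, y_tr)
# oob_score=True scores each training row using only trees that did not see it.
rf = RandomForestRegressor(n_estimators=300, oob_score=True, random_state=2)
rf.fit(X_tr, y_tr)

scores = {
    "in sample": (lin.score(X_tr, y_tr), rf.score(X_tr, y_tr)),
    "oob": (float("nan"), rf.oob_score_),
    "out of sample": (lin.score(X_te, y_te), rf.score(X_te, y_te)),
}
for sample, (linear_r2, forest_r2) in scores.items():
    print(f"{sample:14s} linear={linear_r2:.2f} forest={forest_r2:.2f}")
```

The in-sample forest score is close to perfect almost by construction, so the oob and out of sample rows are the ones worth comparing against the linear models.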

The plot below shows the probability associated with each of the spread deciles using the latest fundamental data as input parameters.
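The decile probabilities can be read from a fitted classifier with predict_proba, and their sensitivity to the inputs can be studied by changing one feature and predicting again. The sketch below uses synthetic data and hypothetical input rows in which lower stocks imply a higher price.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Synthetic data: two made-up stock-to-usage features driving the price.
rng = np.random.default_rng(3)
X = rng.uniform(5, 30, size=(500, 2))
price = 600 - 10 * X[:, 0] - 3 * X[:, 1] + rng.normal(0, 10, 500)
decile = pd.qcut(price, q=10, labels=False)
clf = RandomForestClassifier(n_estimators=300, random_state=3).fit(X, decile)

# Hypothetical latest inputs, and the same inputs with tighter stocks.
base = np.array([[18.0, 15.0]])
tight = np.array([[8.0, 15.0]])

proba_base = clf.predict_proba(base)[0]
proba_tight = clf.predict_proba(tight)[0]

# Probability-weighted decile as a one-number summary of each distribution.
expected_base = float(proba_base @ clf.classes_)
expected_tight = float(proba_tight @ clf.classes_)

for d, (pb, pt) in enumerate(zip(proba_base, proba_tight)):
    print(f"decile {d}: base={pb:.2f} tight={pt:.2f}")
print(f"expected decile: base={expected_base:.2f} tight={expected_tight:.2f}")
```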

The table below shows the regression model results.

code	p25	med	avg	p75	expiry	price
H	389.23	397.83	401.15	404.51	18698	345.25

K

K - Feature Importance

The table below shows the weighted tau values of the different feature importance techniques employed.

code	method	weighted_tau
K	SHAP	0.632
K	MDI	0.265
K	CFI	0.202
K	MDA	0.196
K	SFI	-0.242

Below we show the feature importances of the top three positive weighted tau methods.

From the above the model features then become

C_argentina
C_china
C_unitedstates
C_world
C_worldnochina
S_argentina
S_china
S_unitedstates
S_world
S_worldnochina
crude
daysdiff
dollarindex
ruble

K - Fingerprint Method

The plot below shows the normalised linear and non-linear effects in a single graph for easy comparison. Notice the substantial contribution from the non-linear effects. From the plot below it is clear that the most important features are

C_unitedstates
C_world
C_worldnochina
S_china
S_unitedstates
crude
daysdiff

The plot below shows the linear and non-linear effects as captured by the fingerprint method. These effects can be measured and compared by the shaded areas in the facet plot below. The greater the shaded area the greater the effect. The linear fit inside each facet captures the movement in the price when the feature values are changed. Below we only show the data for the top features according to the fingerprint method.

K - Model Results

The tables below show the model performance data. Note the out of sample accuracy is improved upon by the Random Forest model. The out of sample regression results are also much better than the linear models.

sample	type	Logistic Regression	Random Forest
in sample	classification	0.44	1.00
oob	classification	NA	0.42
out of sample	classification	0.15	0.38

sample	type	Lasso Regression	Linear Regression	Random Forest
in sample	regression	0.82	0.82	0.98
oob	regression	NA	NA	0.85
out of sample	regression	0.61	0.61	0.70

The plot below shows the probability associated with each of the spread deciles using the latest fundamental data as input parameters.

The table below shows the regression model results.

code	p25	med	avg	p75	expiry	price
K	402.96	409.42	418.43	431.73	18761	352.5

N

N - Feature Importance

The table below shows the weighted tau values of the different feature importance techniques employed.

code	method	weighted_tau
N	CFI	0.090
N	SFI	-0.073
N	MDI	-0.114
N	MDA	-0.170
N	SHAP	-0.184

Below we show the feature importances of the top three positive weighted tau methods.

Since the weighted tau of CFI is much greater than the other methods we consider the most important features to be formed by the intersection of the top features in these two methods. From the above the model features then become

dollarindex
ruble
C_argentina
C_china
C_unitedstates
C_world
C_worldnochina
S_argentina
S_unitedstates
S_world
S_worldnochina

N - Fingerprint Method

The plot below shows the normalised linear and non-linear effects in a single graph for easy comparison. Notice the substantial contribution from the non-linear effects.

Notice that the effects attributable to

C_world
C_worldnochina
C_unitedstates
S_unitedstates
ruble

are the main contributing features. It makes sense to retrain the model on these features alone.

The plot below shows the linear and non-linear effects as captured by the fingerprint method. These effects can be measured and compared by the shaded areas in the facet plot below. The greater the shaded area the greater the effect. The linear fit inside each facet captures the movement in the price when the feature values are changed. Here we have only shown the most dominant features.

N - Model Results

The tables below show the model performance data. Note the out of sample accuracy is improved upon by the Random Forest model. The out of sample regression results are also much better than the linear models.

sample	type	Logistic Regression	Random Forest
in sample	classification	0.31	1.00
oob	classification	NA	0.36
out of sample	classification	0.08	0.23

sample	type	Lasso Regression	Linear Regression	Random Forest
in sample	regression	0.80	0.80	0.96
oob	regression	NA	NA	0.86
out of sample	regression	0.62	0.62	0.77

The plot below shows the probability associated with each of the spread deciles using the latest fundamental data as input parameters.

The table below shows the regression model results.

code	p25	med	avg	p75	expiry	price
N	382.49	385.5	386.94	390.94	18457	317.25

U

U - Feature Importance

The table below shows the weighted tau values of the different feature importance techniques employed.

code	method	weighted_tau
U	CFI	0.090
U	MDA	-0.077
U	MDI	-0.214
U	SHAP	-0.354
U	SFI	-0.367

Below we show the feature importances of the top three positive weighted tau methods.

Since the weighted tau of CFI is much greater than the other methods we consider the most important features to be formed by the intersection of the top features in these two methods. From the above the model features then become

dollarindex
ruble
C_argentina
C_china
C_unitedstates
C_world
C_worldnochina
S_argentina
S_unitedstates
S_world
S_worldnochina

U - Fingerprint Method

The plot below shows the normalised linear and non-linear effects in a single graph for easy comparison. Notice the substantial contribution from the non-linear effects.

Notice that the effects attributable to

C_unitedstates
C_world
S_unitedstates
C_worldnochina
C_argentina
dollarindex

are the main contributing features. It makes sense to retrain the model on these features alone.

The plot below shows the linear and non-linear effects as captured by the fingerprint method. These effects can be measured and compared by the shaded areas in the facet plot below. The greater the shaded area the greater the effect. The linear fit inside each facet captures the movement in the price when the feature values are changed. Here we only show the data for the above features.

U - Model Results

The tables below show the model performance data. Note the out of sample accuracy is improved upon by the Random Forest model. The out of sample regression results are also much better than the linear models.

sample	type	Logistic Regression	Random Forest
in sample	classification	0.31	0.97
oob	classification	NA	0.31
out of sample	classification	0.12	0.42

sample	type	Lasso Regression	Linear Regression	Random Forest
in sample	regression	0.68	0.68	0.96
oob	regression	NA	NA	0.85
out of sample	regression	0.64	0.64	0.87

The plot below shows the probability associated with each of the spread deciles using the latest fundamental data as input parameters.

The table below shows the regression model results.

code	p25	med	avg	p75	expiry	price
U	383.89	387.97	388.41	393.27	18519	321.5

Z

Z - Feature Importance

The table below shows the weighted tau values of the different feature importance techniques employed.

code	method	weighted_tau
Z	SHAP	0.362
Z	MDA	0.300
Z	CFI	0.200
Z	MDI	-0.027
Z	SFI	-0.230

Below we show the feature importances of the top three positive weighted tau methods.

From the above the model features then become

C_argentina
C_china
C_russia
C_ukraine
C_unitedstates
C_world
C_worldnochina
S_argentina
S_brazil
S_china
S_russia
S_unitedstates
S_world
S_worldnochina
crude
daysdiff
dollarindex
ruble

Z - Fingerprint Method

The plot below shows the normalised linear and non-linear effects in a single graph for easy comparison. Notice the substantial contribution from the non-linear effects.

The plot below shows the linear and non-linear effects as captured by the fingerprint method. These effects can be measured and compared by the shaded areas in the facet plot below. The greater the shaded area the greater the effect. The linear fit inside each facet captures the movement in the price when the feature values are changed. Below we only show results for the top features.

Z - Model Results

The tables below show the model performance data. Note the out of sample accuracy is improved upon by the Random Forest model. The out of sample regression results are also much better than the linear models.

sample	type	Logistic Regression	Random Forest
in sample	classification	0.44	1.00
oob	classification	NA	0.44
out of sample	classification	0.38	0.58

sample	type	Lasso Regression	Linear Regression	Random Forest
in sample	regression	0.69	0.69	0.97
oob	regression	NA	NA	0.79
out of sample	regression	0.69	0.69	0.83

The plot below shows the probability associated with each of the spread deciles using the latest fundamental data as input parameters.

The table below shows the regression model results.

code	p25	med	avg	p75	expiry	price
Z	388.75	392.69	393.38	396.05	18610	332

Grouped Forecasts

In the image below we show the forecasted price interval for each of the contract codes. The solid and dashed black lines represent the median and average values of the model forecasts. The blue shaded region shows the 25th to 75th percentile of the model forecasts. The current prices are given by the red line.

Note that these forecasts should not be used for dealing with calendar spreads. The process used to model the calendar spreads is different, and more reliable, since calendar spreads have a much more well defined range.

From the model predictions above we can see that corn is undervalued all along the futures curve. Naturally these models have not seen the COVID-19 virus and negative oil prices before, and they might not generalise well under these conditions.
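The size of the gap can be checked directly from the regression result tables in the sections above. The short script below uses only the numbers reported there and compares the current price of each contract with the 25th percentile and the median of the model forecasts.

```python
# Values copied from the regression result tables in this post.
forecasts = {
    "H": {"p25": 389.23, "med": 397.83, "avg": 401.15, "p75": 404.51, "price": 345.25},
    "K": {"p25": 402.96, "med": 409.42, "avg": 418.43, "p75": 431.73, "price": 352.5},
    "N": {"p25": 382.49, "med": 385.5, "avg": 386.94, "p75": 390.94, "price": 317.25},
    "U": {"p25": 383.89, "med": 387.97, "avg": 388.41, "p75": 393.27, "price": 321.5},
    "Z": {"p25": 388.75, "med": 392.69, "avg": 393.38, "p75": 396.05, "price": 332.0},
}

for code, f in forecasts.items():
    # Relative distance from the current price up to the median forecast.
    gap = f["med"] / f["price"] - 1
    below_band = f["price"] < f["p25"]
    print(f"{code}: median forecast {gap:.1%} above price, price below p25: {below_band}")
```

For every contract the current price sits below the 25th percentile of the forecasts, which is what the statement about undervaluation rests on.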

Conclusion

This post investigates which fundamental features are the main drivers in determining the price of corn. The features we investigated consist of the stock-to-usage numbers of

Argentina
Brazil
China
Ukraine
United States
World
World without China

for both corn and soybeans as well as the number of days the contract has to expiry. We also include energy and dollar proxies with the average value of crude in the month prior and the value of the dollar index.

We made use of modern methods of feature importance to determine the main driving features within each of the models. These methods include

MDI - Mean Decrease Impurity
MDA - Mean Decrease Accuracy
SFI - Single Feature Importance
CFI - Clustered Feature Importance
SHAP - Shapley Feature Importance
PCA - Principal Component Analysis
Weighted tau
Fingerprint method

In nearly all the cases shown we see a marked improvement in the out of sample test results compared to the standard linear models that are widely used in the literature as well as by industry practitioners. The reason for this is that the simple machine learning model was able to capture not only the linear but also the non-linear and interaction effects present in the features.

Corn price vs Stock-to-Usage 2.0


Mauritz van den Worm

Portfolio Manager and Quantitative Researcher
