Corn price vs Stock-to-Usage 2.0

Introduction

This post is a rehash of a previous post with the same title except for the 2.0.

For each calendar code we start out with a random forest model that tries to forecast the value of the price with input features consisting of the stock-to-usage numbers of

  • Argentina
  • Brazil
  • China
  • Ukraine
  • United States
  • World
  • World without China

for both corn and soybeans as well as the number of days the contract has until expiry. The feature importance is done with classification models where we bin the prices into deciles. The model is then trained to predict the price decile. We use six techniques to compare feature importance of the trained classifier:

  • MDI - Mean Decrease Impurity
  • MDA - Mean Decrease Accuracy
  • SFI - Single Feature Importance
  • CFI - Clustered Feature Importance
  • SHAP - Shapley Feature Importance
  • PCA - Principal Component Analysis
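As a rough illustration of the first three techniques, here is a minimal sketch on synthetic data. It is not the code behind this post: the feature names are hypothetical, two classes are used instead of deciles to keep it small, and scikit-learn's `permutation_importance` stands in for MDA.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import cross_val_score, train_test_split

rng = np.random.default_rng(3)
names = ["informative", "weak", "noise"]  # hypothetical feature names
X = rng.normal(size=(600, 3))
signal = 2.0 * X[:, 0] + 0.5 * X[:, 1] + 0.5 * rng.normal(size=600)
y = (signal > np.median(signal)).astype(int)  # two classes keep the sketch small

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)
clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_tr, y_tr)

# MDI: impurity decrease accumulated over all splits on each feature.
mdi = dict(zip(names, clf.feature_importances_))

# MDA (approximated here by permutation importance): drop in held-out
# accuracy when the values of one feature are shuffled.
perm = permutation_importance(clf, X_te, y_te, n_repeats=10, random_state=0)
mda = dict(zip(names, perm.importances_mean))

# SFI: cross-validated accuracy of a model that sees one feature only.
sfi = {
    name: cross_val_score(
        RandomForestClassifier(n_estimators=50, random_state=0),
        X[:, [i]], y, cv=5,
    ).mean()
    for i, name in enumerate(names)
}
```

Each dictionary gives one ranking of the same features, and it is these rankings that are compared against each other.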

The PCA method is used to calculate the weighted tau statistic. The idea is to see how correlated the principal components and the chosen features are. The higher the weighted tau number the better.
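A minimal sketch of how such a weighted tau could be computed with `scipy.stats.weightedtau` follows. The importance scores and PCA ranks are made-up numbers, and comparing against the inverse PCA rank is an assumption about the setup rather than a description of the exact code used here.

```python
import numpy as np
from scipy.stats import weightedtau

# Hypothetical importance scores from one method (e.g. MDI), one per feature.
importance = np.array([0.30, 0.25, 0.20, 0.15, 0.10])
# Hypothetical PCA rank of the same features (1 = largest eigenvalue).
pca_rank = np.array([1, 2, 4, 3, 5])

# Inverting the rank gives large values to the leading components, so
# agreement among the most important features carries the most weight.
tau, _ = weightedtau(importance, 1.0 / pca_rank)

# A method whose ordering is reversed relative to PCA scores lower.
tau_reversed, _ = weightedtau(importance[::-1], 1.0 / pca_rank)
```

A method that ranks features in roughly the same order as the PCA gets a value close to 1, while a method that disagrees on the leading features gets a low or negative value.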

Next we train regression models on the reduced feature space. These models are then used within the fingerprint method of Li, Turkington and Yazdani to study the linear and non-linear effects present in the resulting models. This method helps us eliminate even more redundant features by allowing us to keep only those features responsible for the majority of the linear, non-linear and interaction effects present within the features.
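The idea behind the linear and non-linear effects can be sketched as follows. This is a simplified illustration on synthetic data, not the implementation used for this post: the partial dependence is evaluated on a small quantile grid, the helper `fingerprint` is a hypothetical name, and interaction effects are left out.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
# Synthetic stand-in data: x0 acts linearly, x1 non-linearly, x2 is noise.
X = rng.uniform(-1, 1, size=(400, 3))
y = 2.0 * X[:, 0] + np.cos(3.0 * X[:, 1]) + 0.1 * rng.normal(size=400)

model = RandomForestRegressor(n_estimators=100, random_state=0).fit(X, y)

def fingerprint(model, X, col, n_grid=20):
    """Return the (linear, non-linear) effect of one feature."""
    grid = np.quantile(X[:, col], np.linspace(0.05, 0.95, n_grid))
    pd_vals = np.empty(n_grid)
    for i, value in enumerate(grid):
        X_mod = X.copy()
        X_mod[:, col] = value  # hold the feature fixed at the grid value
        pd_vals[i] = model.predict(X_mod).mean()  # partial dependence
    # Best straight line through the partial dependence curve.
    slope, intercept = np.polyfit(grid, pd_vals, 1)
    linear_fit = slope * grid + intercept
    # Linear effect: how far the line moves away from the average prediction.
    linear = np.mean(np.abs(linear_fit - pd_vals.mean()))
    # Non-linear effect: how far the curve deviates from that line.
    nonlinear = np.mean(np.abs(pd_vals - linear_fit))
    return linear, nonlinear

effects = {f"x{c}": fingerprint(model, X, c) for c in range(3)}
```

In this toy setup x0 should show a mostly linear effect, x1 a mostly non-linear one, and x2 almost nothing, which is the kind of pattern used to decide which features to drop.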

Finally, after the main features have been extracted, we train classification and regression models on the chosen features. We do this to give two different but related points of view. First, from the classification models we can determine the probability of the price being in a particular decile, and study how these probabilities change as the input features change. Second, we use all of the trees in the regression models to produce regression statistics. Here we are particularly interested in the 25th to 75th percentile of the regression models.
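Both points of view can be illustrated with a short sketch. The data, the two features and the `latest` input row are synthetic stand-ins, and using scikit-learn's random forests is an assumption about tooling.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor

rng = np.random.default_rng(1)
# Synthetic stand-ins: two features and a price that depends on them.
X = rng.uniform(5, 25, size=(500, 2))
price = 600 - 10 * X[:, 0] + 3 * X[:, 1] + rng.normal(0, 10, size=500)

# Classification view: label each price with its decile (0 = lowest).
decile = pd.qcut(price, 10, labels=False)
clf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, decile)

latest = np.array([[12.0, 15.0]])  # hypothetical latest feature values
decile_probs = clf.predict_proba(latest)[0]  # one probability per decile

# Regression view: collect one forecast per tree and summarise them.
reg = RandomForestRegressor(n_estimators=200, random_state=0).fit(X, price)
tree_preds = np.array([tree.predict(latest)[0] for tree in reg.estimators_])
p25, med, p75 = np.percentile(tree_preds, [25, 50, 75])
avg = tree_preds.mean()
```

Changing the values in `latest` and recomputing `decile_probs` shows how the decile probabilities respond to the input features.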

In each of the sections that follow we consider a single Corn contract.

This is a technical piece, but it allows us to look under the hood of the black box and gives a better understanding of how the model will perform under different circumstances.

H

H - Feature Importance

The table below shows the weighted tau values of the different feature importance techniques employed.

code method weighted_tau
H SHAP 0.567
H MDI 0.175
H CFI 0.090
H MDA -0.153
H SFI -0.277

Below we show the feature importances of the top three positive weighted tau methods.

From the above the model features reduce to

  • C_unitedstates
  • C_world
  • C_worldnochina
  • S_world
  • S_unitedstates
  • daysdiff
  • crude
  • S_china

H - Fingerprint Method

The plot below shows the normalised linear and non-linear effects in a single graph for easy comparison. Notice the substantial contribution from the non-linear effects.

From the plot above notice that the linear and non-linear effects attributable to

  • S_brazil
  • S_worldnochina
  • S_world

are very small. In the following we remove these features from the analysis.

The plot below shows the linear and non-linear effects as captured by the fingerprint method. These effects can be measured and compared by the shaded areas in the facet plot below. The greater the shaded area, the greater the effect. The linear fit inside each facet captures the movement in the price when the feature values are changed. Elbows in the data are interesting: they show that the price varies greatly on either side of certain threshold values of the underlying feature. An example of this below is the C_world feature at a stock-to-usage of around 13%.

H - Model Results

The tables below show the model performance data. Note that the Random Forest model improves on the out-of-sample accuracy of the logistic regression.

sample         type            Logistic Regression  Random Forest
in sample      classification  0.44                 1.00
oob            classification  NA                   0.45
out of sample  classification  0.19                 0.38

The out-of-sample regression results are also much better than those of the linear models. The two linear models give the same results as each other, while the Random Forest model outperforms them on unseen data (0.62 against 0.46). A reason for this may be that the Random Forest model is better able to capture the non-linear and interaction effects that might be present in the features.

sample         type        Lasso Regression  Linear Regression  Random Forest
in sample      regression  0.78              0.78               0.97
oob            regression  NA                NA                 0.83
out of sample  regression  0.46              0.46               0.62
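For readers unfamiliar with the three sample types, the sketch below shows one way such scores can be produced. It uses synthetic data with a deliberately non-linear target, a random train/test split and R² as the score, all of which are assumptions made for illustration rather than details of the models above.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(2)
X = rng.uniform(-1, 1, size=(600, 3))
# Synthetic target with a linear term, a non-linear term and an interaction.
y = (0.5 * X[:, 2] + 2.0 * np.abs(X[:, 0]) + X[:, 0] * X[:, 1]
     + 0.1 * rng.normal(size=600))

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# oob_score=True scores every training row using only the trees that
# did not see that row in their bootstrap sample.
forest = RandomForestRegressor(
    n_estimators=200, oob_score=True, random_state=0
).fit(X_train, y_train)
linear = LinearRegression().fit(X_train, y_train)

# (linear model, random forest) score per sample type; OOB exists only
# for the forest, which is why the tables show NA for the linear models.
scores = {
    "in sample": (linear.score(X_train, y_train), forest.score(X_train, y_train)),
    "oob": (None, forest.oob_score_),
    "out of sample": (linear.score(X_test, y_test), forest.score(X_test, y_test)),
}
```

The out-of-bag score gives an estimate of performance on unseen data without touching the held-out set, which is why it sits between the in-sample and out-of-sample rows.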

The plot below shows the probability associated with each of the price deciles using the latest fundamental data as input parameters.

The table below shows the regression model results.

code p25 med avg p75 expiry price
H 389.23 397.83 401.15 404.51 18698 345.25

K

K - Feature Importance

The table below shows the weighted tau values of the different feature importance techniques employed.

code method weighted_tau
K SHAP 0.632
K MDI 0.265
K CFI 0.202
K MDA 0.196
K SFI -0.242

Below we show the feature importances of the top three positive weighted tau methods.

From the above the model features then become

  • C_argentina
  • C_china
  • C_unitedstates
  • C_world
  • C_worldnochina
  • S_argentina
  • S_china
  • S_unitedstates
  • S_world
  • S_worldnochina
  • crude
  • daysdiff
  • dollarindex
  • ruble

K - Fingerprint Method

The plot below shows the normalised linear and non-linear effects in a single graph for easy comparison. Notice the substantial contribution from the non-linear effects. From the plot below it is clear that the most important features are

  • C_unitedstates
  • C_world
  • C_worldnochina
  • S_china
  • S_unitedstates
  • crude
  • daysdiff

The plot below shows the linear and non-linear effects as captured by the fingerprint method. These effects can be measured and compared by the shaded areas in the facet plot below. The greater the shaded area, the greater the effect. The linear fit inside each facet captures the movement in the price when the feature values are changed. Below we only show the data for the top features according to the fingerprint method.

K - Model Results

The tables below show the model performance data. Note that the Random Forest model improves on the out-of-sample accuracy of the logistic regression. The out-of-sample regression results are also much better than those of the linear models.

sample         type            Logistic Regression  Random Forest
in sample      classification  0.44                 1.00
oob            classification  NA                   0.42
out of sample  classification  0.15                 0.38

sample         type        Lasso Regression  Linear Regression  Random Forest
in sample      regression  0.82              0.82               0.98
oob            regression  NA                NA                 0.85
out of sample  regression  0.61              0.61               0.70

The plot below shows the probability associated with each of the price deciles using the latest fundamental data as input parameters.

The table below shows the regression model results.

code p25 med avg p75 expiry price
K 402.96 409.42 418.43 431.73 18761 352.5

N

N - Feature Importance

The table below shows the weighted tau values of the different feature importance techniques employed.

code method weighted_tau
N CFI 0.090
N SFI -0.073
N MDI -0.114
N MDA -0.170
N SHAP -0.184

Below we show the feature importances of the top three positive weighted tau methods.

Since the weighted tau of CFI is much greater than that of the other methods, we consider the most important features to be formed by the intersection of the top features in these two methods. From the above, the model features then become

  • dollarindex
  • ruble
  • C_argentina
  • C_china
  • C_unitedstates
  • C_world
  • C_worldnochina
  • S_argentina
  • S_unitedstates
  • S_world
  • S_worldnochina

N - Fingerprint Method

The plot below shows the normalised linear and non-linear effects in a single graph for easy comparison. Notice the substantial contribution from the non-linear effects.

Notice that the effects attributable to

  • C_world
  • C_worldnochina
  • C_unitedstates
  • S_unitedstates
  • ruble

are the main contributing features. It makes sense to retrain the model on these features alone.

The plot below shows the linear and non-linear effects as captured by the fingerprint method. These effects can be measured and compared by the shaded areas in the facet plot below. The greater the shaded area, the greater the effect. The linear fit inside each facet captures the movement in the price when the feature values are changed. Here we have only shown the most dominant features.

N - Model Results

The tables below show the model performance data. Note that the Random Forest model improves on the out-of-sample accuracy of the logistic regression. The out-of-sample regression results are also much better than those of the linear models.

sample         type            Logistic Regression  Random Forest
in sample      classification  0.31                 1.00
oob            classification  NA                   0.36
out of sample  classification  0.08                 0.23

sample         type        Lasso Regression  Linear Regression  Random Forest
in sample      regression  0.80              0.80               0.96
oob            regression  NA                NA                 0.86
out of sample  regression  0.62              0.62               0.77

The plot below shows the probability associated with each of the price deciles using the latest fundamental data as input parameters.

The table below shows the regression model results.

code p25 med avg p75 expiry price
N 382.49 385.5 386.94 390.94 18457 317.25

U

U - Feature Importance

The table below shows the weighted tau values of the different feature importance techniques employed.

code method weighted_tau
U CFI 0.090
U MDA -0.077
U MDI -0.214
U SHAP -0.354
U SFI -0.367

Below we show the feature importances of the top three positive weighted tau methods.

Since the weighted tau of CFI is much greater than that of the other methods, we consider the most important features to be formed by the intersection of the top features in these two methods. From the above, the model features then become

  • dollarindex
  • ruble
  • C_argentina
  • C_china
  • C_unitedstates
  • C_world
  • C_worldnochina
  • S_argentina
  • S_unitedstates
  • S_world
  • S_worldnochina

U - Fingerprint Method

The plot below shows the normalised linear and non-linear effects in a single graph for easy comparison. Notice the substantial contribution from the non-linear effects.

Notice that the effects attributable to

  • C_unitedstates
  • C_world
  • S_unitedstates
  • C_worldnochina
  • C_argentina
  • dollarindex

are the main contributing features. It makes sense to retrain the model on these features alone.

The plot below shows the linear and non-linear effects as captured by the fingerprint method. These effects can be measured and compared by the shaded areas in the facet plot below. The greater the shaded area, the greater the effect. The linear fit inside each facet captures the movement in the price when the feature values are changed. Here we only show the data for the above features.

U - Model Results

The tables below show the model performance data. Note that the Random Forest model improves on the out-of-sample accuracy of the logistic regression. The out-of-sample regression results are also much better than those of the linear models.

sample         type            Logistic Regression  Random Forest
in sample      classification  0.31                 0.97
oob            classification  NA                   0.31
out of sample  classification  0.12                 0.42

sample         type        Lasso Regression  Linear Regression  Random Forest
in sample      regression  0.68              0.68               0.96
oob            regression  NA                NA                 0.85
out of sample  regression  0.64              0.64               0.87

The plot below shows the probability associated with each of the price deciles using the latest fundamental data as input parameters.

The table below shows the regression model results.

code p25 med avg p75 expiry price
U 383.89 387.97 388.41 393.27 18519 321.5

Z

Z - Feature Importance

The table below shows the weighted tau values of the different feature importance techniques employed.

code method weighted_tau
Z SHAP 0.362
Z MDA 0.300
Z CFI 0.200
Z MDI -0.027
Z SFI -0.230

Below we show the feature importances of the top three positive weighted tau methods.

From the above the model features then become

  • C_argentina
  • C_china
  • C_russia
  • C_ukraine
  • C_unitedstates
  • C_world
  • C_worldnochina
  • S_argentina
  • S_brazil
  • S_china
  • S_russia
  • S_unitedstates
  • S_world
  • S_worldnochina
  • crude
  • daysdiff
  • dollarindex
  • ruble

Z - Fingerprint Method

The plot below shows the normalised linear and non-linear effects in a single graph for easy comparison. Notice the substantial contribution from the non-linear effects.

The plot below shows the linear and non-linear effects as captured by the fingerprint method. These effects can be measured and compared by the shaded areas in the facet plot below. The greater the shaded area, the greater the effect. The linear fit inside each facet captures the movement in the price when the feature values are changed. Below we only show results for the top features.

Z - Model Results

The tables below show the model performance data. Note that the Random Forest model improves on the out-of-sample accuracy of the logistic regression. The out-of-sample regression results are also much better than those of the linear models.

sample         type            Logistic Regression  Random Forest
in sample      classification  0.44                 1.00
oob            classification  NA                   0.44
out of sample  classification  0.38                 0.58

sample         type        Lasso Regression  Linear Regression  Random Forest
in sample      regression  0.69              0.69               0.97
oob            regression  NA                NA                 0.79
out of sample  regression  0.69              0.69               0.83

The plot below shows the probability associated with each of the price deciles using the latest fundamental data as input parameters.

The table below shows the regression model results.

code p25 med avg p75 expiry price
Z 388.75 392.69 393.38 396.05 18610 332

Grouped Forecasts

In the image below we show the forecasted price interval for each of the contract codes. The solid and dashed black lines represent the median and average values of the model forecasts. The blue shaded region shows the 25th to 75th percentile of the model forecasts. The current prices are given by the red line.

Note that these forecasts should not be used for dealing with calendar spreads. The process used to model calendar spreads is different, and more reliable, since calendar spreads have a much more well-defined range.

From the model predictions above we can see that corn is undervalued all along the futures curve. Naturally, these models have not seen the COVID-19 virus and negative oil prices before, and they might not generalise well under these conditions.
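The size of the gap can be read off the regression tables in the sections above. The short calculation below copies the p25, median and current price for each contract code from those tables. The variable names are only for illustration.

```python
# (code, p25, median, current price) copied from the regression tables above.
forecasts = [
    ("H", 389.23, 397.83, 345.25),
    ("K", 402.96, 409.42, 352.50),
    ("N", 382.49, 385.50, 317.25),
    ("U", 383.89, 387.97, 321.50),
    ("Z", 388.75, 392.69, 332.00),
]

# Relative gap between the median forecast and the current price.
premium = {code: (med - price) / price for code, _, med, price in forecasts}

# Contracts whose current price lies below the 25th percentile forecast.
below_p25 = [code for code, p25, _, price in forecasts if price < p25]

for code, value in premium.items():
    print(f"{code}: median forecast is {value:.1%} above the current price")
```

For every contract the current price sits below even the 25th percentile of the forecasts, which is what the statement about undervaluation rests on.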

Conclusion

This post investigates which fundamental features are the main drivers of the price of corn. The features we investigated consist of the stock-to-usage numbers of

  • Argentina
  • Brazil
  • China
  • Ukraine
  • United States
  • World
  • World without China

for both corn and soybeans as well as the number of days the contract has to expiry. We also include energy and dollar proxies with the average value of crude in the month prior and the value of the dollar index.

We made use of modern feature importance methods to determine the main driving features within each of the models. These methods include

  • MDI - Mean Decrease Impurity
  • MDA - Mean Decrease Accuracy
  • SFI - Single Feature Importance
  • CFI - Clustered Feature Importance
  • SHAP - Shapley Feature Importance
  • PCA - Principal Component Analysis
  • Weighted tau
  • Fingerprint method

In nearly all the cases shown we see a marked improvement in the out-of-sample test results compared to the standard linear models that are widely used in the literature as well as by industry practitioners. A likely reason is that the simple machine learning model is able to capture not only the linear but also the non-linear and interaction effects present in the features.

Mauritz van den Worm
Portfolio Manager and Quantitative Researcher

My research interests include the use of artificial intelligence in managing commodity portfolios
