# 1 Introduction

Here we explore the viability of modelling the price of soybeans as a function of stock-to-usage. The market receives new information about the state of global stocks once a month after the WASDE reports have been published. As the global balance sheets change during the course of the season the expectation of the stock levels left over at the end of the season changes. We aim to model the soybean price along the futures curve as a function of stock-to-usage percentages of the major producing and consuming nations. We add a proxy for energy by looking at the average WTI crude price during the prior month. Furthermore we also consider the dollar strength as measured by the dollar index.

The plot below shows the evolution of the soybean stock-to-usage numbers for the United States and World levels.

We want to connect these stock-to-usage numbers with price of the corresponding soybean futures contracts. To do this we connect the price data between two successive WASDE reports with the first report and aggregate the results. As an example consider two reports dated 2018-05-11 and 2018-06-12 respectively. All price data between those two dates are associated with the first date.

The images below give a graphical representation of the data. The x- and y-axes represent the Stock-to-Usage and Price of the July contract respectively.
From the images we can distinguish between two different regimes roughly corresponding to before and after 2007. This can be seen by the clear separation between blue and orange points in the plot below. I am not sure if there is some fundamental reason for the separation of not. It might have to do with pre and post GMO.

# 2 Deterministic Model

From the bubble chart above it looks like a linear model should be sufficient to model the July corn price as a function of stock-to-usage. Here we look at a couple slight modifications to improve upon the simple linear model. It looks like the data clumps together around \$4 for stock-to-usages greter than 15. We see that the prices are decreasing at a slower rate with increasing stock-to-usage numbers. Linear models on the other hand assume a constant rate of decrease. Here we look at two alternative models, a power-law and exponential model, both of which have decreasing rates of change.

To find the best model it amounts to looking at the three different graphs below and deciding which has the best linear fit to the data. The equation describing the models are given below

Linear: $y = x \times m + c$

Power-Law: $y = x^{m} \times \exp\left(c\right)$

Exponential: $y = \exp\left( x \times m + c \right)$ In all three equations above $$x$$ and $$y$$ represent stock-to-usage and price respectively.

By eye the results look fairly similar. The table below summarises the results and provides the coefficients to plug into the model. The best fit models seems to be the Power-Law model. For best results we recommend averaging the results of the three models.

The table below summarises the results of the model fitting. Each cell shows the R-squared value of the fit. The models with the greatest R-squared values are shown at the top. From this naive in-sampe point of view we can see that United States Corn Stock-to-Usage is the best predictor followed by world and world withouth China Stock-to-Usages. The next most important features are Mean Month Prior Crude and the Dollar Index. In the following we have a closer look at the relationship between price and the main predictive features according to the table below.

variable exponential linear power law
unitedstates_Oilseed, Soybean_s2u 0.4395780 0.4321285 0.5710330
crude 0.5895444 0.5790341 0.5597231
dollarindex 0.4290438 0.4177805 0.4334389
ruble 0.3619882 0.3530456 0.3468072
worldnochina_Oilseed, Soybean_s2u 0.2625536 0.2691300 0.2795387
world_Oilseed, Soybean_s2u 0.2213625 0.2252969 0.2227520
argentina_Oilseed, Soybean_s2u 0.1521795 0.1573237 0.1698917
brazil_Oilseed, Soybean_s2u 0.0015182 0.0008845 0.0052087
china_Oilseed, Soybean_s2u 0.0082744 0.0083886 0.0006596

The model with the best fit is a power-law using the United States stock-to-usage percentage as input variable. The model coefficients are given in the table below.

model Rsq m c code variable
linear 0.4823355 6.5006113 599.854044 F crude
power law 0.4675752 0.4078876 5.231526 F crude
exponential 0.4977970 0.0057073 6.547623 F crude
linear 0.4138634 -28.8186056 1355.015189 F unitedstates_Oilseed, Soybean_s2u
power law 0.5854263 -0.3014871 7.618989 F unitedstates_Oilseed, Soybean_s2u
exponential 0.4342110 -0.0255076 7.212684 F unitedstates_Oilseed, Soybean_s2u
linear 0.5559664 6.7695757 592.863438 H crude
power law 0.5345403 0.4141386 5.218223 H crude
exponential 0.5679834 0.0059509 6.541570 H crude
linear 0.4091224 -27.7987030 1352.732936 H unitedstates_Oilseed, Soybean_s2u
power law 0.5630872 -0.2855880 7.589598 H unitedstates_Oilseed, Soybean_s2u
exponential 0.4205695 -0.0245096 7.210308 H unitedstates_Oilseed, Soybean_s2u
linear 0.5871824 6.7453400 602.047083 K crude
power law 0.5562028 0.4100565 5.243708 K crude
exponential 0.5942793 0.0058951 6.553050 K crude
linear 0.4190105 -27.5231946 1358.016683 K unitedstates_Oilseed, Soybean_s2u
power law 0.5675670 -0.2772393 7.577671 K unitedstates_Oilseed, Soybean_s2u
exponential 0.4306562 -0.0242366 7.215487 K unitedstates_Oilseed, Soybean_s2u
linear 0.5790341 6.6052523 619.260404 N crude
power law 0.5597231 0.4009241 5.289391 N crude
exponential 0.5895444 0.0057892 6.567674 N crude
linear 0.4321285 -28.0275882 1365.802990 N unitedstates_Oilseed, Soybean_s2u
power law 0.5710330 -0.2720425 7.569552 N unitedstates_Oilseed, Soybean_s2u
exponential 0.4395780 -0.0245538 7.221882 N unitedstates_Oilseed, Soybean_s2u
linear 0.5975118 6.4673339 624.944589 Q crude
power law 0.5730959 0.3885720 5.338746 Q crude
exponential 0.6071269 0.0057222 6.569714 Q crude
linear 0.4151060 -25.7105105 1335.983732 Q unitedstates_Oilseed, Soybean_s2u
power law 0.5485384 -0.2531438 7.522352 Q unitedstates_Oilseed, Soybean_s2u
exponential 0.4215474 -0.0227417 7.198769 Q unitedstates_Oilseed, Soybean_s2u
linear 0.6127232 5.9661302 644.047609 U crude
power law 0.5843266 0.3610140 5.441296 U crude
exponential 0.6205875 0.0053886 6.580067 U crude
linear 0.3815607 -22.0924687 1280.633436 U unitedstates_Oilseed, Soybean_s2u
power law 0.5079560 -0.2231451 7.440573 U unitedstates_Oilseed, Soybean_s2u
exponential 0.3847061 -0.0199087 7.154625 U unitedstates_Oilseed, Soybean_s2u
linear 0.7123302 5.9810608 640.341713 X crude
power law 0.6786102 0.3643230 5.426597 X crude
exponential 0.7215076 0.0054674 6.572670 X crude
linear 0.3305622 -19.1657599 1248.161299 X unitedstates_Oilseed, Soybean_s2u
power law 0.4394429 -0.1964125 7.378100 X unitedstates_Oilseed, Soybean_s2u
exponential 0.3286771 -0.0173583 7.126861 X unitedstates_Oilseed, Soybean_s2u

Taking the values from the table above we plot the model predictions in blue. The latest USDA United States stock-to-usage is given by the vertical orange line. The horizontal orange line gives the latest S N9 price. The results can be interpreted in two ways. If we take the USDA numbers as the truth we need to see a downward adjustment in price. On the other hand we can imply a stock-to-usage from the latest price. Currently this number is much less than that reported by the USDA.

The plot below takes the data from above and constructs a futures curve for each input feature and model.

# 3 Probabilistic Model

If we discretise the stock-to-usage percentages we are able to do some statistics on the values of the prices given stock-to-usage (or any other feature) in the discretised basket. In this way we can perform Bayesian statistics on the prices, i.e. given a forecast on the stock-to-usage we can determine the probability that the price is contained withing some interval.

In the subsections below we show plots of the price statistics when the value of the underlyiing feature falls within the bucket specified on the x-axis. The solid black line shows the median price. The light and dark shaded regions show the 10th to 90th and 25th to 75th percentiles. The fat of the distributions lie withing the dark shaded region. For reference we also show the USDA and Polar Star fundamental forecast together with the latest price data. These are represented by the vertical and horizontal lines respectively. The same data used to create the images is also given in tabular form below the plots.

## 3.1 United States Stock-to-Usage

p10 p25 p50 p75 p90
(4.03,5.97] 1229.950 1281.2500 1372.250 1427.500 1464.550
(5.97,7.89] 900.950 944.1250 976.750 1038.500 1201.850
(7.89,9.81] 935.300 1004.8750 1184.500 1302.875 1421.100
(9.81,11.7] 950.750 975.7500 1004.000 1035.750 1066.000
(11.7,13.6] 879.075 894.8750 965.625 1039.250 1057.425
(13.6,15.6] 894.450 900.2500 921.750 939.750 945.500
(15.6,17.5] 933.100 935.7500 942.125 956.500 959.650
(17.5,19.4] 884.800 898.7500 914.000 923.875 939.700
(19.4,21.3] 891.250 900.8125 912.000 922.375 934.800
(21.3,23.3] 881.600 912.0000 923.250 938.750 946.850

## 3.2 World Stock-to-Usage

p10 p25 p50 p75 p90
(23.3,25.3] 1011.350 1052.3750 1291.000 1429.875 1504.600
(25.3,27.3] 917.000 977.9375 1309.625 1403.812 1452.000
(27.3,29.3] 941.300 995.3750 1254.875 1376.250 1419.375
(29.3,31.2] 956.150 977.5000 1034.000 1166.750 1249.950
(31.2,33.2] 881.975 916.0000 1041.625 1260.750 1307.950
(33.2,35.2] 900.950 940.2500 986.000 1182.250 1309.450
(35.2,37.2] 915.300 944.3750 994.750 1014.375 1029.100
(37.2,39.2] 877.150 892.6875 913.750 937.750 1092.950
(39.2,41.2] 931.050 970.2500 997.000 1048.875 1062.600
(41.2,43.2] 920.350 931.7500 947.500 990.875 1023.200

## 3.3 World Stock-to-Usage without China

p10 p25 p50 p75 p90
(15.8,16.9] 976.050 1318.6250 1396.000 1444.8750 1507.850
(16.9,18] 996.700 1268.7500 1369.250 1425.0000 1454.850
(18,19.1] 942.850 1018.7500 1223.000 1347.6250 1396.550
(19.1,20.3] 899.250 947.2500 1016.375 1150.0000 1240.000
(20.3,21.4] 883.950 925.0000 972.500 1052.2500 1278.950
(21.4,22.5] 912.900 983.6250 1018.875 1243.0625 1307.000
(22.5,23.6] 913.525 916.8750 939.125 968.1250 979.825
(23.6,24.7] 867.875 890.8125 909.250 919.4375 932.125
(24.7,25.8] 891.250 915.5000 942.750 1052.7500 1106.750
(25.8,27] 941.550 961.5000 985.500 1023.1250 1056.350

## 3.4 Mean Crude

p10 p25 p50 p75 p90
(33.3,43.8] 872.700 883.0000 898.750 956.250 1001.100
(43.8,54.3] 905.000 963.6250 995.000 1022.312 1049.750
(54.3,64.7] 911.925 929.3750 957.750 996.500 1044.750
(64.7,75.2] 890.400 910.1875 933.000 987.500 1056.025
(75.2,85.6] 959.350 991.6250 1040.000 1225.500 1295.050
(85.6,96] 1108.225 1284.5625 1382.375 1433.250 1474.675
(96,106] 1104.500 1235.0000 1299.750 1389.750 1448.050
(106,117] 1148.475 1230.4375 1253.375 1322.688 1367.750
(117,127] 1226.050 1268.8125 1333.125 1370.500 1390.950
(127,138] 1273.600 1336.7500 1428.250 1450.625 1488.900

## 3.5 Dollar Index

p10 p25 p50 p75 p90
(72.1,75.2] 1254.600 1297.500 1363.500 1396.250 1442.750
(75.2,78.2] 986.500 1057.750 1214.500 1317.125 1382.750
(78.2,81.3] 950.000 1009.500 1262.250 1402.875 1449.500
(81.3,84.3] 948.500 1025.000 1252.750 1417.500 1500.300
(84.3,87.3] 866.875 921.750 1002.125 1039.438 1062.875
(87.3,90.3] 950.900 994.500 1039.500 1056.000 1063.350
(90.3,93.4] 993.100 1000.812 1007.000 1016.375 1023.550
(93.4,96.4] 891.500 912.500 956.250 991.500 1015.750
(96.4,99.4] 879.300 900.625 928.000 965.375 989.350
(99.4,103] 956.800 974.125 1027.750 1053.875 1070.700

# 4 Ensemble Model

We have created ensemble machine learning models that predict the soybean price along the futures curve. These models take as inputs the stock-to-usage percentages of the top soybean producing and consuming nations together with the dollar index and month prior average crude price as proxies for the US Dollar and energy respectively.

The ensemble models we create are all random forest regression models. We create a train and test split and perform hyper parameter tuning on the training set using 3 fold cross-validation. Ensemble models are a natural extension of the single variable deterministic models in that they are able to gain from possible interactions between the different input features.

From the best models we determine the variable importance of all the input features. The results are sumarised in the plot below. The greater the importance the larger the effect of that feature on the predicted values. The most important feature for each of the two different classes of wheat and contract codes are highlighted in orange.

Notice that the features with greatest importance is crude and uniteststates_Oilseed, Soybean_s2u. In all the cases shown above these two features make up more than 70% of the variable importance of the ensemble models. The table below gives the R-squared values of the ensemble models fitted to the data. Notice the significant improvement over the deterministic models.

In the plot below we aggregate all the feature importances along the curve into a single representation.

R squared
K 0.89
N 0.89
Q 0.92
U 0.70
X 0.83
F 0.78
H 0.77

## 4.1 United States Stock-to-Usage Sensitivity

As the United States Stock-to-Usage percentages increase we expect the price of soybeans to decrease. This intuition is confirmed in the plots below. The y-and x-axis show the prediction and value of United States Stock-to-Usage respectively. Here we fix all parameters to the latest WASDE numbers, but allow the value of United States Stock-To-Usage the change. In the plots below we see the quasi monotonic decreasing relationship between the two variables. We can also see transition values that resembles a phase transition for values of United States Stock-to-Usage around 6.

## 4.2 Crude Sensitivity

As the cost of energy increases we expect the price of soybeans to increase. This intuition is confirmed in the plots below. The y-and x-axis show the prediction and value of crude respectively. Here we fix all parameters to the latest WASDE numbers, but allow the value of the prior month crude price the change. In the plots below we see the monotonic increasing relationship between the two variables. We can also see an elbow forming at crude prices greater than 50. We are currently at crude prices greater than this which might signify a possible increase in the price of soybeans.

# 5 Remove Crude

Here we create models without crude to compare with the previous models. Below we show the feature importances of these new models. As we might have expected, the most dominant features are the ones that were next in line in the original models above.

# 6 Only Crude an United States stock-to-usage

Here we create models using only crude and United States Stock-to-usage to compare with the previous models. Below we show the feature importances of these new models.

The table below shows the R-squared values of the models with and without crude. Notice that the models that contain crude as a feature perform slighlty better than those without crude. Overall the results withour crude are still good.

code all features only crude and USA without crude
F 0.78 0.79 0.80
H 0.77 0.77 0.58
K 0.89 0.88 0.86
N 0.89 0.92 0.71
Q 0.92 0.93 0.92
U 0.70 0.76 0.82
X 0.83 0.83 0.88

# 7 Predictions

The plot below shows the ensemble model predictions for USDA forecasted fundamentals. It is difficult to pin down the value of crude, so we consider a range of values form 55 to 65. Furthermore we consider all the predictions from each of the decision trees model to determine prediction statistics. The normal output of a collection of regression trees is the mean of all the predictions. In the plot below we sohw the 25th to 75th percentiles of the predicted prices, this corresponds to the area between the two gray curves. The latest price data is represented by the black curve. The median model prediction is shown in blue. here we use the median as it is les likely to be skewed by possible outliers.