Learning trading rules by enforcing structured sparsity on an MLP
Trading rules for market entry
To specify conditions for market entry, traders first define factors or variables used to represent a market state and then construct a rule based on these variables.
A hypothetical trading rule may look like this:
\(x_1 / x_2 > k_1\) AND \(x_3 - x_4 > k_2 \Rightarrow \text{BUY}\) (1)
In the given example, the input variables \(x_1, x_2, x_3, x_4\) represent the market state, and the logical expression specifies the conditions under which a trading signal is generated.
By playing with the rule or back-testing it on market data, the trader will further try to adjust the threshold levels \(k_1, k_2\) to estimate the impact of the changes on the frequency of trading, the gross/net P&L and the risk profile of the trading system.
To increase turnover and P&L, the trader can try to relax the threshold levels. As a side effect, it is very likely that some important risk metrics will degrade. In an attempt to fix the issues with risk, the trader can move the thresholds back to more extreme levels.
In other words, the expected payout from trades triggered by rule signals is not constant across the potential solution space and depends upon the state of the market defined by \([x_1,x_2,x_3,x_4]\).
Assuming fixed profit-taking and stop-loss levels (in basis points) for each trade, the payout depends linearly on the expected probability of a win in the given market state, which can be thought of as the strength of the trading signal in the particular market condition.
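For instance, if \(TP\) and \(SL\) denote the fixed profit-taking and stop-loss levels (in basis points) and \(p(\textbf{x})\) is the probability of a win in market state \(\textbf{x}\), the expected payout per trade is
\[
E[\text{payout} \mid \textbf{x}] = p(\textbf{x}) ⋅ TP - (1 - p(\textbf{x})) ⋅ SL
\]
which is linear in \(p(\textbf{x})\), so learning the strength of the signal amounts to estimating this probability.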
Multilayer Perceptron (MLP) to learn trading rules
A Multilayer Perceptron (MLP) is a multilayer feed-forward neural network. Each layer of such a network first performs a linear transformation of the input vector (the output of the previous layer) and then applies a non-linear activation function to the result of the linear operation.
The linear operation can be expressed as a matrix multiplication followed by a bias addition \(\textbf{z}=\textbf{W}⋅\textbf{x}+\textbf{b}\), where \(\textbf{W}\) is the weight matrix, \(\textbf{x}\) is the input vector (output of the previous layer), \(\textbf{b}\) is the bias vector and \(\textbf{z}\) is the output vector.
The non-linear transformation is a point-wise operation \(\textbf{a}(\textbf{z})\) applied to the output of the linear operation.
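As a minimal sketch of a single layer (in PyTorch, with illustrative dimensions), the two operations look like this:

```python
import torch

# Illustrative sizes: 4 inputs (market-state variables), 2 neurons in the layer
W = torch.randn(2, 4)   # weight matrix
b = torch.randn(2)      # bias vector
x = torch.randn(4)      # input vector (output of the previous layer)

z = W @ x + b           # linear transformation: z = W·x + b
a = torch.sigmoid(z)    # point-wise non-linear activation a(z)
```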
The MLP architecture to learn trading rules (the fully connected network case)
One possible architecture to learn trading rules with a fully connected MLP is described below.
Input
This is the input vector representing our market state. For the rule (1) it contains 4 variables defining the market state:
\(\textbf{x} = [x_1,x_2,x_3,x_4]\)
The first hidden layer
The first hidden layer is used to learn the components of the rule (for example, the inequalities in rule (1)). In the fully connected MLP, each component is represented by a linear operation:
\(z(\textbf{x}) = w_1⋅x_1 + w_2⋅x_2 + w_3⋅x_3 + w_4⋅x_4 + b\) (2)
followed by a non-linear activation function used to learn the comparison operation. For the non-linear transformation we can use well-known functions:
sigmoid \(\sigma_k(z) = 1 / (1 + \exp(-k⋅z))\), \(\tanh\), \(ReLU\)
or try more exotic ones, e.g. \(swish\):
\(swish_k(z) = z ⋅ \sigma_k(z)\)
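To see how a steep sigmoid approximates a comparison, here is a hypothetical sketch of the first rule component (the threshold \(k_1\) and the steepness \(k\) are illustrative values, not taken from the original rule):

```python
import torch

def component(x1, x2, k1=1.2, k=50.0):
    """Soft version of the rule component x1 - k1*x2 > 0.

    A large steepness k pushes sigma_k towards a hard step function,
    so the output is close to 1 when the inequality holds
    and close to 0 otherwise.
    """
    z = x1 - k1 * x2             # linear part of the component
    return torch.sigmoid(k * z)  # sigma_k(z) = 1 / (1 + exp(-k*z))

print(component(torch.tensor(1.5), torch.tensor(1.0)))  # ~1: condition holds
print(component(torch.tensor(1.0), torch.tensor(1.0)))  # ~0: condition fails
```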
Subsequent hidden layers
The subsequent hidden layers are used to learn more complex logical expressions over the components of the trading rule. The most popular operator (in my own practice) is AND, but OR, XOR and other operators can also be applied; a hand-weighted example of a soft AND is sketched below.
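As a toy illustration (hand-picked weights, not learned ones): a neuron with weights \((1, 1)\) and bias \(-1.5\) over two component activations \(a_1, a_2 \in [0,1]\) fires only when both are close to 1.

```python
import torch

def soft_and(a1, a2, k=20.0):
    # The linear part a1 + a2 - 1.5 is positive only when
    # both component activations are close to 1.
    z = a1 + a2 - 1.5
    return torch.sigmoid(k * z)

print(soft_and(torch.tensor(1.0), torch.tensor(1.0)))  # ~1: both components fire
print(soft_and(torch.tensor(1.0), torch.tensor(0.0)))  # ~0: one component fails
```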
Output layer
The output layer is used to map the output of the MLP to a one-dimensional output in the interval \([0,1]\).
Sigmoid \(\sigma_k(z)\) could be selected as the activation function for this layer.
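Putting the layers together, a minimal PyTorch sketch of this architecture could look as follows (layer sizes are illustrative and correspond to rule (1)):

```python
import torch
from torch import nn

# Fully connected MLP for rule (1):
# 4 market-state inputs -> 2 rule components -> logical layer -> signal in [0, 1]
model = nn.Sequential(
    nn.Linear(4, 2),   # first hidden layer: learns the rule components
    nn.Sigmoid(),      # soft comparison operation
    nn.Linear(2, 2),   # subsequent hidden layer: learns logical expressions (e.g. AND)
    nn.Sigmoid(),
    nn.Linear(2, 1),   # output layer
    nn.Sigmoid(),      # maps the output to the interval [0, 1]
)

x = torch.randn(8, 4)   # a batch of 8 market states
signal = model(x)       # shape (8, 1), values in [0, 1]
```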
Enforcing a “structured” sparsity on MLP
Let’s return now to the trading rule (1). It has two components. The first component is a kind of inequality between the variables \(x_1\) and \(x_2\); assuming \(x_2 > 0\), the condition \(x_1 / x_2 > k_1\) can be rewritten as:
\(x_1 - k_1 ⋅ x_2 > 0\) (3)
The second component is a kind of inequality between the variables \(x_3\) and \(x_4\):
\(x_3 - x_4 - k_2 > 0\) (4)
For the sake of simplicity, let's assume the first layer contains two neurons, and let's explicitly enforce a sparsity condition on the weight matrix \(\textbf{W}\) of the first layer.
Then the linear transformation for the left hand side of (3) will look like:
\(z_1(x_1, x_2, x_3, x_4) = w_{11}⋅x_1 + w_{12}⋅x_2 + \textbf{0}⋅ x_3 + \textbf{0}⋅ x_4 + b_1\) (5)
And for the left hand side of (4):
\(z_2(x_1, x_2, x_3, x_4) = \textbf{0}⋅x_1 + \textbf{0}⋅x_2 + w_{23} ⋅ x_3 + w_{24} ⋅ x_4 + b_2\) (6)
Matrix \(\textbf{W}\) of the first layer with enforced sparsity condition will then look like:
\[
W = \begin{pmatrix}
w_{11} & w_{12} & 0 & 0 \\
0 & 0 & w_{23} & w_{24} \\
\end{pmatrix}
\]
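One simple way to enforce this pattern is to multiply the weight matrix by a fixed 0/1 mask on every forward pass: the masked weights stay at zero and receive zero gradients. A minimal PyTorch sketch (the `MaskedLinear` class is an illustrative helper, not a library API):

```python
import torch
from torch import nn

class MaskedLinear(nn.Linear):
    """Linear layer whose weights are multiplied by a fixed 0/1 mask.

    Entries where mask == 0 are forced to zero on every forward pass,
    so their gradients are zero as well and they are never updated.
    """
    def __init__(self, in_features, out_features, mask):
        super().__init__(in_features, out_features)
        self.register_buffer("mask", mask)  # fixed, not a trainable parameter

    def forward(self, x):
        return nn.functional.linear(x, self.weight * self.mask, self.bias)

# Sparsity pattern of the first layer for rule (1):
# neuron 1 sees only (x1, x2), neuron 2 sees only (x3, x4)
mask = torch.tensor([[1., 1., 0., 0.],
                     [0., 0., 1., 1.]])
first_layer = MaskedLinear(4, 2, mask)
```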
What do we get by applying the “structured” sparsity?
First of all, this zero-cost 'trick' tweaks the original fully connected MLP to better fit your original trading rule design.
In general, grouping of input variables by enforcing sparsity condition makes sense if:
- Grouping is supported by theory or empirical evidence, so it can guide the model to find a more interpretable solution.
- You want to prevent the model from establishing non-existent relationships between uncorrelated variables fed to it.
Sparse networks also consume fewer resources, as they can be optimized to skip computations involving zero values and to skip the estimation of gradients for zero weights on the backward pass (for gradient-based optimization).
Further steps in tweaking MLP architecture to fit trading rule design
The next and even more radical step in fitting the MLP to the trading rule design is to assign constants to some elements of the weight matrix and bias vector, leaving just a few parameters of the first layer of the network to learn (for example, the thresholds).
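A hypothetical sketch of this idea for rule (1): fix the structural weights at \(\pm 1\) and leave only the thresholds \(k_1, k_2\) as trainable parameters. The `RuleLayer` class below is an illustration under these assumptions, not a prescribed implementation:

```python
import torch
from torch import nn

class RuleLayer(nn.Module):
    """First layer with constants from rule (1); only thresholds are learned.

    component 1: x1 - k1*x2      (k1 trainable, structural weights fixed)
    component 2: x3 - x4 - k2    (k2 trainable, structural weights fixed)
    """
    def __init__(self, k=20.0):
        super().__init__()
        self.k1 = nn.Parameter(torch.tensor(1.0))  # trainable threshold k1
        self.k2 = nn.Parameter(torch.tensor(0.0))  # trainable threshold k2
        self.k = k                                 # fixed sigmoid steepness

    def forward(self, x):
        x1, x2, x3, x4 = x.unbind(dim=-1)
        z1 = x1 - self.k1 * x2        # left-hand side of (3)
        z2 = x3 - x4 - self.k2        # left-hand side of (4)
        z = torch.stack([z1, z2], dim=-1)
        return torch.sigmoid(self.k * z)
```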
Experimental Results
A few tests were conducted with the trading model trained first on the fully connected network and then on a sparse version of the network. In the second case, weakly correlated variables were placed into a few different groups.
In both cases the trained models demonstrated comparable performance on out-of-sample runs.
However, training of the sparse network was consistently faster: on average, it took about 5 times fewer training cycles to find an acceptable solution on the sparse network than on the fully connected network.