Training NN with sigmoid() and tanh() activation functions

Sat, Jan 18, 2025

Our experience in training neural networks (NN) using the standard logistic function (sigmoid) and the hyperbolic tangent as the only activation functions of the network was quite negative in the past. However, a minor generalization of these functions improved the situation.

Let's recall the definition of the sigmoid function: \(σ(x) = \frac{1}{1 + exp(-x)}\). Its derivative is defined by \(σ'(x) = σ(x)(1 - σ(x))\).

The hyperbolic tangent is defined by \(tanh(x) = \frac{exp(x) - exp(-x)}{exp(x) + exp(-x)}\). Its derivative is defined by \(tanh'(x) = 1 - tanh^2(x)\).

The relationship between \(σ(x)\) and \(tanh(x)\) is defined by \(tanh(x) = 2σ(2x) - 1\).
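
As a quick numerical sanity check, here is a minimal NumPy sketch (our own illustration, not code from this post) that verifies the two derivative formulas and the identity above:

```python
# Minimal NumPy check of the sigmoid/tanh definitions, their derivatives,
# and the identity tanh(x) = 2*sigmoid(2x) - 1 (illustration only).
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_prime(x):
    s = sigmoid(x)
    return s * (1.0 - s)

def tanh_prime(x):
    return 1.0 - np.tanh(x) ** 2

x = np.linspace(-5.0, 5.0, 101)
h = 1e-6  # step for a central finite-difference check

assert np.allclose(np.tanh(x), 2.0 * sigmoid(2.0 * x) - 1.0)
assert np.allclose(sigmoid_prime(x), (sigmoid(x + h) - sigmoid(x - h)) / (2 * h))
assert np.allclose(tanh_prime(x), (np.tanh(x + h) - np.tanh(x - h)) / (2 * h))
```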

Both activation functions are very popular in the ML community despite the fact that their derivatives tend to zero for large \(x\) (in absolute value), which hurts the learning speed of gradient-based optimization.

Function plots of \(σ(x)\) and \(tanh(x)\):

[figure]

“Slope” parameter for sigmoid and hyperbolic tangent activation functions

A more general form of the logistic function is \(σ(x) = \frac{1}{1 + exp(-k(x-x_0))}\), where k controls the slope of the function and \(x_0\) controls a horizontal shift. Let's ignore the shift for the moment. We then have:

\(σ_k(x) = \frac{1} {1 + exp(-kx)}\)

\(σ’_k(x) = kσ_k(x)(1- σ_k(x))\)

Adding a slope parameter k to the hyperbolic tangent:

\(tanh_k(x) = 2σ_k(2x) - 1\)

\(tanh_k’(x) = k(1- tanh_k^2(x))\)
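A minimal NumPy sketch of these slope-parameterized activations (the function names are ours, chosen for illustration):

```python
# Slope-parameterized sigmoid and tanh and their derivatives, as defined above
# (illustrative sketch; the names sigmoid_k/tanh_k are our own).
import numpy as np

def sigmoid_k(x, k=1.0):
    return 1.0 / (1.0 + np.exp(-k * x))

def sigmoid_k_prime(x, k=1.0):
    s = sigmoid_k(x, k)
    return k * s * (1.0 - s)

def tanh_k(x, k=1.0):
    return 2.0 * sigmoid_k(2.0 * x, k) - 1.0   # equals tanh(k*x)

def tanh_k_prime(x, k=1.0):
    return k * (1.0 - tanh_k(x, k) ** 2)

# Quick finite-difference check for a few slope values
x, h = np.linspace(-3.0, 3.0, 61), 1e-6
for k in (0.5, 1.0, 5.0):
    assert np.allclose(sigmoid_k_prime(x, k), (sigmoid_k(x + h, k) - sigmoid_k(x - h, k)) / (2 * h))
    assert np.allclose(tanh_k_prime(x, k), (tanh_k(x + h, k) - tanh_k(x - h, k)) / (2 * h))
```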

Function plots for \(k = \frac{1}{2}\), \(1\), \(5\):

[figure]

Let's explore how the slope parameter(s), considered as hyperparameter(s) of the training, can improve the quality of training results. Results of the tests are presented below.

Swish

The “swish” activation function is the product of two components: \(x\) and \(σ_k(x)\).

It was introduced by the Google Brain team a few years ago as an alternative to the ReLU activation function, its well-known modification Leaky ReLU, and many other variants.

\(swish_k(x) = xσ_k(x) = \frac{x}{1 + exp(-kx)}\)

Its derivative w.r.t. \(x\):

\(swish’_k(x) = σ_k(x) + kxσ_k(x)(1- σ_k(x))\)
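
The same in code (a sketch with our own naming, consistent with the NumPy snippets above):

```python
# Swish with a slope parameter, matching the formulas above (illustration only).
import numpy as np

def swish_k(x, k=1.0):
    return x / (1.0 + np.exp(-k * x))

def swish_k_prime(x, k=1.0):
    s = 1.0 / (1.0 + np.exp(-k * x))   # sigma_k(x)
    return s + k * x * s * (1.0 - s)
```

For \(k = 1\) this reduces to the standard swish (also known as SiLU).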

Plot of the “swish” function for \(k = 1\), \(5\), \(50\):

[figure]

Results of the swish assessment will be presented in a separate post.

Training NN with \(σ_k(x)\) and \(tanh_k(x)\) activation functions 

Let's start by training NN models with \(σ_k(x)\) and \(tanh_k(x)\) activation functions with the slope parameter set to 1. Then let's run experiments in which the slope parameters of the activation functions are chosen randomly.

Test #1: training with \(k = 1\)

  • Create 30+ NN models sharing the same topology.
  • Initialize each NN's weights randomly.
  • Train the models in parallel using the same mini-batches randomly sampled from the same training dataset, the same optimizer, learning rate and other hyperparameters. The size of a mini-batch is approximately 1/1000th of the size of the training dataset.
  • Stop the learning process either by collecting 5 NN instances matching the acceptance criteria or by hitting the maximum number of training cycles (forward pass -> backward pass -> weight update): 250,000. A sketch of this loop is given after the list.
  • If the pool of collected NN instances is not empty, back-test a trading model that aggregates trading signals from those NN instances and executes them at the market best bid/offer.
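
Below is a hedged PyTorch sketch of this loop. The topology, the data source, the optimizer settings and the acceptance test are placeholders of our own; the post does not publish them.

```python
# Hedged PyTorch sketch of the Test #1 loop; topology, data, optimizer settings
# and the acceptance test are placeholders, not the post's actual setup.
import torch
import torch.nn as nn

N_MODELS, N_ACCEPT, MAX_CYCLES = 30, 5, 250_000

def make_model():
    # Placeholder topology; sigmoid/tanh activations with slope k = 1,
    # weights initialized randomly by PyTorch's default initializers.
    return nn.Sequential(nn.Linear(32, 64), nn.Tanh(),
                         nn.Linear(64, 32), nn.Sigmoid(),
                         nn.Linear(32, 1))

def sample_mini_batch(batch_size=128):
    # Placeholder data source; the post samples mini-batches of roughly
    # 1/1000th of its FX training dataset.
    return torch.randn(batch_size, 32), torch.randn(batch_size, 1)

def meets_acceptance_criteria(model):
    # Placeholder: the post does not specify its acceptance criteria.
    return False

models = [make_model() for _ in range(N_MODELS)]
optimizers = [torch.optim.Adam(m.parameters(), lr=1e-3) for m in models]
loss_fn = nn.MSELoss()

accepted = []
for cycle in range(MAX_CYCLES):
    x, y = sample_mini_batch()          # same mini-batch for all models
    for model, opt in zip(models, optimizers):
        opt.zero_grad()
        loss = loss_fn(model(x), y)     # forward pass
        loss.backward()                 # backward pass
        opt.step()                      # weight update
    accepted = [m for m in models if meets_acceptance_criteria(m)]
    if len(accepted) >= N_ACCEPT:
        break
# The accepted models would then feed the back-test described above.
```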

Test #2: training with k selected randomly

The only difference from the previous test is that, when creating the NN models, we not only initialize the NN weights randomly but also randomly choose a slope parameter value for each activation function of the NN(s).
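
A hedged sketch of how this could be wired in with PyTorch modules (our own illustration; the sampling range for k is an assumption, the post does not state it):

```python
# Activation modules whose slope k is drawn randomly at construction time
# (illustration only; the sampling range is our assumption).
import random
import torch
import torch.nn as nn

class SigmoidK(nn.Module):
    def __init__(self, k_range=(0.5, 5.0)):
        super().__init__()
        self.k = random.uniform(*k_range)   # fixed random slope for this activation
    def forward(self, x):
        return torch.sigmoid(self.k * x)    # sigma_k(x) = sigmoid(k * x)

class TanhK(nn.Module):
    def __init__(self, k_range=(0.5, 5.0)):
        super().__init__()
        self.k = random.uniform(*k_range)
    def forward(self, x):
        return torch.tanh(self.k * x)       # tanh_k(x) = tanh(k * x)

def make_model_random_slopes():
    # Same placeholder topology as in the Test #1 sketch, with randomized slopes.
    return nn.Sequential(nn.Linear(32, 64), TanhK(),
                         nn.Linear(64, 32), SigmoidK(),
                         nn.Linear(32, 1))
```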

Results

We run our tests on FX market data. Data from 2010 to 2017 are used for NN training, while data from 2018 to 2024 are reserved for a true out-of-sample run.

Results of NN training for tests #1 and #2:

[figure]


Back-test results for test #2:

[figure]


Back-test results for test #2:

[figure]

Summary

It is worth experimenting with different activation functions when doing NN trading model research. As shown above, even such a minor modification of the standard activation functions turned out to be surprisingly helpful.