The activation function is responsible for returning a result from a set of values, usually in the range 0 to 1. The whole process is based on taking the sum of each input value multiplied by its weight, adding a certain bias, and then applying an activation function. There is a wide variety of such functions, and some of them will be explained in this blog. They share certain things in common, such as having simple derivatives, which saves computation and training time.

## Binary step Function

The binary step function is one of the simplest activation functions: if the result of the summation mentioned above is greater than or equal to 0, the output is 1; if it is less than 0, the output is 0.

`f(x) = 1, x >= 0`

`f(x) = 0, x < 0`
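The piecewise definition above can be sketched in a few lines of Python:

```python
def binary_step(x):
    """Binary step activation: 1 if the weighted sum is >= 0, else 0."""
    return 1 if x >= 0 else 0

print(binary_step(2.5))   # 1
print(binary_step(-0.1))  # 0
```

Because the output jumps between exactly two values, this function is only useful for simple binary decisions and has a zero derivative almost everywhere, so it cannot be trained with gradient descent.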

## Linear activation function

The linear activation function is one in which the activation is proportional to the input.

The function does nothing to the weighted sum of the input, it simply returns the value given to it.
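A minimal sketch of this identity-like behavior, with an optional proportionality constant `a` added here for illustration:

```python
def linear(x, a=1.0):
    """Linear activation: output is proportional to the input (identity when a=1)."""
    return a * x

print(linear(3.0))       # 3.0
print(linear(3.0, a=2))  # 6.0
```

Since the derivative is the constant `a`, stacking layers with only linear activations collapses to a single linear transformation, which is why nonlinear activations are needed.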

## Sigmoid activation function

The sigmoid function transforms the entered values to a (0,1) scale, where high values tend asymptotically to 1 and very low values tend asymptotically to 0.

`f(x) = 1 / (1 + e^(-x))`

It has some features like:

- Saturates and kills the gradient.
- Slow convergence.
- It is not centered at zero.
- It is bounded between 0 and 1.
- Good performance in the last layer.

The gradient values are only significant in the range -3 to 3, and the graph gets much flatter outside that region.

It implies that for values greater than 3 or less than -3, the function will have very small gradients. As the gradient value approaches zero, the network ceases to learn and suffers from the *Vanishing gradient* problem.
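A small sketch of the sigmoid and its derivative makes the saturation visible; the derivative `s * (1 - s)` follows directly from the formula above:

```python
import math

def sigmoid(x):
    """Sigmoid activation: squashes any input into the open interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))

def sigmoid_grad(x):
    # The derivative can be expressed through the function's own output.
    s = sigmoid(x)
    return s * (1.0 - s)

print(sigmoid_grad(0))  # 0.25, the maximum of the gradient
print(sigmoid_grad(5))  # ~0.0066, the gradient has nearly vanished
```

Outside roughly [-3, 3] the gradient is close to zero, which is exactly the vanishing gradient behavior described above.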

## Tangent Hyperbolic function

The hyperbolic tangent function transforms the values entered to a scale (-1,1), where high values tend asymptotically to 1 and very low values tend asymptotically to -1.

`f(x) = (e^x - e^(-x)) / (e^x + e^(-x))`

Features of tanh:

- Saturates and kills the gradient.
- Slow convergence.
- Centered at 0.
- It is bounded between -1 and 1.
- It is used to decide between one option and the opposite.
- Good performance in recurrent networks.

It also faces the vanishing gradient problem, similar to the sigmoid activation function. However, the gradient of the tanh function is much steeper than that of the sigmoid function.
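A quick sketch comparing the two gradients at zero illustrates the steepness claim:

```python
import math

def tanh(x):
    """Hyperbolic tangent activation, computed from its exponential definition."""
    return (math.exp(x) - math.exp(-x)) / (math.exp(x) + math.exp(-x))

def tanh_grad(x):
    # d/dx tanh(x) = 1 - tanh(x)^2
    return 1.0 - tanh(x) ** 2

def sigmoid_grad(x):
    s = 1.0 / (1.0 + math.exp(-x))
    return s * (1.0 - s)

print(tanh_grad(0))     # 1.0
print(sigmoid_grad(0))  # 0.25 -> tanh's gradient is 4x steeper at the origin
```

Both gradients still collapse toward zero for large |x|, so tanh shares the saturation problem even though it is zero-centered.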

## ReLU activation function

The ReLU function, which stands for Rectified Linear Unit, transforms the values entered by zeroing out the negative values and passing the positive values through unchanged.

`f(x) = max(0, x)`

`f(x) = 0, x < 0`

`f(x) = x, x >= 0`

Features of ReLU:

- Sparse activation: a neuron only activates if its input is positive.
- Not bounded.
- Can make the gradient value zero. Because of this, during backpropagation the weights and biases of some neurons are not updated, which can create dead neurons that never get activated.
- Performs well with images.
- Good performance in convolutional networks.

Since only a certain number of neurons are activated, the ReLU function is far more computationally efficient when compared to the sigmoid and tanh functions.
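ReLU and its gradient are nearly trivial to implement, which is part of why it is so cheap compared to sigmoid and tanh:

```python
def relu(x):
    """ReLU activation: zero for negative inputs, identity for positive ones."""
    return max(0.0, x)

def relu_grad(x):
    # Gradient is exactly 0 for negative inputs -- the source of "dead" neurons.
    return 1.0 if x > 0 else 0.0

print(relu(3.5))       # 3.5
print(relu(-2.0))      # 0.0
print(relu_grad(-2.0)) # 0.0, so this neuron receives no weight update
```

No exponentials are involved, so both the forward and backward pass reduce to a single comparison.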

## SoftMax activation function

The Softmax function transforms the outputs into a representation in the form of probabilities, such that the sum of all the output probabilities equals 1.

It is most commonly used as an activation function for the last layer of the neural network in the case of multi-class classification.

The softmax activation function is given by `f(x_i) = e^(x_i) / Σ_j e^(x_j)`, where **i** is the class index and the sum over **j** runs over all the classes.

Features of the SoftMax function:

- It is used when we want to have a representation in the form of probabilities.
- It is bounded between 0 and 1.
- Highly differentiable.
- It is used to normalize multiclass type.
- Good performance in the last layers.
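A minimal sketch of softmax over a list of raw scores; subtracting the maximum before exponentiating is a standard numerical-stability trick and does not change the result:

```python
import math

def softmax(scores):
    """Softmax: turn raw scores into probabilities that sum to 1."""
    m = max(scores)                            # for numerical stability
    exps = [math.exp(x - m) for x in scores]
    total = sum(exps)
    return [e / total for e in exps]

probs = softmax([2.0, 1.0, 0.1])
print(probs)       # largest score gets the largest probability
print(sum(probs))  # sums to 1 (up to floating point)
```

In a multi-class classifier, the index of the largest probability is taken as the predicted class.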