In short: do not use sigmoid or tanh as the activation function. The reason for using ReLU as the activation function instead: it accelerates convergence.
This is because sigmoid and tanh saturate. What does "saturated" mean? My personal understanding: look at the curves of these two functions and of their derivatives. The derivative curves are shaped like an inverted bowl, i.e. the further the input is from zero, the smaller the corresponding derivative becomes. The ReLU derivative, by contrast, is constant for inputs greater than 0. So during backpropagation, ReLU can actually pass the gradient through to the earlier layers.
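To make the saturation argument concrete, here is a minimal Python sketch (not from the original article) that evaluates the derivatives of sigmoid, tanh, and ReLU at a few points; the sample inputs are illustrative choices:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def sigmoid_deriv(x):
    s = sigmoid(x)
    return s * (1.0 - s)            # peaks at 0.25 at x=0, shrinks toward the tails

def tanh_deriv(x):
    return 1.0 - math.tanh(x) ** 2  # peaks at 1 at x=0, shrinks toward the tails

def relu_deriv(x):
    return 1.0 if x > 0 else 0.0    # constant 1 for every positive input

for x in (0.0, 2.0, 5.0):
    print(f"x={x}: sigmoid'={sigmoid_deriv(x):.4f}, "
          f"tanh'={tanh_deriv(x):.4f}, relu'={relu_deriv(x):.1f}")
```

Already at x = 5 the sigmoid and tanh derivatives are nearly zero (the "inverted bowl" tails), while the ReLU derivative is still 1, which is why the gradient survives backpropagation.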
ReLU (rectified linear unit) replaces the sigmoid function for activating neurons.
x = -10:0.001:10;
relu = max(0, x);
% A piecewise function can be expressed like this:
% y = sqrt(x).*(x>=0&x<4) + 2*(x>=4&x<6) + (5-x/2).*(x>=6&x<8) + 1*(x>=8);
reluDer = 0.*(x<0) + 1.*(x>=0);
figure;
plot(x, relu, 'r', x, reluDer, 'b--');
title('ReLU function max(0, x) (solid line) and its derivative 0/1 (dashed line)');
legend('ReLU function', 'ReLU derivative');
set(gcf, 'NumberTitle', 'off');
set(gcf, 'Name', 'ReLU function (solid line) and its derivative (dashed line)');
It can be seen that ReLU is hard-saturated for x < 0. Since the derivative for x > 0 is 1, ReLU keeps the gradient from decaying for x > 0, which alleviates the vanishing-gradient problem. However, as training progresses, some of the inputs may fall into the hard-saturation region, so the corresponding weights can no longer be updated. This phenomenon is called "neuron death" (the dying ReLU problem).
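The dying-ReLU effect above can be sketched in a few lines of Python (my own illustration, not from the article): a neuron whose pre-activation is negative for every input receives zero gradient through ReLU, so gradient descent never changes its weight. The weight, bias, and learning-rate values here are hypothetical.

```python
def relu_grad(z):
    # Derivative of max(0, z): 1 for z > 0, else 0
    return 1.0 if z > 0 else 0.0

w, b = 0.5, -100.0           # hypothetical: large negative bias keeps z < 0
inputs = [1.0, 2.0, 3.0]
lr = 0.1

for x in inputs:
    z = w * x + b            # pre-activation, always negative here
    upstream = 1.0           # pretend gradient arriving from the next layer
    grad_w = upstream * relu_grad(z) * x
    w -= lr * grad_w         # grad_w is 0, so w never moves

print(w)                     # still 0.5: the neuron is "dead"
```

No matter how many updates run, the weight stays at its initial value, which is exactly the "weights can no longer be updated" failure described above.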
Another frequently criticized problem is that ReLU's output is biased: its mean activation is greater than zero. This output shift, together with neuron death, affects network convergence.