TensorFlow supports parallel training of deep learning models across multiple GPUs or machines, but before looking at how that is done, it helps to understand the principles of parallel training. In general, the parallel training of deep learning models can be organized in two ways: data parallelism and model parallelism. This section introduces the two approaches and compares their strengths and weaknesses. First, consider the training process of a deep learning model without any parallelism, shown in the figure below. In each training iteration, the current parameter values and a mini-batch of data are fed to the model, forward propagation computes the predictions, backpropagation computes the gradients, and the parameters are updated. The updated parameters are then used, together with the next mini-batch, in the following training iteration.
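As a concrete reference for this loop, here is a minimal sketch of one such training iteration on a single device, written against the TensorFlow 2 eager API; the model, loss, and random batch are placeholders chosen only for illustration.

```python
import tensorflow as tf

# Placeholder model, optimizer, and loss for a single training iteration.
model = tf.keras.Sequential([tf.keras.layers.Dense(1)])
optimizer = tf.keras.optimizers.SGD(learning_rate=0.01)
loss_fn = tf.keras.losses.MeanSquaredError()

x_batch = tf.random.normal([32, 10])   # one mini-batch of features
y_batch = tf.random.normal([32, 1])    # one mini-batch of labels

# One iteration: forward pass, gradient computation, parameter update.
with tf.GradientTape() as tape:
    predictions = model(x_batch)            # forward propagation with current parameters
    loss = loss_fn(y_batch, predictions)    # loss on this mini-batch
grads = tape.gradient(loss, model.trainable_variables)            # backpropagation
optimizer.apply_gradients(zip(grads, model.trainable_variables))  # update parameters
```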
When a deep learning model is trained in parallel, data parallelism replicates this iterative process across different devices (GPUs or CPUs), with each device working on its own mini-batch, while model parallelism splits the computation of a single iteration into sub-computations that are executed on different devices. A rough sketch of the model-parallel idea follows below.
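The sketch below pins two halves of a single forward pass to different devices with tf.device; the device names and tensor shapes are assumptions for illustration, and the data-parallel approach is sketched in code later in this section.

```python
import tensorflow as tf

# Sketch of model parallelism: one forward pass is split across two devices.
# The device names are assumptions; use whatever
# tf.config.list_logical_devices() reports on your machine.
w1 = tf.random.normal([64, 128])
w2 = tf.random.normal([128, 10])
x = tf.random.normal([8, 64])          # one mini-batch of inputs

with tf.device('/CPU:0'):
    h = tf.nn.relu(tf.matmul(x, w1))   # first part of the computation graph
with tf.device('/GPU:0'):
    y = tf.matmul(h, w2)               # second part runs on a different device
```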
Data parallelism in synchronous mode
The data-parallel synchronous mode is shown in the figure below.
As Figure 14-7 shows, at the beginning of each iteration all devices first read the current parameter values and receive their own mini-batch of data. Each device then runs forward propagation to compute predictions and backpropagation to obtain the gradient Δp of the parameters on its mini-batch. Because the training data differs, the resulting gradients may differ even though all devices start from the same parameter values. After every device has finished backpropagation, the shared parameter server averages the gradients across devices and uses the averaged gradient to update the parameters, completing one round of iteration.
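The following sketch simulates that synchronous update on a single machine: two mini-batches stand in for the data of two devices, their gradients are computed from the same parameter values, and the averaged gradient is applied once, as the parameter server would do. The model, shapes, and learning rate are placeholders; in practice TensorFlow packages this pattern in tf.distribute.MirroredStrategy.

```python
import tensorflow as tf

# Simulated synchronous data parallelism: each "device" computes gradients
# on its own mini-batch from the same parameters; the average is applied once.
model = tf.keras.Sequential([tf.keras.layers.Dense(1)])
optimizer = tf.keras.optimizers.SGD(learning_rate=0.01)
loss_fn = tf.keras.losses.MeanSquaredError()

# Two mini-batches, standing in for the data assigned to two devices.
batches = [(tf.random.normal([32, 10]), tf.random.normal([32, 1])) for _ in range(2)]

per_device_grads = []
for x, y in batches:
    with tf.GradientTape() as tape:
        loss = loss_fn(y, model(x))   # forward pass on this device's mini-batch
    per_device_grads.append(tape.gradient(loss, model.trainable_variables))

# The "parameter server" step: average the gradients across devices,
# then perform a single shared parameter update.
avg_grads = [tf.reduce_mean(tf.stack(gs), axis=0) for gs in zip(*per_device_grads)]
optimizer.apply_gradients(zip(avg_grads, model.trainable_variables))
```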
Data parallelism in asynchronous mode
Synchronous data parallelism can also be changed to asynchronous. The figure below illustrates data parallelism in asynchronous mode.
As can be seen from the figure, in each iteration the different devices read the latest parameter values and, based on the parameters they read and the mini-batch they received, each independently runs forward propagation to compute predictions and backpropagation to compute the gradient of the parameters p on its mini-batch. Unlike synchronous mode, the parameter update in asynchronous mode is also performed by each device on its own, even though all devices read the parameters from the same place. In asynchronous mode the devices are therefore completely independent: asynchronous mode can simply be viewed as multiple copies of the single-machine training process, where each copy trains on different data and updates the shared parameters based on the gradients it computes itself. Note, however, that different devices take different amounts of time to process the same model, so they read the parameters at different moments, and the parameter values a device uses may already be out of date by the time its update is applied. This problem is generally referred to as stale or noisy asynchronous gradients. Its more serious consequence is that a deep learning model trained in asynchronous mode may fail to reach as good a result as one trained synchronously.
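The stale-gradient effect can be seen in the toy simulation below, where two "workers" update a single scalar parameter asynchronously; the loss, learning rate, and starting value are arbitrary choices for illustration.

```python
import tensorflow as tf

# Toy simulation of stale gradients in asynchronous mode, using one scalar.
w = tf.Variable(10.0)             # shared parameter, e.g. held on a parameter server
loss = lambda x: x ** 2           # toy loss with its minimum at w = 0

def gradient_at(value):
    """Gradient of the toy loss at a given (possibly stale) parameter value."""
    with tf.GradientTape() as tape:
        tape.watch(value)
        l = loss(value)
    return tape.gradient(l, value)

# Both workers read the same parameter value before either has updated it.
snapshot_a = tf.constant(w.numpy())
snapshot_b = tf.constant(w.numpy())

# Worker A computes its gradient and updates the shared parameter first.
w.assign_sub(0.4 * gradient_at(snapshot_a))   # w: 10.0 -> 2.0

# Worker B then applies a gradient computed from the now-stale snapshot,
# overshooting past the minimum instead of refining the current value.
w.assign_sub(0.4 * gradient_at(snapshot_b))   # w: 2.0 -> -6.0

print(w.numpy())  # farther from the optimum than a synchronized update would be
```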