More concretely, we introduce the FederatedAveraging algorithm, which combines local stochastic gradient descent (SGD) on each client with a server that performs model averaging. We carry out extensive experiments on this algorithm, demonstrating that it is robust to unbalanced and non-IID data distributions, and that it reduces the number of communication rounds needed to train deep networks on decentralized data by orders of magnitude.
Federated Learning
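To make the client/server split concrete, here is a minimal NumPy sketch of a FederatedAveraging-style round: each client runs a little local SGD from the current global model, and the server averages the returned models weighted by client data size. The toy linear model, learning rate, and helper names are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def client_update(weights, data, lr=0.1, local_epochs=1):
    # One round of local SGD on a single client (toy linear model);
    # `data` is a list of (x, y) pairs. Illustrative sketch only.
    w = weights.copy()
    for _ in range(local_epochs):
        for x, y in data:
            grad = (w @ x - y) * x      # gradient of 0.5 * (w.x - y)^2
            w -= lr * grad
    return w

def federated_averaging(global_w, clients, rounds=10):
    # Server loop: broadcast the global model, collect locally trained
    # models, and average them weighted by each client's data size.
    for _ in range(rounds):
        local_ws = [client_update(global_w, data) for data in clients]
        sizes = np.array([len(data) for data in clients], dtype=float)
        global_w = sum(w * (n / sizes.sum()) for w, n in zip(local_ws, sizes))
    return global_w

# Toy usage: two clients holding differently drawn data.
rng = np.random.default_rng(0)
true_w = np.array([2.0, -1.0])
clients = [[(x, x @ true_w) for x in rng.standard_normal((20, 2))]
           for _ in range(2)]
print(federated_averaging(np.zeros(2), clients, rounds=20))
```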
[3] Zhang, S., Choromanska, A.E., LeCun, Y.: Deep learning with elastic averaging SGD. Adv. Neural Inf. Process. Syst. 28, 685–693 (2015). https://papers.nips.cc/paper/5761-deep-learning-with-elastic-averaging-sgd.pdf
Zhang, X., Trmal, J., Povey, D., Khudanpur, S.: Improving deep neural network acoustic models using generalized maxout networks. In: ...
As for distributed training, we use the elastic averaging stochastic gradient descent (EASGD) algorithm to reduce communication. On 512 processes, we achieve a parallel efficiency of 81.01% with a communication period of τ = 8. In particular, a decentralized implementation of the distributed swFLOW system is presented ...
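The role of that communication period is easy to sketch: each worker mostly runs plain local SGD, and every τ steps it exchanges an elastic pull with a shared center variable, so a larger τ means fewer synchronisations and less communication. The NumPy fragment below is an illustrative reading of the EASGD rule from Zhang et al. (2015); the learning rate, elasticity strength rho, and function names are assumptions made for the example.

```python
import numpy as np

def easgd_exchange(x_i, x_center, grad_fn, lr=0.01, rho=0.1):
    # One elastic-averaging exchange for a single worker: the worker is
    # pulled toward the center variable and the center toward the worker.
    g = grad_fn(x_i)
    x_i_new = x_i - lr * (g + rho * (x_i - x_center))
    x_center_new = x_center + lr * rho * (x_i - x_center)
    return x_i_new, x_center_new

def worker_loop(x_i, x_center, grad_fn, steps=100, tau=8, lr=0.01, rho=0.1):
    # Local SGD with an elastic exchange every `tau` steps; between
    # exchanges the worker updates only its own copy of the parameters.
    for t in range(1, steps + 1):
        if t % tau == 0:
            x_i, x_center = easgd_exchange(x_i, x_center, grad_fn, lr, rho)
        else:
            x_i = x_i - lr * grad_fn(x_i)
    return x_i, x_center

# Toy usage: minimise ||x||^2 from a random start.
x0 = np.ones(4)
print(worker_loop(x0.copy(), x0.copy(), grad_fn=lambda x: 2 * x))
```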
Distributed deep learning, with a focus on distributed training, using Keras and Apache Spark (cerndb/dist-keras).
That region in the input image is called the local receptive field for the hidden neuron. It's a little window on the input pixels. Each connection learns a weight. And the hidden neuron learns an overall bias as well. You can think of that particular hidden neuron as learning to analyze it...
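A tiny NumPy sketch makes the idea concrete: one hidden neuron sees only a small window of pixels, with one weight per connection plus a single bias. The 28x28 input and 5x5 window sizes below are assumptions chosen for the example, not taken from the excerpt above.

```python
import numpy as np

rng = np.random.default_rng(0)
image = rng.random((28, 28))           # input pixel intensities
weights = rng.standard_normal((5, 5))  # one weight per connection
bias = rng.standard_normal()           # the neuron's overall bias

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def hidden_activation(row, col):
    # Activation of the hidden neuron whose local receptive field is the
    # 5x5 window of the image with top-left corner at (row, col).
    window = image[row:row + 5, col:col + 5]
    return sigmoid(np.sum(weights * window) + bias)

print(hidden_activation(0, 0))  # the neuron analysing the top-left window
```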
How can we address the learning slowdown? It turns out that we can solve the problem by replacing the quadratic cost with a different cost function, known as the cross-entropy. To understand the cross-entropy, let's move a little away from our super-simple toy model. We'll suppose instea...
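For a sigmoid output a and target y, the cross-entropy cost averaged over n training examples is C = -(1/n) Σ_x [ y ln a + (1 - y) ln(1 - a) ]. The snippet below is a small sketch of that cost, not the book's reference code; the nan_to_num guard for the 0·log(0) corner case is an implementation choice assumed for the example.

```python
import numpy as np

def cross_entropy_cost(a, y):
    # C = -(1/n) * sum_x [ y*ln(a) + (1-y)*ln(1-a) ] over n examples,
    # where `a` holds sigmoid outputs and `y` the desired 0/1 targets.
    return np.mean(np.nan_to_num(-y * np.log(a) - (1 - y) * np.log(1 - a)))

# The cost is large when the output is confidently wrong ...
print(cross_entropy_cost(np.array([0.98]), np.array([0.0])))
# ... and small when the output is close to the target.
print(cross_entropy_cost(np.array([0.02]), np.array([0.0])))
```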
What's going on here? Is it that the expanded or extra fully-connected layers really don't help with MNIST? Or might it be that our network has the capacity to do better, but we're going about learning the wrong way? For instance, maybe we could use stronger regularization techniques ...