Asynchronous Stochastic Gradient Descent with Diminishing Stepsize

Introduction

In recent years, artificial intelligence (AI) has become a hot topic that attracts the attention of many researchers. Many people have started to seek machine learning methods that not only train a model in a distributed environment (distributed learning) but also protect the privacy of the data, for example sensitive data in the medical or financial field, or personal habits (federated learning).


In theory, there are two basic designs for distributed learning: asynchronous and synchronous training algorithms. While a line of work on deep learning has used asynchronous training successfully, the trend has recently shifted towards synchronous large-batch training. Moreover, several approaches to enhancing the privacy guarantees of distributed learning, namely federated learning, including differential privacy [6], secure aggregation [4,5], and parameter exchange under encryption mechanisms [3,9], require synchronization points on the server side to take effect, so synchronous training algorithms have become a practical and flexible choice compared to other architectures.

Based on the above properties of the distributed learning environment, we can conclude that an efficient distributed training algorithm needs to fulfill the following requirements [9,21]:

(i) The distributed learning system needs to ensure that all data from the workers is learned. In other words, reducing the amount of data transferred must not cause data loss, which would inevitably disturb the convergence of the model training.

(ii) The distributed learning system needs to be robust to non-iid data, small batch sizes, and unbalanced data, as well as to a large number of clients and partial client participation.

(iii) Distributed learning systems usually assume that all clients have the same computation power, which makes them hard to deploy or use in reality. In other words, the distributed learning system needs to work flexibly and effectively in a heterogeneous environment.

In this work, we propose a simple strategy that combines the asynchronous SGD algorithm Hogwild! with a distributed learning framework to deal with the requirements above:

(i) Each client or group of clients is allowed equal access to the global model and can update it at will, in order to speed up the convergence of model training.

(ii) The clients are grouped based on computation power, geographical location, and network bandwidth to create a network of computationally comparable devices, which satisfies the fundamental requirements of distributed learning. Therefore, our architecture can be deployed in practice, for example on a network of mobile or IoT devices.

(iii) We apply a new scheme that increases the number of samples per communication round and reduces the stepsize from round to round, so that we eventually boost the convergence rate and reduce the number of communication rounds.

Background

Gradient descent and Stochastic Gradient Descent

We are interested in minimizing a finite sum F(w), i.e., a loss function representing an optimization problem in machine learning, which in full generality takes the form (1):
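Concretely, using the common 1/n averaging convention (the exact scaling is an assumption here), form (1) is

F(w) = \frac{1}{n} \sum_{i=1}^{n} f_i(w) + R(w),    (1)

where f_i is the loss incurred on the i-th training sample.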

The functions f_i, 1 ≤ i ≤ n, are assumed to be convex and to have Lipschitz continuous derivatives with constant L. The function R(w) is referred to as a regularizer and is used in practice to prevent over-fitting or to enforce structural properties of the solution. The most common choices of R(w) are the l_2 regularizer (lambda/2)||w||² or the l_1 regularizer lambda ||w||_1 for some choice of lambda > 0. For simplicity, we assume that R(w) = 0 for all w, so problem (1) reduces to the simpler finite-sum problem (2):
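In the same notation, problem (2) is

\min_{w} F(w) = \frac{1}{n} \sum_{i=1}^{n} f_i(w).    (2)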

A traditional approach is to use gradient descent (GD) [1] to find a globally optimal solution of problem (2). Specifically, the GD algorithm performs the iteration:
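Written out (and numbered (3) here, matching the numbering of the surrounding formulas), the iteration is

w_{t+1} = w_t - \eta_t \nabla F(w_t) = w_t - \frac{\eta_t}{n} \sum_{i=1}^{n} \nabla f_i(w_t).    (3)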

Here, eta_t > 0 is the stepsize.

Nevertheless, common practical problems in machine learning usually involve vast amounts of data, which makes GD impractical for very large-scale machine learning problems, as it needs to process the whole dataset in order to evaluate a single gradient and update the model. An alternative is stochastic gradient descent (SGD) [2,15], a randomized algorithm whose per-iteration computational complexity is independent of n. SGD samples a random index i uniformly from 1 ≤ i ≤ n and performs the update (4):
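In the same notation, update (4) is

w_{t+1} = w_t - \eta_t \nabla f_{i_t}(w_t),    (4)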

where eta_t > 0 is the stepsize at SGD iteration t.

Intuitively speaking, this method works because if i is sampled uniformly at random from indices 1≤ i ≤ n, the update direction is an unbiased estimate of the gradient:
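In expectation over the uniformly sampled index i, the stochastic gradient equals the full gradient:

\mathbb{E}_i[\nabla f_i(w)] = \frac{1}{n} \sum_{i=1}^{n} \nabla f_i(w) = \nabla F(w).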

However, noise introduced by sampling slows down the convergence, and a diminishing sequence of stepsizes eta_t is necessary for the method to converge. The detailed algorithm of SGD is illustrated in Algorithm 1.
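As a rough, minimal Python sketch of what Algorithm 1 does (the gradient oracle grad_f, the initial point w0, and the eta0/(1+t) schedule below are illustrative placeholders, not the exact algorithm):

import numpy as np

def sgd(grad_f, w0, n, eta0=0.1, num_iters=1000, seed=0):
    """SGD with a diminishing stepsize: w_{t+1} = w_t - eta_t * grad f_i(w_t)."""
    rng = np.random.default_rng(seed)
    w = np.array(w0, dtype=float)
    for t in range(num_iters):
        i = int(rng.integers(n))       # sample an index uniformly from {0, ..., n-1}
        eta_t = eta0 / (1 + t)         # diminishing stepsize, roughly eta_t ~ 1/t
        w = w - eta_t * grad_f(i, w)   # unbiased stochastic gradient step
    return w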

Hogwild! algorithm

We now discuss the parallel-processing Hogwild! algorithm [7,16], which takes advantage of modern CPUs by running parallel asynchronous SGD without any locking. The typical computational model is the following: we assume a shared-memory model with M processors in which the global variable w is accessible to all processors. Each processor p, with 1 ≤ p ≤ M, can read the global value w and can write atomic updates to w based on its own computed gradient. The details of Hogwild! are described in Algorithm 2.
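The following is a minimal Python sketch of this shared-memory, lock-free pattern (threads standing in for the M processors, a NumPy array as the shared w; grad_f is a hypothetical per-sample gradient oracle, and Algorithm 2 itself is not reproduced here):

import numpy as np
from threading import Thread

def hogwild(grad_f, w, n, num_threads=4, iters_per_thread=1000, eta=0.01):
    """Lock-free parallel SGD sketch: every thread reads and updates the
    shared vector w without taking any lock, in the spirit of Hogwild!."""
    def worker(seed):
        rng = np.random.default_rng(seed)
        for _ in range(iters_per_thread):
            i = int(rng.integers(n))
            w[:] -= eta * grad_f(i, w)   # in-place update of the shared w, no locking
    threads = [Thread(target=worker, args=(p,)) for p in range(num_threads)]
    for th in threads:
        th.start()
    for th in threads:
        th.join()
    return w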

A major advantage of the iterative formula (4) is that it can be used to track the algorithmic progress and establish convergence rates to the optimal solution. Unfortunately, the progress of asynchronous parallel algorithms cannot be precisely analyzed using the above iterative framework. This is because processors cannot read the actual, fully up-to-date global w from memory: there is no global clock that synchronizes reads and writes among cores, which leads to inconsistent reads of the shared variable.

Consistency with delay \tau [16]:

We say that the shared memory w is consistent with delay \tau if, for all t, the weight \hat{w}_{t} includes the aggregate of the updates up to and including those made during the (t - \tau)-th iteration.

In other words, we can represent \hat{w}_{t} with the max delay \tau as below:
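One way to write this (the index-set notation below is ours, not the exact notation of [16]) is

\hat{w}_{t} = w_{t-\tau} + \sum_{j \in J_t} (w_{j+1} - w_j), \qquad J_t \subseteq \{t-\tau, \dots, t-1\},

i.e., \hat{w}_{t} contains every update up to iteration t - \tau plus possibly some subset of the more recent ones.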

And the formula (4) becomes:
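With the stale iterate \hat{w}_{t} replacing w_t, the recursion is

w_{t+1} = w_t - \eta_t \nabla f_{i_t}(\hat{w}_{t}),

so the stochastic gradient is evaluated at a possibly outdated copy of the shared variable.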

Specifically, the Hogwild! algorithm can access or update a set of coordinates of w_t without any locking that would ensure a conflict-free execution. The reads and writes of distinct cores can overlap in arbitrary ways; for example, while one core updates a subset of the variables, other cores can access or update the same variables.

Downpour SGD

Downpour SGD [8] was invented by researchers at Google as part of their DistBelief framework (the predecessor of TensorFlow), and its variants are still used in today’s distributed TensorFlow framework. It makes use of multiple model replicas that each calculate updates based on a small portion of the data (data parallelism). After the calculation, the updates are sent to a centralized parameter server. The parameter server itself is split into multiple nodes, each holding and updating a small subset of the parameters (model parallelism). This parallelization scheme is very popular to this date, not least because it is a built-in feature of TensorFlow. However, the convergence of this scheme suffers from the fact that the model replicas do not share parameters (weights) or updates with each other, which means that their parameters can diverge. The detailed algorithm of Downpour SGD is illustrated in Algorithm 3.
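To make the pull/push pattern concrete, here is a highly simplified, single-process Python sketch of the idea (the class and function names are ours, not the DistBelief or TensorFlow API, and grad_fn and batch are hypothetical):

import numpy as np

class ParameterShard:
    """One parameter-server node holding and updating a slice of the model."""
    def __init__(self, params, eta=0.01):
        self.params = np.array(params, dtype=float)
        self.eta = eta

    def pull(self):
        return self.params.copy()

    def push(self, grad):
        self.params -= self.eta * grad   # apply a replica's gradient to this shard

def replica_step(shards, grad_fn, batch):
    # A model replica pulls the current parameters from every shard,
    # computes a gradient on its own mini-batch (data parallelism),
    # and pushes the corresponding slice of that gradient back to each
    # shard (model parallelism on the server side).
    w = np.concatenate([s.pull() for s in shards])
    g = grad_fn(w, batch)
    offset = 0
    for s in shards:
        k = s.params.size
        s.push(g[offset:offset + k])
        offset += k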

Averaging Federated Learning

Federated learning (FL), as originally introduced and described in [11,13], is depicted in Figure 4. The participants in FL are a server containing a global model and clients Client_j, 1 ≤ j ≤ n, each having its own dataset that is not shared with the others. The global model w is trained over many rounds. At the i-th round, the server propagates the currently computed global model (weight vector) w_i to a set of chosen clients, denoted S_i. Each chosen client Client_j takes w_i and performs training based on its own dataset, for example using stochastic gradient descent (SGD). This creates a new local model, based on w_i, that is specific to Client_j. A new global model w_{i+1} is then created by the server as an average of the local models once these are received.
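The aggregation step can be written, with n_j the number of samples held by Client_j and w_i^{(j)} the local model it returns, using the data-size weighting of [13]:

w_{i+1} = \sum_{j \in S_i} \frac{n_j}{\sum_{k \in S_i} n_k} \, w_i^{(j)}.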

The original FL requires synchrony between the server and the clients: each client sends a full model back to the server in each round, and each client needs to wait for the next computation round. For large and complicated models, this becomes the main bottleneck due to the asymmetric nature of internet connections and the different computation power of devices [12,14,17,20].

Figure 4 — Averaging Federated Learning

Our proposed asynchronous distributed learning

We want to combine distributed learning [13], Downpour SGD [8], and Hogwild! [16] in a way that takes advantage of the computation power of the devices and guarantees the convergence of the global model:

(i) Clients and the server can work in an asynchronous fashion [8,11,13]. This means that the clients can take full advantage of their computation capacity to train the model received from the server.

(ii) Each client linearly increases its number of samples per round and reduces the stepsize proportionally to 1/t for the strongly convex case and 1/sqrt{t} for the plain convex case [18].

(iii) Based on the Hogwild! algorithm [16], we can use a constant stepsize for all iterations within a communication round while still guaranteeing that the model converges. This not only boosts the convergence speed but also reduces the communication cost between the clients and the server.

(iv) We can extend this method with the DP-SGD method [10] (adding noise to the gradient before sending it back to the server) to protect the privacy of clients.

Algorithm 4 and Algorithm 5 describe our asynchronous distributed learning algorithm, with the decaying stepsize scheme and the linearly increasing number of samples, on the client and server sides.
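Since Algorithms 4 and 5 themselves are not reproduced in this text, the Python sketch below only illustrates the client-side idea: a stepsize that is constant within a round but decays across rounds (1/t for strongly convex, 1/sqrt(t) for plain convex problems), together with a linearly growing number of local SGD samples per round. The function and parameter names (grad_f, eta0, s0) are illustrative, not the exact algorithm.

import numpy as np

def client_round(grad_f, w, n, round_idx, eta0=0.1, s0=32,
                 strongly_convex=True):
    """One communication round on a client: the stepsize is constant inside
    the round, decays across rounds, and the number of local SGD samples
    grows linearly with the round index."""
    rng = np.random.default_rng(round_idx)
    t = round_idx + 1
    eta = eta0 / t if strongly_convex else eta0 / np.sqrt(t)   # 1/t or 1/sqrt(t) decay
    num_samples = s0 * t                                       # linearly increasing sampling
    w = np.array(w, dtype=float)
    for _ in range(num_samples):
        i = int(rng.integers(n))
        w -= eta * grad_f(i, w)                                # local SGD step
    return w   # sent back to the server, which aggregates asynchronously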

Some experiments

Convergence rate of strongly convex problem (phishing dataset)

Convergence rate of plain convex problem (phishing dataset)

Test error of asynchronous distributed learning (MNIST dataset)

Conclusion

This report shows technical ways to take advantage of some state-of-the-art algorithms for designing distributed machine learning, such as Hogwild!, Downpour SGD, and Averaging Federated Learning. Our goal is to design a distributed learning system that can (i) take advantage of all clients’ computation capacity and (ii) work in an asynchronous fashion. Moreover, by using the decaying stepsize [16] and linearly increasing the number of samples per round, we can gradually reduce the communication cost. We conducted experiments for all types of problems: strongly convex, plain convex, and non-convex. The experiments show that our proposed algorithm works well in practice and achieves a good convergence rate compared to other methods.

References

[1] Augustin Cauchy. “Méthode générale pour la résolution des systèmes d’équations simultanées”. In: Comp. Rend. Sci. Paris 25 (1847), pp. 536–538.

[2] Herbert Robbins and Sutton Monro. “A stochastic approximation method”. In: The Annals of Mathematical Statistics (1951).

[3] R. L. Rivest, L. Adleman, and M. L. Dertouzos. “On Data Banks and Privacy Homomorphisms”. In: Foundations of Secure Computation, Academia Press (1978).

[4] Wenliang Du, Yunghsiang Sam Han, and Shigang Chen. “Privacy-Preserving Multivariate Statistical Analysis: Linear Regression and Classification”. In: Proceedings of the 2004 SIAM International Conference on Data Mining. Society for Industrial and Applied Mathematics (2004).

[5] Dan Bogdanov, Sven Laur, and Jan Willemson. “A Framework for Fast Privacy-Preserving Computations”. In: Proceedings of the 13th European Symposium on Research in Computer Security (2008).

[6] Cynthia Dwork. “Differential Privacy: A Survey of Results”. In: Proceedings of the 5th International Conference on Theory and Applications of Models of Computation (2008).

[7] Benjamin Recht et al. “Hogwild: A lock-free approach to parallelizing stochastic gradient descent”. In: Advances in Neural Information Processing Systems (2011).

[8] Jeffrey Dean et al. “Large scale distributed deep networks”. In: Advances in Neural Information Processing Systems (2012), pp. 1223–1231.

[9] Qingchen Zhang, Laurence T. Yang, and Zhikui Chen. “Privacy Preserving Deep Computation Model on Cloud for Big Data Feature Learning”. In: IEEE Transactions on Computers (2015).

[10] Martin Abadi et al. “Deep learning with differential privacy”. In: Proceedings of the 2016 ACM SIGSAC Conference on Computer and Communications Security (2016), pp. 308–318.

[11] Jianmin Chen et al. “Revisiting distributed synchronous SGD”. In: ICLR Workshop Track (2016).

[12] Jakub Konečný et al. “Federated optimization: Distributed machine learning for on-device intelligence”. In: arXiv preprint arXiv:1610.02527 (2016).

[13] H. Brendan McMahan et al. “Federated Learning of Deep Networks using Model Averaging”. In: ICLR Workshop Track (2016).

[14] Kevin Hsieh et al. “Gaia: Geo-Distributed Machine Learning Approaching LAN Speeds”. In: 14th USENIX Symposium on Networked Systems Design and Implementation (NSDI 17) (2017).

[15] Yurii Nesterov. Lectures on Convex Optimization. Vol. 137. Springer, 2018.

[16] Lam M. Nguyen et al. “SGD and Hogwild! convergence without the bounded gradients assumption”. In: arXiv preprint arXiv:1802.03801 (2018).

[17] Yang Chen, Xiaoyan Sun, and Yaochu Jin. “Communication-Efficient Federated Deep Learning with Asynchronous Model Update and Temporally Weighted Aggregation”. In: arXiv preprint (2019). URL: https://arxiv.org/pdf/1903.07424.pdf.

[18] Rong Ge et al. “The Step Decay Schedule: A Near Optimal, Geometrically Decaying Learning Rate Procedure For Least Squares”. In: Advances in Neural Information Processing Systems (2019), pp. 14951–14962.

[19] Tian Li et al. “Federated Learning: Challenges, Methods, and Future Directions”. In: arXiv preprint arXiv:1908.07873 (2019).

[20] Luping Wang, Wei Wang, and Bo Li. “CMFL: Mitigating Communication Overhead for Federated Learning”. In: IEEE International Conference on Distributed Computing Systems (2019).

[21] Qiang Yang et al. “Federated machine learning: Concept and applications”. In: ACM Transactions on Intelligent Systems and Technology (TIST) 10.2 (2019), p. 12.
