Towards Parallel Gradient Computations and Deep Decentralized Optimization: PETRA and A2CiD2

Speaker: Edouard Oyallon

Location: 60 Fifth Avenue, Room 527
Videoconference link: https://nyu.zoom.us/j/91051481394

Date: Tuesday, June 11, 2024

Reversible architectures have demonstrated performance comparable to non-reversible architectures in deep learning, with particular advantages for memory savings and generative modeling. I will introduce PETRA, a novel method for parallelizing gradient computations that addresses challenges in training deep models. PETRA enables effective model parallelism by computing stages independently on different devices and communicating only activations and gradients; this decouples the forward and backward passes and eliminates weight stashing. We demonstrate its effectiveness on ImageNet with ResNet models, achieving accuracies comparable to backpropagation.

Additionally, I will present A2CiD2, a principled algorithm that accelerates communication in decentralized deep learning through an asynchronous approach. Traditional synchronous methods face significant communication bottlenecks at scale. A2CiD2 is a randomized, gossip-based optimization algorithm that leverages a continuous local momentum term, allowing workers to process mini-batches without interruption. The method accelerates communication at no additional cost and outperforms previous decentralized baselines. Theoretical analysis and empirical results on ImageNet (on A100 GPUs) show substantial reductions in communication costs, demonstrating the effectiveness of A2CiD2 on both poorly and well-connected networks.
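
To give a flavor of the decoupled forward/backward idea behind PETRA, here is a minimal, hypothetical sketch (not the authors' implementation): two linear "stages" exchange only activations and gradients through queues, and the first stage applies incoming gradients using its current weights rather than a stashed copy. The layer sizes, learning rate, and toy regression task are all illustrative assumptions.

import numpy as np
from collections import deque

rng = np.random.default_rng(0)
W1 = rng.normal(scale=0.1, size=(4, 8))   # stage 1 weights (hypothetical sizes)
W2 = rng.normal(scale=0.1, size=(8, 1))   # stage 2 weights
lr = 0.05
acts, grads = deque(), deque()            # the only inter-stage traffic

for t in range(300):
    # Stage 1 forward on a fresh mini-batch; only the activation moves downstream.
    x = rng.normal(size=(16, 4))
    y = x @ np.ones((4, 1))               # toy regression target
    acts.append((x, x @ W1, y))

    # Stage 2 runs one step behind: forward + backward on the oldest activation.
    if len(acts) > 1:
        x_in, h_in, y_in = acts.popleft()
        err = (h_in @ W2 - y_in) / len(y_in)
        grads.append((x_in, err @ W2.T))  # gradient sent back upstream
        W2 -= lr * h_in.T @ err           # stage 2 updates immediately

    # Stage 1 backward uses its *current* W1: no stashed copy of the weights
    # that produced the original activation.
    if grads:
        x_old, g_h = grads.popleft()
        W1 -= lr * x_old.T @ g_h

In this toy version the input is round-tripped through the queues for convenience; a real pipeline would keep (or, with reversible layers, reconstruct) each stage's input locally.

Similarly, the asynchronous flavor of A2CiD2 can be illustrated with a minimal gossip sketch. This is not the paper's algorithm or code: it only shows workers taking local momentum-SGD steps on a shared toy quadratic objective while random pairs of workers occasionally average their parameters. The worker count, dimensions, step sizes, and pairing rule are assumptions for illustration.

import numpy as np

rng = np.random.default_rng(1)
n_workers, dim, lr, beta = 8, 10, 0.05, 0.9
target = rng.normal(size=dim)                 # common minimizer of the toy loss

params = rng.normal(size=(n_workers, dim))    # one parameter vector per worker
momentum = np.zeros_like(params)

for step in range(500):
    # Local step: every worker processes its own mini-batch without waiting.
    grads = params - target + 0.1 * rng.normal(size=params.shape)  # noisy gradient
    momentum = beta * momentum + grads
    params -= lr * momentum

    # Asynchronous communication: a random pair of workers averages parameters.
    i, j = rng.choice(n_workers, size=2, replace=False)
    mean = 0.5 * (params[i] + params[j])
    params[i] = params[j] = mean

consensus_error = np.linalg.norm(params - params.mean(axis=0))
print(f"distance to target: {np.linalg.norm(params.mean(axis=0) - target):.3f}, "
      f"consensus error: {consensus_error:.3f}")

The pairwise averaging stands in for the randomized gossip communication; A2CiD2's contribution is to accelerate this mixing with a continuous local momentum term, which the talk will detail.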