Research article · DOI: 10.5555/3571885.3571899

HammingMesh: a network topology for large-scale deep learning

Published: 18 November 2022

Abstract

Numerous microarchitectural optimizations unlocked tremendous processing power for deep neural networks that in turn fueled the AI revolution. With the exhaustion of such optimizations, the growth of modern AI is now gated by the performance of training systems, especially their data movement. Instead of focusing on single accelerators, we investigate data-movement characteristics of large-scale training at full system scale. Based on our workload analysis, we design HammingMesh, a novel network topology that provides high bandwidth at low cost with high job scheduling flexibility. Specifically, HammingMesh can support full bandwidth and isolation to deep learning training jobs with two dimensions of parallelism. Furthermore, it also supports high global bandwidth for generic traffic. Thus, HammingMesh will power future large-scale deep learning systems with extreme bandwidth requirements.
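
The abstract's claim of full-bandwidth, isolated allocations along two dimensions of parallelism stems from HammingMesh's structure: small 2D-mesh accelerator boards whose board rows and board columns are each joined by their own switch fabric. The sketch below is a minimal toy model of that idea, not the paper's reference construction: each row/column fabric is collapsed into a single logical switch node, and all sizes (R, C, a) and node names are illustrative assumptions.

```python
# Illustrative sketch only: a toy model of a HammingMesh-style fabric in which
# accelerator boards (each an a x a 2D mesh) form an R x C grid, and every
# board row / board column is joined by its own switch plane. Parameters and
# the graph representation are assumptions, not the paper's construction.
from collections import defaultdict

def build_toy_hammingmesh(R=4, C=4, a=2):
    """Return an undirected adjacency list over nodes:
       ('acc', r, c, i, j)  -- accelerator (i, j) on board (r, c)
       ('row_sw', r)        -- switch plane joining all boards in board-row r
       ('col_sw', c)        -- switch plane joining all boards in board-column c
    """
    adj = defaultdict(set)

    def link(u, v):
        adj[u].add(v)
        adj[v].add(u)

    for r in range(R):
        for c in range(C):
            # Intra-board 2D mesh links between neighboring accelerators.
            for i in range(a):
                for j in range(a):
                    if i + 1 < a:
                        link(('acc', r, c, i, j), ('acc', r, c, i + 1, j))
                    if j + 1 < a:
                        link(('acc', r, c, i, j), ('acc', r, c, i, j + 1))
            # Board-edge accelerators attach to the row and column planes,
            # giving each board two independent dimensions of off-board bandwidth.
            for k in range(a):
                link(('acc', r, c, k, 0), ('row_sw', r))      # west edge  -> row plane
                link(('acc', r, c, k, a - 1), ('row_sw', r))  # east edge  -> row plane
                link(('acc', r, c, 0, k), ('col_sw', c))      # north edge -> column plane
                link(('acc', r, c, a - 1, k), ('col_sw', c))  # south edge -> column plane
    return adj

if __name__ == '__main__':
    adj = build_toy_hammingmesh()
    n_acc = sum(1 for v in adj if v[0] == 'acc')
    n_links = sum(len(nbrs) for nbrs in adj.values()) // 2
    print(f"accelerators: {n_acc}, links: {n_links}")
```

In the actual design the per-row and per-column fabrics are full networks rather than single switches, so adjacent boards can be composed into larger virtual 2D tori for the two parallelism dimensions; the single-switch collapse here is purely to keep the sketch short.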

Supplementary Material

MP4 File (SC22_Presentation_Hoefler.mp4)
Presentation at SC '22



Published In

SC '22: Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis
November 2022
1277 pages
ISBN: 9784665454445

In-Cooperation

  • IEEE CS

Publisher

IEEE Press

Author Tags

  1. clusters
  2. deep learning
  3. network architecture
  4. software defined networking


