Research article · DOI: 10.5555/3571885.3571899

HammingMesh: a network topology for large-scale deep learning

Published: 18 November 2022

Abstract

Numerous microarchitectural optimizations unlocked tremendous processing power for deep neural networks that in turn fueled the AI revolution. With the exhaustion of such optimizations, the growth of modern AI is now gated by the performance of training systems, especially their data movement. Instead of focusing on single accelerators, we investigate data-movement characteristics of large-scale training at full system scale. Based on our workload analysis, we design HammingMesh, a novel network topology that provides high bandwidth at low cost with high job scheduling flexibility. Specifically, HammingMesh can support full bandwidth and isolation to deep learning training jobs with two dimensions of parallelism. Furthermore, it also supports high global bandwidth for generic traffic. Thus, HammingMesh will power future large-scale deep learning systems with extreme bandwidth requirements.
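
The abstract's claim of full-bandwidth, isolated allocations along two dimensions of parallelism stems from HammingMesh's structure: small 2D-mesh accelerator boards whose board rows and board columns are each joined by their own switch fabric. The sketch below is a minimal toy model of that idea, not the paper's reference construction: each row/column fabric is collapsed into a single logical switch node, and all sizes (R, C, a) and node names are illustrative assumptions.

```python
# Illustrative sketch only: a toy model of a HammingMesh-style fabric in which
# accelerator boards (each an a x a 2D mesh) form an R x C grid, and every
# board row / board column is joined by its own switch plane. Parameters and
# the graph representation are assumptions, not the paper's construction.
from collections import defaultdict

def build_toy_hammingmesh(R=4, C=4, a=2):
    """Return an undirected adjacency list over nodes:
       ('acc', r, c, i, j)  -- accelerator (i, j) on board (r, c)
       ('row_sw', r)        -- switch plane joining all boards in board-row r
       ('col_sw', c)        -- switch plane joining all boards in board-column c
    """
    adj = defaultdict(set)

    def link(u, v):
        adj[u].add(v)
        adj[v].add(u)

    for r in range(R):
        for c in range(C):
            # Intra-board 2D mesh links between neighboring accelerators.
            for i in range(a):
                for j in range(a):
                    if i + 1 < a:
                        link(('acc', r, c, i, j), ('acc', r, c, i + 1, j))
                    if j + 1 < a:
                        link(('acc', r, c, i, j), ('acc', r, c, i, j + 1))
            # Board-edge accelerators attach to the row and column planes,
            # giving each board two independent dimensions of off-board bandwidth.
            for k in range(a):
                link(('acc', r, c, k, 0), ('row_sw', r))      # west edge  -> row plane
                link(('acc', r, c, k, a - 1), ('row_sw', r))  # east edge  -> row plane
                link(('acc', r, c, 0, k), ('col_sw', c))      # north edge -> column plane
                link(('acc', r, c, a - 1, k), ('col_sw', c))  # south edge -> column plane
    return adj

if __name__ == '__main__':
    adj = build_toy_hammingmesh()
    n_acc = sum(1 for v in adj if v[0] == 'acc')
    n_links = sum(len(nbrs) for nbrs in adj.values()) // 2
    print(f"accelerators: {n_acc}, links: {n_links}")
```

In the actual design the per-row and per-column fabrics are full networks rather than single switches, so adjacent boards can be composed into larger virtual 2D tori for the two parallelism dimensions; the single-switch collapse here is purely to keep the sketch short.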

Supplementary Material

MP4 File (SC22_Presentation_Hoefler.mp4)
Presentation at SC '22



Published In

SC '22: Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis
November 2022
1277 pages
ISBN: 9784665454445

In-Cooperation

  • IEEE CS

Publisher

IEEE Press

Author Tags

  1. clusters
  2. deep learning
  3. network architecture
  4. software defined networking


