ABSTRACT
100Gbps networking has finally arrived, and many research and educational institutions have begun to deploy 100Gbps routers and services. ESnet and Internet2 worked together to make 100Gbps networks available to researchers at the Supercomputing 2011 conference in Seattle, Washington. In this paper, we describe two of the first applications to take advantage of this network. We demonstrate a visualization application that enables remotely located scientists to gain insights from large datasets. We also demonstrate climate data movement and analysis over the 100Gbps network. We describe a number of application design issues and host tuning strategies necessary for enabling applications to scale to 100Gbps rates.

Categories and Subject Descriptors
C.2.5 [Local and Wide-Area Networks]: High-speed; C.2.4 [Distributed Systems]: Distributed applications

General Terms
Performance

Keywords
100Gbps Networking, Data Intensive Distributed Computing, Data Movement, Visualization

Copyright 2012 Association for Computing Machinery. ACM acknowledges that this contribution was authored or co-authored by an employee, contractor or affiliate of the U.S. Government. As such, the Government retains a nonexclusive, royalty-free right to publish or reproduce this article, or to allow others to do so, for Government purposes only. DIDC'12, June 18, 2012, Delft, The Netherlands. Copyright 2012 ACM 978-1-4503-1341-4/12/06 ...$10.00.

1. INTRODUCTION
Modern scientific simulations and experiments produce an unprecedented amount of data. End-to-end infrastructure is required to store, transfer and analyze these datasets to gain scientific insights. While there has been a lot of progress in computational hardware, distributed applications have been hampered by the lack of high-speed networks. Today, we have finally crossed the barrier of 100Gbps networking; these networks are increasingly becoming available to researchers, opening up new avenues for tackling large data challenges.

When we made a similar leap from 1Gbps to 10Gbps about 10 years ago, distributed applications did not automatically run 10 times faster just because there was more bandwidth available. The same is true today with the leap from 10Gbps to 100Gbps networks. One needs to pay close attention to application design and host tuning in order to take advantage of the higher network capacity. Some of these issues are similar to those of 10 years ago, such as I/O pipelining and TCP tuning, but some are different because many more CPU cores are now involved.

ESnet and Internet2, the two largest research and education network providers in the USA, worked together to make 100Gbps networks available to researchers at the Supercomputing 2011 (SC11) conference in Seattle, Washington, in November 2011. This network, shown in Figure 1, included a 100Gbps connection between the National Energy Research Scientific Computing Center (NERSC) at Lawrence Berkeley National Laboratory (LBNL) in Oakland, CA, Argonne National Laboratory (ANL) near Chicago, IL, and Oak Ridge National Laboratory (ORNL) in Tennessee.

[Figure 1: 100Gbps Network for Supercomputing 2011. The map shows router sites and optical regen sites linking NERSC, SUNN, SALT, STAR, ANL, ORNL, AOFA, and the SC11 venue.]

In this paper, we describe two of the first applications to take advantage of this network. The first application demonstrates real-time streaming and visualization of a 600 Gigabyte cosmology dataset. We illustrate how enhanced network capability enables remotely located scientists to gain insights from large data volumes. The second application showcases data distribution for climate science. We demonstrate how scientific data movement and analysis between geographically disparate supercomputing facilities can benefit from high-bandwidth networks.

The paper is organized as follows. First, we briefly review background information and provide details about our testbed configuration for the SC11 demonstrations. Next, we provide technical information and describe the optimization strategies used in the visualization demo. We then describe a climate data movement application and introduce a data streaming tool for high-bandwidth networks. We describe how the application design needed to be modified to scale to 100Gbps. We then discuss a number of Linux host tuning strategies needed to achieve these rates. Finally, we state lessons learned in end-system configuration and application design to fully utilize the underlying network capacity, and conclude with a brief evaluation and future directions in the use of 100Gbps networks.
2. BACKGROUND

2.1 The Need for 100Gbps Networks
Modern science is increasingly data-driven and collaborative in nature. Large-scale simulations and instruments produce petabytes of data, which is subsequently analyzed by tens to thousands of geographically dispersed scientists. Although it might seem logical and efficient to collocate the analysis resources with the source of the data (an instrument or a computational cluster), this is not the likely scenario. Distributed solutions, in which components are scattered geographically, are much more common at this scale for a variety of reasons, and the largest collaborations are most likely to depend on distributed architectures.

The Large Hadron Collider (LHC, http://lhc.web.cern.ch/lhc/), the most well-known high-energy physics collaboration, was a driving force in the deployment of high bandwidth connections in the research and education world. Early on, the LHC community understood the challenges presented by their extraordinary instrument in terms of data generation, distribution, and analysis.

Many other research disciplines are now facing the same challenges. The cost of genomic sequencing is falling dramatically, for example, and the volume of data produced by sequencers is rising exponentially. In climate science, researchers must analyze observational and simulation data sets located at facilities around the world. Climate data is expected to exceed 100 exabytes by 2020 [5]. The need for productive access to such data led to the development of the Earth System Grid (ESG, http://www.earthsystemgrid.org) [9], a global workflow infrastructure giving climate scientists access to data sets housed at modeling centers on multiple continents, including North America, Europe, Asia, and Australia.

Efficient tools are necessary to move vast amounts of scientific data over high-bandwidth networks for such state-of-the-art collaborations. We evaluate climate data distribution over high-latency, high-bandwidth networks, and state the necessary steps to scale up climate data movement to 100Gbps networks. We have developed a new data streaming tool that provides dynamic data channel management and on-the-fly data pipelines for fast and efficient data access. Data is treated as a first-class citizen for the entire spectrum of file sizes, without compromising on optimum usage of network bandwidth. In our demonstration, we successfully staged real-world data from the Intergovernmental Panel on Climate Change (IPCC) Fourth Assessment Report (AR4) Phase 3, Coupled Model Intercomparison Project (CMIP-3, archived at PCMDI, http://www-pcmdi.llnl.gov/ipcc/), from NERSC data storage into computing nodes across the country at ANL and ORNL over the 100Gbps network in real time.

2.2 Visualization over 100Gbps
Modern simulations produce massive datasets that need further analysis and visualization. Often, these datasets cannot be moved from the machines on which the simulations are conducted. One has to resort to in situ analysis (i.e., conduct analysis while the simulation is running) or remote rendering (i.e., run a client on a local workstation, and render the data at the supercomputing center). While these modes of operation are often desirable, a class of researchers would much rather prefer to stream the datasets to their local workstations or facilities, and conduct a broad range of visualization and analysis tasks locally. With the availability of the 100Gbps network, this mode of analysis is now feasible. To justify this claim, we demonstrate real-time streaming of a large multi-terabyte dataset in a few minutes from DOE's production supercomputing facility, NERSC, to four commodity workstations at SC11 in Seattle. For illustration purposes, we then demonstrate real-time parallel visualization of the same dataset.

[Figure 2: SC11 100Gbps Demo Configuration]

3. 100Gbps TEST ENVIRONMENT
We performed our tests using a wide array of resources from DOE's Advanced Network Initiative (ANI, http://www.es.net/RandD/advanced-networking-initiative/) network and testbed (http://sites.google.com/a/lbl.gov/ani-testbed/), and the DOE Magellan Project [16]. The ANI network is a prototype 100Gbps network connecting DOE's three supercomputing centers: the National Energy Research Scientific Computing Center (NERSC, http://www.nersc.gov), the Argonne Leadership Class Facility (ALCF, http://www.alcf.anl.gov), and the Oak Ridge Leadership Class Facility (OLCF, http://www.olcf.ornl.gov). The ANI testbed includes high-speed hosts at both NERSC and ALCF. The Magellan project included large clusters at both NERSC and ALCF. 16 hosts at NERSC were designated as I/O nodes to be connected to the 100Gbps ANI network.

For the annual Supercomputing conference (SC11) held in Seattle, WA, ESnet (http://www.es.net) and Internet2 (http://www.internet2.edu) worked together to bring a 100Gbps link from Seattle to Salt Lake City, where it was connected to ESnet's ANI network, as shown in Figure 1.

We utilized 16 hosts at NERSC to send data, each with 2 quad-core Intel Nehalem processors and 48 GB of system memory. In addition to the regular disk-based GPFS file system (http://www.ibm.com/systems/software/gpfs), these hosts are also connected via Infiniband to a flash-based file system for sustained I/O performance during the demonstration. The complete system, including the hosts and the GPFS filesystem, can sustain
an aggregated 16 GBytes/second read performance. Each host is equipped with a Chelsio 10Gbps NIC, which is connected to the NERSC Alcatel-Lucent router.

We utilized 12 hosts at OLCF to receive data, each with 24GB of RAM and a Myricom 10GE NIC. These were all connected to a 100Gbps Juniper router. We used 14 hosts at ALCF to receive data, each with 48GB of RAM and a Mellanox 10GE NIC. These hosts were connected to a 100Gbps Brocade router. Each host at ALCF and OLCF had 2 quad-core Intel Nehalem processors. We measured a round-trip time (RTT) of 50ms between NERSC and ALCF, and 64ms between NERSC and OLCF. We used four hosts in the SC11 LBL booth, each with two 8-core AMD processors and 64 GB of memory. Each host is equipped with Myricom 10Gbps network adaptors, one dual-port and two single-port, connected to a 100Gbps Alcatel-Lucent router at the booth. Figure 2 shows the hosts that were used for the two 100Gbps applications at SC11.

4. VISUALIZING THE UNIVERSE AT 100Gbps
Computational cosmologists routinely conduct large scale simulations to test theories of the formation (and evolution) of the universe. Ensembles of calculations with various parametrizations of dark energy, for instance, are conducted on thousands of computational cores at supercomputing centers. The resulting datasets are visualized to understand large scale structure formation, and analyzed to check if the simulations are able to reproduce known observational statistics. In this demonstration, we used a modern cosmological dataset produced by the NYX code (https://ccse.lbl.gov/Research/NYX/index.html). The computational domain is 1024^3 in size; each location contains a single precision floating point value corresponding to the dark matter density at that grid point. Each timestep corresponds to 4GB of data. We utilize 150 timesteps for our demo purposes.
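For reference, the sizes quoted here and in the introduction follow directly from the grid dimensions, assuming 4-byte single-precision values:

    1024^3 points x 4 bytes per point  =  4 GiB per timestep
    150 timesteps x 4 GB per timestep  ~  600 GB for the full dataset

The second figure is the "600 Gigabyte cosmology dataset" referred to in the introduction.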
To demonstrate the difference between the 100Gbps network and the previous 10Gbps network, we split the 100Gbps connection into two parts. 90Gbps of the bandwidth is used to transfer the full dataset, and 10Gbps of the bandwidth is used to transfer 1/8th of the same dataset at the same resolution. By comparing the real-time, head-to-head streaming and rendering results of the two cases, the enhanced capabilities of the 100Gbps network are clearly demonstrated.

4.1 Demo Configuration
Figure 3 illustrates the hardware configuration used for this demo. On the NERSC side, the 16 servers described above, named "Sender 01-16", are used to send data. The data resides on the GPFS file system. In the LBL booth, four hosts, named "Receive/Render H1-H4", are used to receive data for the high bandwidth part of the demo. Each server has two 8-core AMD processors and 64 GB of system memory. Each host is equipped with 2 Myricom dual-port 10Gbps network adaptors, which are connected to the booth Alcatel-Lucent router via optical fibers. The "Receive/Render" servers are connected to the "High Bandwidth Vis Server" via 1Gbps Ethernet connections. The 1Gbps connection is used for synchronization and communication of the rendering application, not for transfer of the raw data. An HDTV is connected to this server to display rendered images. For the low bandwidth part of the demo, one server, named "Low Bandwidth Receive/Render/Vis", is used to receive and render data. An HDTV is also connected to this server to display rendered images. The low bandwidth host is equipped with 1 dual-port 10Gbps network adaptor, which is connected to the booth router via 2 optical fibers. The one-way latency from NERSC to the LBL booth was measured at 16.4 ms.

4.2 UDP shuffling
Prior work by Bethel et al. [6] has demonstrated that the TCP protocol is ill-suited for applications that need sustained high-throughput utilization of a high-latency network channel. For visualization purposes, occasional packet loss is acceptable; we therefore follow the approach of Visapult [6] and use the UDP protocol to transfer the data for this demo.

We prepared UDP packets by adding position (x, y, z) information in conjunction with the density information. While this adds three position values for every density value, increasing the streamed data to a total of 16GB per timestep, it makes the task of placing each received element at the right memory offset trivial. It also allowed us to experiment with different data decomposition schemes (z-ordered space-filling curves as opposed to a z-slice based ordering) without any change in the packet packing/unpacking logic.

As shown in Figure 4, a UDP packet contains a header followed by a series of quad-value segments. The batch number in the header is used for synchronization purposes, i.e., packets from different time steps have different batch numbers. An integer n is also included in the header to specify the number of quad-value segments in the packet. Each quad-value segment consists of three integers, which give the X, Y and Z position in the 1024^3 matrix, and one float value, which is the particle density at that position. To maximize the packet size within the MTU of 9000 bytes, n is set to 560, which gives a packet size of 8968 bytes, the largest possible size under the 8972-byte limit (MTU minus IP and UDP headers) with this data structure.
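The layout above can be summarized in a short C sketch. The field names and the 32-bit widths are our assumptions (the paper does not show the actual packing code), but the arithmetic matches the 8968-byte packets described in the text:

    #include <stdint.h>

    /* One quad-value segment: grid position plus dark matter density. */
    struct quad_value {
        int32_t x, y, z;   /* position in the 1024^3 grid           */
        float   density;   /* single-precision density at (x, y, z) */
    };                     /* 16 bytes per segment                  */

    #define SEGMENTS_PER_PACKET 560

    /* Header (batch number identifies the time step, n counts the
     * segments actually used) followed by the payload:
     *   8 + 560 * 16 = 8968 bytes,
     * which stays under the 8972-byte UDP payload limit for a
     * 9000-byte MTU (9000 - 20 bytes IPv4 - 8 bytes UDP). */
    struct shuffle_packet {
        uint32_t batch;                              /* time step id  */
        uint32_t n;                                  /* segments used */
        struct quad_value seg[SEGMENTS_PER_PACKET];
    };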
For each time step, the input data is split into 32 streams along the z-direction; each stream contains a contiguous slice of size 1024 x 1024 x 32. Each stream is staged, streamed and received separately, for reliability and to maximize parallel throughput. Figure 5 shows the flow of data for one stream. A stager first reads the data into a memory-backed file system (/dev/shm); it is optimized to reach the maximum read performance of the underlying file system. The stager also buffers as many future time steps as possible, to minimize the effect of filesystem load variation.

A shuffler then opens the staged file from /dev/shm and transmits its contents as UDP packets. After the shuffling is finished, the file is removed from /dev/shm, so that the stager can stage in a future time step. To control the rate of each UDP stream, we use the Rate Control tool developed for the Visapult project [6]. Rate Control can accurately calibrate the data transmission rate of the UDP stream to the computational horsepower of the CPU core.
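The Rate Control tool itself is not shown in the paper; as a rough stand-in, the sketch below paces a UDP stream to a target bit rate by sleeping whenever the sender gets ahead of schedule. Socket setup and error handling are omitted, and the function and parameter names are ours:

    #include <stddef.h>
    #include <sys/socket.h>
    #include <time.h>

    /* Send `len` bytes from `buf` as fixed-size UDP packets on an already
     * connect()ed socket, pacing the stream to `target_gbps`. */
    static void shuffle_paced(int sock, const char *buf, size_t len,
                              size_t pkt_bytes, double target_gbps)
    {
        double bits_sent = 0.0;
        struct timespec start, now;
        clock_gettime(CLOCK_MONOTONIC, &start);

        for (size_t off = 0; off < len; off += pkt_bytes) {
            size_t chunk = (len - off < pkt_bytes) ? len - off : pkt_bytes;
            send(sock, buf + off, chunk, 0);
            bits_sent += 8.0 * chunk;

            /* If we are ahead of the target rate, sleep until we are not. */
            clock_gettime(CLOCK_MONOTONIC, &now);
            double elapsed = (now.tv_sec - start.tv_sec) +
                             (now.tv_nsec - start.tv_nsec) / 1e9;
            double ahead = bits_sent / (target_gbps * 1e9) - elapsed;
            if (ahead > 0) {
                struct timespec ts = { (time_t)ahead,
                                       (long)((ahead - (time_t)ahead) * 1e9) };
                nanosleep(&ts, NULL);
            }
        }
    }

With pkt_bytes set to 8968 and target_gbps set to 2.81 or 2.5, this corresponds to the per-stream rates used in the high and low bandwidth demos described below.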
The receiver task allocates a region in /dev/shm upon initialization, corresponding to the size of its slice. For each UDP packet it receives from the transmitted stream, the receiver decodes the packet and places the particle density values at the proper offsets in shared memory. The rendering software spawns 32 processes across the Receive/Render servers; each process opens the corresponding data slice from /dev/shm in read-only mode and renders the data to produce an image.
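A minimal sketch of the receiver side of this arrangement is shown below. The file name and the x/y/z-to-offset convention are assumptions (the paper only states that every segment carries its own position), but it illustrates how a /dev/shm-backed mapping lets the receiver and the renderer share a slice without copying:

    #include <fcntl.h>
    #include <stddef.h>
    #include <sys/mman.h>
    #include <unistd.h>

    #define NX 1024
    #define NY 1024
    #define NZ_SLICE 32   /* each of the 32 streams owns a 1024x1024x32 slice */

    /* Map a slice-sized file under /dev/shm; the render process opens the
     * same path read-only. Error handling is omitted for brevity. */
    static float *map_slice(const char *path)
    {
        size_t bytes = (size_t)NX * NY * NZ_SLICE * sizeof(float);
        int fd = open(path, O_CREAT | O_RDWR, 0644);
        ftruncate(fd, bytes);
        float *slice = mmap(NULL, bytes, PROT_READ | PROT_WRITE,
                            MAP_SHARED, fd, 0);
        close(fd);            /* the mapping remains valid after close */
        return slice;
    }

    /* Deposit one decoded quad-value segment; z0 is the first z-plane
     * owned by this stream. */
    static void deposit(float *slice, int x, int y, int z, int z0,
                        float density)
    {
        slice[((size_t)(z - z0) * NY + y) * NX + x] = density;
    }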
For the high bandwidth demo, the rate of the shufflers is set to 2.81Gbps, so that the total of 32 streams utilizes 90Gbps of the total bandwidth. For the low bandwidth demo, 4 streams are used, transferring 1/8 of the full data set; the rate of these shufflers is set to 2.5Gbps to utilize 10Gbps of the total bandwidth.
[Figure 3: Demo hardware configuration. Sixteen "Sender" hosts at NERSC (on GPFS/Infiniband) feed the NERSC router and the 100G pipe to the LBL booth router; four "Receive/Render" hosts (H1-H4) drive the high bandwidth display and one host drives the low bandwidth display. Infiniband, 10GigE and 1GigE connections are indicated.]
[Figure 4: UDP Packet. A header (Batch#, n) is followed by quad-value segments X1 Y1 Z1 D1 ... Xn Yn Zn Dn; in the final run n = 560 and the packet size is 8968 bytes.]

[Figure 5: Flow of Data. On a send server at NERSC, a stager reads from GPFS/flash into /dev/shm and a shuffler transmits the staged file; on a receive server at the SC booth, a receiver writes into /dev/shm and the render software reads from it.]

4.3 Synchronization Strategy
The synchronization is performed at the NERSC end. All shufflers, the 32 for the high bandwidth demo and the 4 for the low bandwidth demo, listen on a UDP port for a synchronization packet. Each synchronization packet, sent out by a controller running on a NERSC host, contains the location of the next file to shuffle out. Upon receiving this synchronization packet, a shuffler will stop shuffling the current time step (if it is unfinished) and start shuffling the next time step, until it has shuffled all data in that time step or receives the next synchronization packet. This mechanism ensures that all the shufflers, receivers, and renderers are synchronized to the same time step.

We also made an important decision to decouple the streaming tasks from the rendering tasks on each host. The cores responsible for unpacking UDP packets place the data into a memory-mapped file location. This mmap'ed region is dereferenced in the rendering processes. There is no communication or synchronization between the rendering tasks and streaming tasks on each node.
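The shuffler's side of this protocol can be sketched as a non-blocking check on the synchronization socket between outgoing data packets. The payload format is an assumption; the paper only says that the packet carries the location of the next file to shuffle:

    #include <stdio.h>
    #include <sys/socket.h>

    /* Poll the synchronization socket. If the controller has announced a
     * new time step, copy its file location into `next_path` and return 1
     * so the caller abandons the current time step; otherwise return 0. */
    static int poll_sync(int sync_sock, char *next_path, size_t cap)
    {
        char msg[512];
        ssize_t n = recv(sync_sock, msg, sizeof(msg) - 1, MSG_DONTWAIT);
        if (n <= 0)
            return 0;              /* nothing new; keep shuffling */
        msg[n] = '\0';
        snprintf(next_path, cap, "%s", msg);
        return 1;
    }

A shuffler would call poll_sync() inside its packet-sending loop and, whenever it returns 1, close the current staged file and reopen the file named by the controller.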
4.4 Rendering
We used ParaView (http://www.paraview.org), an open-source, parallel, high performance scientific visualization package, for rendering the cosmological dataset. We used a ray-casting based volume rendering technique to produce the images shown in Figure 6. The cubic volume is decomposed in z-slice order into 4 segments and streamed to individual rendering nodes. ParaView uses 8 cores on each rendering node to produce intermediate images and then composites the image using sort-last rendering over a local 10Gbps network. The final image is displayed on a front-end node connected to a display.

Since the streaming tasks are decoupled from the rendering tasks, ParaView is essentially asked to volume render images as fast as possible in an endless loop. It is possible, and we do observe, that artifacts appear in the rendering as the real-time streams deposit data into different regions in memory. In practice, the artifacts are not distracting. We acknowledge that one might want to adopt a different mode of rendering (using pipelining and multiple buffers) to stream data corresponding to different timesteps into distinct regions in memory.

4.5 Optimizations
On the 16 sending servers, only 2-3 stagers and 2-3 shufflers are running at any given time; the load is relatively light and no special tuning is necessary to sustain the 2-3 UDP streams (<3Gbps each). On both the high bandwidth and low bandwidth receive/render servers, the following optimizations are implemented (as shown in Figure 7):

• Each 10Gbps NIC in the system is bound to a specific core by assigning all of its interrupts to that core. For the servers with 4 ports, each NIC is bound to a core in a different NUMA node;
• Two receivers are bound to each 10Gbps NIC by binding the processes to the same core as the port;
• For each stream, the render process is bound to the same NUMA node as the receiver, but to a different core;
• To minimize NUMA effects, for each stream the memory region in /dev/shm is preallocated to make sure it resides on the same NUMA node as the receiver and rendering processes; a minimal sketch of this pinning and placement is shown below.
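The core binding and first-touch page placement described in the bullets can be expressed with standard Linux calls. The core numbers are illustrative, and relying on first-touch placement (rather than an explicit libnuma call) is our assumption about one way to achieve the preallocation described above:

    #define _GNU_SOURCE
    #include <sched.h>
    #include <string.h>

    /* Pin the calling process (receiver or render process) to one core.
     * In the demo, receivers share the core that handles their NIC's
     * interrupts, and render processes get another core on the same
     * NUMA node. */
    static int pin_to_core(int core)
    {
        cpu_set_t set;
        CPU_ZERO(&set);
        CPU_SET(core, &set);
        return sched_setaffinity(0, sizeof(set), &set);  /* 0 = self */
    }

    /* Touch every page of the freshly mapped /dev/shm slice *after*
     * pinning. Under Linux's default first-touch policy, the pages are
     * then allocated on the NUMA node of the pinned core. */
    static void prefault_local(void *region, size_t bytes)
    {
        memset(region, 0, bytes);
    }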
[Figure 6: Volume rendering of a timestep from the cosmology dataset. The 90Gbps stream is shown on the left, 10Gbps on the right.]
achieve more than 33Gbps.

Optimal utilization of the network bandwidth on modern Linux hosts requires a fair amount of tuning. There are several studies on

100Gbps link with only 10 TCP sockets, one per 10Gbps NIC. We used four 10GE NICs in two of the hosts, and two (out of 4) 10GE NICs in the third host, as shown in Figure 13. The two applications described in this paper used some, but not all, of these techniques, as they both used more than 10 hosts on each end instead of just 3, thereby requiring fewer tuning optimizations. Using all of the tuning optimizations described below, we were able to achieve a total of 60 Gbps throughput (30 Gbps in each direction) on a single host with four 10GE NICs; 60 Gbps is the limit of the PCI bus. We were also able to achieve 94 Gbps TCP and 97 Gbps UDP throughput in one direction using just 10 sockets. UDP is CPU bound on the send host, indicating that just under 10Gbps per flow is the UDP limit of today's CPU cores regardless of NIC speed.

[Figure 13: ANI 100Gbps Testbed Configuration used for Host Tuning Experiments. Three testbed hosts attach via 4x10GE (MM) links to 100G routers on the ANI 100G network.]

[Figure 14: Host Tuning Results. Throughput in Gbps without and with tuning:]

    Tuning technique             Without tuning (Gbps)   With tuning (Gbps)   Improvement (%)
    Interrupt coalescing (TCP)   24                      36.8                 53.3
    Interrupt coalescing (UDP)   21.1                    38.8                 83.9
    IRQ binding (TCP)            30.6                    36.8                 20.3
    IRQ binding (UDP)            27.9                    38.5                 38.0
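The IRQ binding rows in the table correspond to steering a NIC's interrupts to a chosen core, which on Linux is done by writing a CPU mask to /proc/irq/<n>/smp_affinity (the interrupt numbers come from /proc/interrupts). The sketch below shows that mechanism; the IRQ number and mask passed in would depend on the host, and interrupt coalescing itself is normally adjusted separately with ethtool:

    #include <stdio.h>

    /* Steer interrupt `irq` to the CPUs in `mask`, a hexadecimal CPU
     * bitmask (e.g. 0x4 for core 2). Requires root privileges. */
    static int bind_irq(int irq, unsigned long mask)
    {
        char path[64];
        snprintf(path, sizeof(path), "/proc/irq/%d/smp_affinity", irq);
        FILE *f = fopen(path, "w");
        if (!f)
            return -1;
        fprintf(f, "%lx\n", mask);
        return fclose(f);
    }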
[3] M. Balman and T. Kosar. Data scheduling for large scale distributed applications. In Proceedings of the 5th ICEIS Doctoral Consortium, in conjunction with the International Conference on Enterprise Information Systems (ICEIS'07), 2007.

[4] M. Balman and T. Kosar. Dynamic adaptation of parallelism level in data transfer scheduling. Complex, Intelligent and Software Intensive Systems, International Conference, 0:872-877, 2009.

[5] BES Science Network Requirements, Report of the Basic Energy Sciences Network Requirements Workshop. Basic Energy Sciences Program Office, DOE Office of Science and the Energy Sciences Network, 2007.

[6] E. W. Bethel. Visapult – A Prototype Remote and Distributed Application and Framework. In Proceedings of Siggraph 2000 – Applications and Sketches. ACM/Siggraph, July 2000.

[20] A. Sim, M. Balman, D. Williams, A. Shoshani, and V. Natarajan. Adaptive transfer adjustment in efficient bulk data transfer management for climate dataset. In Parallel and Distributed Computing and Systems, 2010.

[21] T. Yoshino et al. Performance Optimization of TCP/IP over 10 Gigabit Ethernet by Precise Instrumentation. In Proceedings of the ACM/IEEE Conference on Supercomputing, 2008.

[22] D. Thain and C. Moretti. Efficient access to many small files in a filesystem for grid computing. In Proceedings of the 8th IEEE/ACM International Conference on Grid Computing, GRID '07, pages 243-250, Washington, DC, USA, 2007. IEEE Computer Society.

[23] W. Wu, P. Demar, and M. Crawford. Sorting reordered packets with interrupt coalescing. Comput. Netw., 53:2646-2662, October 2009.

[24] E. Yildirim, M. Balman, and T. Kosar. Dynamically tuning level of parallelism in wide area data transfers. In Proceedings of the 2008 International Workshop on Data-aware Distributed Computing, DADC '08, pages 39-48. ACM, 2008.

17 iperf: http://iperf.sourceforge.net/
18 nuttcp: http://www.nuttcp.net