


Experiences with 100Gbps Network Applications

Mehmet Balman, Eric Pouyoul, Yushu Yao, E. Wes Bethel


Burlen Loring, Prabhat, John Shalf, Alex Sim, and Brian L. Tierney
Lawrence Berkeley National Laboratory
One Cyclotron Road
Berkeley, CA, 94720, USA
{mbalman,epouyoul,yyao,ewbethel,bloring,prabhat,jshalf,asim,btierney}@lbl.gov

Figure 1: 100Gbps Network for Supercomputing 2011 (ESnet/Internet2 path linking NERSC, ANL, ORNL, STAR, AOFA, SALT, SUNN, and the SC11 show floor in Seattle; router sites and optical regen sites are marked)

ABSTRACT
100Gbps networking has finally arrived, and many research and educational institutions have begun to deploy 100Gbps routers and services. ESnet and Internet2 worked together to make 100Gbps networks available to researchers at the Supercomputing 2011 conference in Seattle, Washington. In this paper, we describe two of the first applications to take advantage of this network. We demonstrate a visualization application that enables remotely located scientists to gain insights from large datasets. We also demonstrate climate data movement and analysis over the 100Gbps network. We describe a number of application design issues and host tuning strategies necessary for enabling applications to scale to 100Gbps rates.

Categories and Subject Descriptors
C.2.5 [Local and Wide-Area Networks]: High-speed; C.2.4 [Distributed Systems]: Distributed applications

General Terms
Performance

Keywords
100Gbps Networking, Data Intensive Distributed Computing, Data Movement, Visualization

1. INTRODUCTION
Modern scientific simulations and experiments produce an unprecedented amount of data. End-to-end infrastructure is required to store, transfer and analyze these datasets to gain scientific insights. While there has been a lot of progress in computational hardware, distributed applications have been hampered by the lack of high-speed networks. Today, we have finally crossed the barrier of 100Gbps networking; these networks are increasingly becoming available to researchers, opening up new avenues for tackling large data challenges.
When we made a similar leap from 1Gbps to 10Gbps about 10 years ago, distributed applications did not automatically run 10 times faster just because there was more bandwidth available. The same is true today with the leap from 10Gbps to 100Gbps networks. One needs to pay close attention to application design and host tuning in order to take advantage of the higher network capacity. Some of these issues are similar to those of 10 years ago, such as I/O pipelining and TCP tuning, but some are different because many more CPU cores are now involved.
ESnet and Internet2, the two largest research and education network providers in the USA, worked together to make 100Gbps networks available to researchers at the Supercomputing 2011 (SC11) conference in Seattle, Washington, in November 2011. This network, shown in Figure 1, included a 100Gbps connection between the National Energy Research Scientific Computing Center (NERSC) at Lawrence Berkeley National Laboratory (LBNL) in Oakland, CA, Argonne National Laboratory (ANL) near Chicago, IL, and Oak Ridge National Laboratory (ORNL) in Tennessee.
In this paper, we describe two of the first applications to take advantage of this network. The first application demonstrates real-time streaming and visualization of a 600 Gigabyte cosmology dataset. We illustrate how enhanced network capability enables remotely located scientists to gain insights from large data volumes. The second application showcases data distribution for climate science. We demonstrate how scientific data movement and analysis between geographically disparate supercomputing facilities can benefit from high-bandwidth networks.
The paper is organized as follows. First, we briefly review background information and provide details about our testbed configuration for the SC11 demonstrations. Next, we provide technical information and optimization strategies utilized in the visualization demo. We then describe a climate data movement application and introduce a data streaming tool for high-bandwidth networks. We describe how the application design needed to be modified to scale to 100Gbps, and we discuss a number of Linux host tuning strategies needed to achieve these rates. Finally, we state lessons learned in end-system configuration and application design to fully utilize the underlying network capacity, and conclude with a brief evaluation and future directions in the use of 100Gbps networks.

Copyright 2012 Association for Computing Machinery. ACM acknowledges that this contribution was authored or co-authored by an employee, contractor or affiliate of the U.S. Government. As such, the Government retains a nonexclusive, royalty-free right to publish or reproduce this article, or to allow others to do so, for Government purposes only.
DIDC'12, June 18, 2012, Delft, The Netherlands.
Copyright 2012 ACM 978-1-4503-1341-4/12/06 ...$10.00.
2. BACKGROUND

2.1 The Need for 100Gbps Networks
Modern science is increasingly data-driven and collaborative in nature. Large-scale simulations and instruments produce petabytes of data, which is subsequently analyzed by tens to thousands of geographically dispersed scientists. Although it might seem logical and efficient to collocate the analysis resources with the source of the data (instrument or a computational cluster), this is not the likely scenario. Distributed solutions – in which components are scattered geographically – are much more common at this scale, for a variety of reasons, and the largest collaborations are most likely to depend on distributed architectures.
The Large Hadron Collider (LHC, http://lhc.web.cern.ch/lhc/), the most well-known high-energy physics collaboration, was a driving force in the deployment of high bandwidth connections in the research and education world. Early on, the LHC community understood the challenges presented by their extraordinary instrument in terms of data generation, distribution, and analysis.
Many other research disciplines are now facing the same challenges. The cost of genomic sequencing is falling dramatically, for example, and the volume of data produced by sequencers is rising exponentially. In climate science, researchers must analyze observational and simulation data sets located at facilities around the world. Climate data is expected to exceed 100 exabytes by 2020 [5]. The need for productive access to such data led to the development of the Earth System Grid (ESG, http://www.earthsystemgrid.org) [9], a global workflow infrastructure giving climate scientists access to data sets housed at modeling centers on multiple continents, including North America, Europe, Asia, and Australia.
Efficient tools are necessary to move vast amounts of scientific data over high-bandwidth networks for such state-of-the-art collaborations. We evaluate climate data distribution over high-latency high-bandwidth networks, and state the necessary steps to scale up climate data movement to 100Gbps networks. We have developed a new data streaming tool that provides dynamic data channel management and on-the-fly data pipelines for fast and efficient data access. Data is treated as a first-class citizen for the entire spectrum of file sizes, without compromising on optimum usage of network bandwidth. In our demonstration, we successfully staged real-world data from the Intergovernmental Panel on Climate Change (IPCC) Fourth Assessment Report (AR4) Phase 3, Coupled Model Intercomparison Project (CMIP-3; the CMIP3 Multi-Model Dataset Archive at PCMDI, http://www-pcmdi.llnl.gov/ipcc/), into computing nodes across the country at ANL and ORNL from NERSC data storage over the 100Gbps network in real-time.

2.2 Visualization over 100Gbps
Modern simulations produce massive amounts of datasets that need further analysis and visualization. Often, these datasets cannot be moved from the machines that the simulations are conducted on. One has to resort to in situ analysis (i.e. conduct analysis while the simulation is running), or remote rendering (i.e. run a client on a local workstation, and render the data at the supercomputing center). While these modes of operation are often desirable, a class of researchers would much rather prefer to stream the datasets to their local workstations or facilities, and conduct a broad range of visualization and analysis tasks locally. With the availability of the 100Gbps network, this mode of analysis is now feasible. To justify this claim, we demonstrate real-time streaming of a large multi-Terabyte sized dataset in a few minutes from DOE's production supercomputing facility NERSC to four commodity workstations at SC11 in Seattle. For illustration purposes, we then demonstrate real-time parallel visualization of the same dataset.

Figure 2: SC11 100Gbps Demo Configuration

3. 100Gbps TEST ENVIRONMENT
We performed our tests using a wide array of resources from DOE's Advanced Network Initiative (ANI, http://www.es.net/RandD/advanced-networking-initiative/) network and testbed (http://sites.google.com/a/lbl.gov/ani-testbed/), and the DOE Magellan Project [16]. The ANI Network is a prototype 100Gbps network connecting DOE's three supercomputer centers: the National Energy Research Scientific Computing Center (NERSC, http://www.nersc.gov), the Argonne Leadership Class Facility (http://www.alcf.anl.gov), and the Oak Ridge Leadership Class Facility (http://www.olcf.ornl.gov). The ANI Testbed includes high-speed hosts at both NERSC and ALCF. The Magellan project included large clusters at both NERSC and ALCF. 16 hosts at NERSC were designated as I/O nodes to be connected to the 100Gbps ANI network.
At the annual Supercomputing conference (SC11) held in Seattle, WA, ESnet (http://www.es.net) and Internet2 (http://www.internet2.edu) worked together to bring a 100Gbps link from Seattle to Salt Lake City, where it was connected to ESnet's ANI network, as shown in Figure 1.
We utilized 16 hosts at NERSC to send data, each with 2 quad-core Intel Nehalem processors and 48 GB of system memory. In addition to the regular disk-based GPFS file system (http://www.ibm.com/systems/software/gpfs), these hosts are also connected via Infiniband to a Flash-based file system for sustained I/O performance during the demonstration. The complete system, including the hosts and the GPFS filesystem, can sustain
an aggregated 16 GBytes/second read performance. Each host is equipped with a Chelsio 10Gbps NIC which is connected to the NERSC Alcatel-Lucent router.
We utilized 12 hosts at OLCF to receive data, each with 24GB of RAM and a Myricom 10GE NIC. These were all connected to a 100Gbps Juniper router. We used 14 hosts at ALCF to receive data, each with 48GB of RAM and a Mellanox 10GE NIC. These hosts were connected to a 100Gbps Brocade router. Each host at ALCF and OLCF had 2 quad-core Intel Nehalem processors. We measured a round-trip time (RTT) of 50ms between NERSC and ALCF, and 64ms between NERSC and OLCF. We used four hosts in the SC11 LBL booth, each with two 8-core AMD processors and 64 GB of memory. Each host is equipped with Myricom 10Gbps network adaptors, one dual-port and two single-port, connected to a 100Gbps Alcatel-Lucent router at the booth. Figure 2 shows the hosts that were used for the two 100Gbps applications at SC11.

4. VISUALIZING THE UNIVERSE AT 100Gbps
Computational cosmologists routinely conduct large scale simulations to test theories of formation (and evolution) of the universe. Ensembles of calculations with various parametrizations of dark energy, for instance, are conducted on thousands of computational cores at supercomputing centers. The resulting datasets are visualized to understand large scale structure formation, and analyzed to check if the simulations are able to reproduce known observational statistics. In this demonstration, we used a modern cosmological dataset produced by the NYX code (https://ccse.lbl.gov/Research/NYX/index.html). The computational domain is 1024^3 in size; each location contains a single precision floating point value corresponding to the dark matter density at each grid point. Each timestep corresponds to 4GB of data. We utilize 150 timesteps for our demo purposes.
To demonstrate the difference between the 100Gbps network and the previous 10Gbps network, we split the 100Gbps connection into two parts. 90Gbps of the bandwidth is used to transfer the full dataset. 10Gbps of the bandwidth is used to transfer 1/8th of the same dataset at the same resolution. By comparing the real-time head-to-head streaming and rendering results of the two cases, the enhanced capabilities of the 100Gbps network are clearly demonstrated.

4.1 Demo Configuration
Figure 3 illustrates the hardware configuration used for this demo. On the NERSC side, the 16 servers described above, named "Sender 01-16", are used to send data. The data resides on the GPFS file system. In the LBL booth, four hosts, named "Receive/Render H1-H4", are used to receive data for the high bandwidth part of the demo. Each server has two 8-core AMD processors and 64 GB of system memory. Each host is equipped with 2 Myricom dual-port 10Gbps network adaptors which are connected to the booth Alcatel-Lucent router via optical fibers. The "Receive/Render" servers are connected to the "High Bandwidth Vis Server" via 1Gbps ethernet connections. The 1Gbps connection is used for synchronization and communication of the rendering application, not for transfer of the raw data. An HDTV is connected to this server to display rendered images. For the low bandwidth part of the demo, one server, named "Low Bandwidth Receive/Render/Vis", is used to receive and render data. An HDTV is also connected to this server to display rendered images. The low bandwidth host is equipped with 1 dual-port 10Gbps network adaptor which is connected to the booth router via 2 optical fibers. The one-way latency from NERSC to the LBL booth was measured at 16.4 ms.

4.2 UDP Shuffling
Prior work by Bethel, et al. [6] has demonstrated that the TCP protocol is ill-suited for applications that need sustained high-throughput utilization over a high-latency network channel. For visualization purposes, occasional packet loss is acceptable; we therefore follow the approach of VisaPult [6] and use the UDP protocol for transferring the data for this demo.
We prepared UDP packets by adding position (x, y, z) information in conjunction with the density information. While this increases the size of the streamed dataset by a factor of 3 (summing up to a total of 16GB per timestep), it made the task of placing the received element into the right memory offset trivial. Also, we experimented with different data decomposition schemes (z-ordered space filling curves) as opposed to a z-slice based ordering, and this packet format allowed us to experiment with both schemes without any change in the packet packing/unpacking logic.
As shown in Figure 4, a UDP packet contains a header followed by a series of quad-value segments. The header carries a batch number used for synchronization purposes, i.e., packets from different time steps have different batch numbers. An integer n is also included in the header to specify the number of quad-value segments in this packet. Each quad-value segment consists of 3 integers, which are the X, Y and Z position in the 1024^3 matrix, and one float value which is the particle density at this position. To maximize the packet size within the MTU value of 9000, the number n is set to 560, which gives the optimal packet size of 8968 bytes, the largest possible packet size under 8972 bytes (MTU size minus IP and UDP headers) with the above described data structure.
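To make the packet layout concrete, the following is a minimal sketch of how one such 8968-byte datagram could be packed and sent. The two 32-bit header fields, the field ordering, the use of host byte order, and the destination address/port are our assumptions for illustration; the exact layout of the demo code is not specified in the paper beyond the description above.

/* Minimal sketch of packing one quad-value UDP packet as described above.
 * Assumptions (not from the original demo code): the header is two 32-bit
 * integers (batch number, segment count) and values are sent in host byte
 * order between like hosts. 8 + 560*16 = 8968 bytes fits a 9000-byte MTU. */
#include <stdint.h>
#include <string.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <arpa/inet.h>

#define SEGMENTS_PER_PACKET 560                       /* n in the text  */
#define PACKET_BYTES (8 + SEGMENTS_PER_PACKET * 16)   /* 8968 bytes     */

struct quad {                      /* one (x, y, z, density) sample */
    int32_t x, y, z;
    float   density;
};

/* Pack the batch header plus up to 560 samples into buf; returns bytes used. */
static size_t pack_packet(uint8_t *buf, int32_t batch,
                          const struct quad *samples, int32_t count)
{
    memcpy(buf, &batch, 4);
    memcpy(buf + 4, &count, 4);
    memcpy(buf + 8, samples, (size_t)count * sizeof(struct quad));
    return 8 + (size_t)count * sizeof(struct quad);
}

int main(void)
{
    struct quad slice[SEGMENTS_PER_PACKET] = {{0}};
    uint8_t buf[PACKET_BYTES];
    size_t len = pack_packet(buf, 42, slice, SEGMENTS_PER_PACKET);

    /* Stream the packet over UDP; address and port are placeholders. */
    int fd = socket(AF_INET, SOCK_DGRAM, 0);
    struct sockaddr_in dst = {0};
    dst.sin_family = AF_INET;
    dst.sin_port = htons(9000);
    inet_pton(AF_INET, "10.0.0.1", &dst.sin_addr);
    sendto(fd, buf, len, 0, (struct sockaddr *)&dst, sizeof(dst));
    return 0;
}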
For each time step, the input data is split into 32 streams along the z-direction; each stream contains a contiguous slice of size 1024 × 1024 × 32. Each stream is staged, streamed and received separately for the purpose of reliability and maximizing parallel throughput. Figure 5 shows the flow of data for one stream. A stager first reads the data into a memory-backed file system (/dev/shm); it is optimized to reach the maximum read performance of the underlying file system. The stager also buffers as many future time steps as possible, to minimize the effect of filesystem load variation.
A shuffler then opens the staged file from /dev/shm and transmits its contents as UDP packets. After the shuffling is finished, the file is removed from /dev/shm, so that the stager can stage in a future time step. To control the rate of each UDP stream, we use the Rate Control tool developed for the Visapult project [6]. Rate Control can accurately calibrate the data transmission rate of the UDP stream to the computational horsepower of the CPU core.
The receiver task allocates a region in /dev/shm upon initialization, which corresponds to the size of the slice. For each UDP packet it receives in the transmitted stream, the receiver decodes the packet and places the particle density values at the proper offset in shared memory. The rendering software spawns 32 processes across all the Receive/Render servers; each process opens the corresponding data slice from /dev/shm in read-only mode, and renders the data to produce an image.
For the high bandwidth demo, the rate of the shufflers is set to 2.81Gbps, so that the total of 32 streams utilizes 90Gbps of the total bandwidth. For the low bandwidth demo, 4 streams are used, transferring 1/8 of the full data set. The rate of the shufflers is set to 2.5Gbps to utilize 10Gbps of the total bandwidth.
Figure 3: System diagram for the visualization demo at SC11. Sixteen sender hosts at NERSC read from a Flash-based GPFS cluster over Infiniband and stream through the NERSC router and the 100G pipe to the LBL booth router; four Receive/Render hosts (H1-H4) and one Low Bandwidth Receive/Render/Vis host receive the data over 10GigE, with 1GigE connections to the High Bandwidth Vis Server and the two displays.

Figure 4: UDP Packet. Each packet contains a header (batch number, n) followed by n quad-value segments X1Y1Z1D1, X2Y2Z2D2, ..., XnYnZnDn; in the final run n=560 and the packet size is 8968 bytes.

Figure 5: Flow of Data for one stream. On the send server at NERSC, a Stager reads from the Flash-based GPFS into /dev/shm and a Shuffler transmits the staged data; on the receive server at the SC booth, a Receiver writes the decoded data into /dev/shm, where the rendering software reads it.

4.3 Synchronization Strategy
The synchronization is performed at the NERSC end. All shufflers, including 32 for the high bandwidth demo and 4 for the low bandwidth demo, listen on a UDP port for the synchronization packet. Sent out from a controller running on a NERSC host, the synchronization packet contains the location of the next file to shuffle out. Upon receiving this synchronization packet, a shuffler will stop shuffling the current time step (if it is unfinished), and start shuffling the next time step, until it has shuffled all data in the time step, or receives the next synchronization packet. This mechanism ensures all the shufflers, receivers, and renderers are synchronized to the same time step.
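As an illustration of this control flow, the sketch below shows how a shuffler might poll for a synchronization packet between bursts of data packets. The sync packet layout (a 32-bit batch number followed by a file path) and the port number are assumptions for illustration, not taken from the demo code.

/* Sketch: a shuffler polling a UDP control port for synchronization packets.
 * Assumed sync packet layout: a 32-bit batch number followed by the path of
 * the next file to shuffle (illustrative only). */
#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <fcntl.h>

static int open_sync_socket(uint16_t port)
{
    int fd = socket(AF_INET, SOCK_DGRAM, 0);
    struct sockaddr_in addr = {0};
    addr.sin_family = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_ANY);
    addr.sin_port = htons(port);
    bind(fd, (struct sockaddr *)&addr, sizeof(addr));
    fcntl(fd, F_SETFL, O_NONBLOCK);   /* poll without blocking the send loop */
    return fd;
}

/* Returns 1 and fills batch/path if a new sync packet arrived, 0 otherwise. */
static int check_sync(int fd, int32_t *batch, char *path, size_t pathlen)
{
    uint8_t buf[4096];
    ssize_t n = recv(fd, buf, sizeof(buf), 0);
    if (n < 8)
        return 0;                     /* nothing pending (or malformed) */
    memcpy(batch, buf, 4);
    snprintf(path, pathlen, "%.*s", (int)(n - 4), (char *)(buf + 4));
    return 1;
}

int main(void)
{
    int sync_fd = open_sync_socket(5555);   /* port number is a placeholder */
    int32_t batch = -1;
    char path[1024] = "";
    for (;;) {
        if (check_sync(sync_fd, &batch, path, sizeof(path)))
            printf("switching to batch %d, file %s\n", batch, path);
        /* ... shuffle the next chunk of the current file at the target rate ... */
    }
}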
We also made an important decision to decouple the streaming tasks from the rendering tasks on each host. The cores responsible for unpacking UDP packets place the data into a memory-mapped file location. This mmap'ed region is dereferenced in the rendering processes. There is no communication or synchronization between the rendering tasks and streaming tasks on each node.

4.4 Rendering
We used Paraview (http://www.paraview.org), an open-source, parallel, high performance scientific visualization package, for rendering the cosmological dataset. We used a ray-casting based volume rendering technique to produce the images shown in Figure 6. The cubic volume is decomposed in a z-slice order into 4 segments and streamed to individual rendering nodes. Paraview uses 8 cores on each rendering node to produce intermediate images and then composites the image using sort-last rendering over a local 10Gbps network. The final image is displayed on a front-end node connected to a display.
Since the streaming tasks are decoupled from the rendering tasks, Paraview is essentially asked to volume render images as fast as possible in an endless loop. Artifacts in the rendering are possible, and we do observe them, as the real-time streams deposit data into different regions in memory. In practice, the artifacts are not distracting. We acknowledge that one might want to adopt a different mode of rendering (using pipelining and multiple buffers) to stream data, corresponding to different timesteps, into distinct regions in memory.

4.5 Optimizations
On the 16 sending servers, only 2-3 stagers and 2-3 shufflers are running at any given time; the load is relatively light and no special tuning is necessary to sustain the 2-3 UDP streams (<3Gbps each). On both high bandwidth and low bandwidth receive/render servers, the following optimizations are implemented (as shown in Figure 7):

• Each 10Gbps NIC in the system is bound to a specific core by assigning all the interrupts to that core. For the servers with 4 ports, each NIC is bound to a core in a different NUMA node;
• Two receivers are bound to a 10Gbps NIC by binding the processes to the same core as the port;
• For each stream, the render process is bound to the same NUMA node as the receiver, but to a different core;
• To minimize the NUMA effect, for each stream, the memory region in /dev/shm is preallocated to make sure it resides in the same NUMA node as the receivers and rendering processes (a code sketch of this preallocation appears at the end of this subsection).
Figure 6: Volume rendering of a timestep from the cosmology dataset. The 90Gbps stream is shown on the left, 10Gbps on the right

Figure 7: NUMA Binding (within one NUMA node, the 10G port's interrupts and Receiver 1/2 share a core, Render 1/2 run on another core, and the stream's memory resides in the same node).

We experimented with alternative core-binding strategies; however, the packet loss rate is minimized when binding both the NIC and the 2 receivers to the same core. We suspect that this is due to cache misses. For this reason, the main performance limitation is the computational horsepower of the CPU core. A maximum receiving rate of ≈ 2.7Gbps/stream can be reached when one core is handling two receiving streams and all the interrupts from the corresponding NIC port. This causes a ≈ 5% packet loss in the high bandwidth demo, when the shuffling rate is 2.81Gbps/stream. For the low bandwidth demo, the packet loss is smaller than 1 percent.
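The NUMA-aware preallocation referenced in the optimization list above can be sketched as follows. This is a minimal illustration, assuming libnuma is available and that binding the pages and touching them up front is an acceptable placement policy; the file name, slice size, and node number are placeholders.

/* Sketch: preallocate a per-stream /dev/shm region on a chosen NUMA node.
 * Assumes libnuma (link with -lnuma); path, size, and node are placeholders. */
#include <fcntl.h>
#include <numa.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
    const size_t slice_bytes = (size_t)1024 * 1024 * 32 * 4;  /* one 1024x1024x32 float slice */
    const int node = 0;                                       /* NUMA node of the NIC/receiver */

    if (numa_available() < 0) {
        fprintf(stderr, "no NUMA support\n");
        return 1;
    }

    int fd = open("/dev/shm/stream00", O_CREAT | O_RDWR, 0600);
    ftruncate(fd, (off_t)slice_bytes);
    void *buf = mmap(NULL, slice_bytes, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (buf == MAP_FAILED)
        return 1;

    /* Ask the kernel to place these pages on the receiver's NUMA node,
     * then touch every page so they are actually allocated there now. */
    numa_tonode_memory(buf, slice_bytes, node);
    memset(buf, 0, slice_bytes);

    /* ... hand the region to the receiver and render processes ... */
    munmap(buf, slice_bytes);
    close(fd);
    return 0;
}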
4.6 Network Performance Results
We streamed 2.3TB of data from NERSC to the SC11 show floor in Seattle in ≈ 3.4 minutes during live demonstrations at SC11. Each timestep, corresponding to 16GB of data, took ≈ 1.4 seconds to reach the rendering hosts at our SC11 booth. The volume rendering took an additional ≈ 2.5 seconds before the image was updated. Aggregating across the 90Gbps and 10Gbps demonstrations, we were able to achieve a peak bandwidth utilization of ≈ 99Gbps. We observed an average performance of ≈ 85Gbps during various time periods at SC11. The bandwidth utilization information was obtained directly from the 100Gbps port statistics on the Alcatel-Lucent router in the LBL booth.
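As a back-of-the-envelope check on these figures (our own arithmetic, not reported in the demonstration), the streamed volume is consistent with the 90Gbps share of the link:

\[
\frac{2.3 \times 10^{12}\,\text{bytes} \times 8\,\text{bits/byte}}{3.4 \times 60\,\text{s}} \approx 9.0 \times 10^{10}\,\text{bits/s} \approx 90\,\text{Gbps},
\qquad
\frac{16 \times 10^{9}\,\text{bytes} \times 8}{1.4\,\text{s}} \approx 91\,\text{Gbps per timestep}.
\]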
5. CLIMATE DATA OVER 100Gbps
High-bandwidth connections help increase the throughput of scientific applications, opening up new opportunities for sharing data that were simply not possible with 10Gbps networks. However, increasing the network bandwidth is not sufficient by itself. Next-generation high-bandwidth networks need to be evaluated carefully from the applications' perspectives. In this section, we explore how climate applications can adapt and benefit from next generation high-bandwidth networks.
Data volume in climate applications is increasing exponentially. For example, the recent "Replica Core Archive" data from the IPCC Fifth Assessment Report (AR5) is expected to be around 2PB [9], whereas the IPCC Fourth Assessment Report (AR4) data archive is only 35TB. This trend can be seen across many areas in science [2, 8]. An important challenge in managing ever increasing data sizes in climate science is the large variance in file sizes [3, 20, 10]. Climate simulation data consists of a mix of relatively small and large files with irregular file size distribution in each dataset. This requires advanced middleware tools to move data efficiently in long-distance high-bandwidth networks. We claim that with such tools, data can be treated as a first-class citizen for the entire spectrum of file sizes, without compromising on optimum usage of network bandwidth.
To justify this claim, we present our experience from the SC11 ANI demonstration, titled 'Scaling the Earth System Grid to 100Gbps Networks'. We used a 100Gbps link connecting the National Energy Research Scientific Computing Center (NERSC), Argonne National Laboratory (ANL) and Oak Ridge National Laboratory (ORNL). For this demonstration, we developed a new data streaming tool that provides dynamic data channel management and on-the-fly data pipelines for fast and efficient data access.
The data from the IPCC Fourth Assessment Report (AR4) phase 3, CMIP-3, with a total size of 35TB, was used in our tests and demonstrations. In general, it took approximately 30 minutes to move the CMIP-3 data over 100Gbps. This would have taken around 5 hours over a 10Gbps network, which is the expected 10-times gain in data transfer performance. In the demo, CMIP-3 data was staged successfully into the memory of computing nodes across the country at ANL and ORNL from NERSC data storage over the 100Gbps network on demand.

5.1 Motivation
Climate data is one of the fastest growing scientific data sets. Simulation results are accessed by thousands of users around the world. Many institutions collaborate on the generation and analysis of simulation data. The Earth System Grid Federation (ESGF, http://esgf.org/) [9, 8] provides the necessary middleware and software to support end-user data access and data replication between partner institutions. High performance data movement between ESG data nodes is an
important challenge, especially between geographically separated data centers.
In this study, we evaluate the movement of bulk data from ESG data nodes, and state the necessary steps to scale up climate data movement to 100Gbps high-bandwidth networks. As a real-world example, we specifically focus on data access and data distribution for the Coupled Model Intercomparison Project (CMIP) from the Intergovernmental Panel on Climate Change (IPCC).
IPCC climate data is stored in common NetCDF data files. Metadata from each file, including the model, type of experiment, and the institution that generated the data file, are retrieved and stored when data is published. Data publication is accomplished through an Earth System Grid (ESG) gateway server. Gateways work in a federated manner such that the metadata database is synchronized between each gateway. The ESG system provides an easy-to-use interface to search and locate data files according to given search patterns. Data files are transferred from a remote repository using advanced data transfer tools (e.g., GridFTP [1, 7, 20]) that are optimized for fast data movement. A common use-case is replication of data to achieve redundancy. In addition to replication, data files are copied into temporary storage in HPC centers for post-processing and further climate analysis.
Depending on the characteristics of the experiments and simulations, files may have small sizes such as several hundreds of megabytes, or they can be as large as several gigabytes [9]. IPCC data files are organized in a hierarchical directory structure. Directories are arranged according to experiments, metadata characteristics, organization lists, and simulation models. In addition to having many small files, bulk climate data consists of many directories. This puts an extra burden on filesystem access and network transfer protocols. An important challenge in dealing with climate data movement is the lots-of-small-files problem [14, 22, 7]. Most of the end-to-end data transfer tools are designed for moving large data files. State-of-the-art data movement tools require managing each file movement separately. Therefore, dealing with small files imposes extra bookkeeping overhead, especially over high latency networks.
The Globus Project also recognized the performance issues with small files, and added a number of features to their GridFTP tool to address these [7]. This includes an option to transfer multiple files concurrently (-concurrency), and an option to do pipelining (-pipeline). They also have the -fast option, which reuses the data channel operations. Other similar parallel data mover tools include FDT [17] from Caltech and bbcp from SLAC [12].

5.2 Climate Data Distribution over 100Gbps
Scientific applications for climate analysis are highly data-intensive [5, 2, 9, 8]. A common approach is to stage data sets into local storage, and then run climate applications on the local data files. However, replication comes with its storage cost and requires a management system for coordination and synchronization. 100Gbps networks provide the bandwidth needed to bring large amounts of data quickly on-demand. Creating a local replica beforehand may no longer be necessary. By providing data streaming from remote storage to the compute center where the application runs, we can better utilize the available network capacity and bring data into the application in real-time. If we can keep the network pipe full by feeding enough data into the network, we can hide the effect of network latency and improve the overall application performance. Since we will have high-bandwidth access to the data, management and bookkeeping of data blocks play an important role in using remote storage resources efficiently over the network.
The standard file transfer protocol FTP establishes two network channels [18, 22]. The control channel is used for authentication, authorization, and sending control messages such as what file is to be transferred. The data channel is used for streaming the data to the remote site. In the standard FTP implementation, a separate data channel is established for every file. First, the file request is sent over the control channel, and a data channel is established for streaming the file data. Once the transfer is completed, a control message is sent to notify that the end of file is reached. Once acknowledgement for transfer completion is received, another file transfer can be requested. This adds at least three additional round-trip-times over the control channel [7, 22]. The data channel stays idle while waiting for the next transfer command to be issued. In addition, establishing a new data channel for each file increases the latency between each file transfer. The latency between transfers adds up; as a result, the overall transfer time increases and total throughput decreases. This problem becomes more drastic for long distance connections where the round-trip-time is high.
Keeping the data channel idle also adversely affects the overall performance for window-based protocols such as TCP. The TCP protocol automatically adjusts the window size; the slow-start algorithm increases the window size gradually. When the amount of data sent is small, transfers may not be long enough to allow TCP to fully open its window, so we cannot move data at full speed.
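To put this per-file overhead in perspective (our own illustrative estimate, using the ~50 ms NERSC-ALCF round-trip time reported in Section 3, not a measurement from the demo):

\[
3\ \text{RTTs} \times 50\,\text{ms} \approx 150\,\text{ms of idle control-channel time per file},
\]

so a hypothetical dataset of $10{,}000$ small files would spend roughly $10{,}000 \times 0.15\,\text{s} \approx 25$ minutes waiting on control messages alone, independent of the available bandwidth.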
On the other hand, data movement requests, both for bulk data replication and data streaming for large-scale data analysis, deal with a set of many files. Instead of moving data from a single file at a time, the data movement middleware could handle the entire data collection. Therefore, we have developed a simple data movement utility, called the Climate Data Mover, that provides dynamic data channel management and block-based data movement. Figure 8 shows the underlying system architecture. Data files are aggregated and divided into simple data blocks. Blocks are tagged and streamed over the network. Each data block's tag includes information about the content inside. For example, regular file transfers can be accomplished by adding the file name and index in the tag header. Since there is no need to keep a separate control channel, the tool is not affected by file sizes and small data requests. The Climate Data Mover can be used both for disk-to-disk data replication and for direct data streaming into climate applications.

Figure 8: Climate Data Mover Framework

5.3 Climate Data Mover
Data movement occurs in two steps. First, data blocks are read into memory buffers (disk I/O). Then memory buffers are transmitted over the network (network I/O). Each step requires CPU and memory resources. A common approach to increase overall throughput is to use parallel streams, so that multiple threads (and CPU cores) work simultaneously to overcome the latency cost generated by disk and memory copy operations in the end system. Another approach is to use concurrent transfers, where multiple transfer tasks cooperate to generate high throughput in order to fill the network pipe [24, 4]. In standard file transfer mechanisms, we need more parallelism to overcome the cost of bookkeeping and control messages. An important drawback of application level tuning (parallel streams and concurrent transfers) is that it causes extra load on the system and resources are not used efficiently. Moreover, the use of many TCP streams may oversubscribe the network and cause performance degradations.
In order to optimally tune the data movement through the system, we decoupled network and disk I/O operations. Transmitting data over the network is logically separated from the reading/writing of data blocks. Hence, we are able to have different parallelism levels in each layer. Our data streaming utility, the Climate Data Mover, uses a simple network library consisting of two layers: a front-end and a back-end. Each layer works independently so that we can measure performance and tune each layer separately. The layers are tied to each other with a block-based virtual object, implemented as a set of shared memory blocks. In the server, the front-end is responsible for the preparation of data, and the back-end is responsible for sending data over the network. On the client side, the back-end components receive data blocks and feed the virtual object, so the corresponding front-end can get and process data blocks.
The front-end component requests a contiguous set of memory blocks from the virtual object. Once they are filled with data, those blocks are released, so that the back-end components can retrieve and transmit the blocks over the network. Data blocks in the virtual object include content information, i.e., file id, offset and size. Therefore, there is no need for further communication between client and server in order to initiate file transfers. This is similar to having an on-the-fly 'tar' approach bundling and sending many files together. Moreover, by using our tool, data blocks can be received and sent out-of-order and asynchronously. Figure 9 shows the client/server architecture for data movement over the network. Since we do not use a control channel for bookkeeping, all communication is mainly over a single data channel, over a fixed port. Bookkeeping information is embedded inside each block. This has some benefits for ease of firewall traversal over wide-area networks [15].

Figure 9: Climate Data Mover Server/Client Architecture
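A minimal sketch of what such a self-describing block might look like is shown below. The paper does not give the exact header format of the Climate Data Mover, so the layout here (file id, offset, payload length, plus a fixed-size payload) is an assumption for illustration; only the 4MB block size is taken from the test description later in this section.

/* Sketch of a self-describing data block for a block-based mover.
 * The 4MB block size mirrors the text; the header layout is assumed. */
#include <fcntl.h>
#include <stdint.h>
#include <string.h>
#include <unistd.h>

#define BLOCK_PAYLOAD (4 * 1024 * 1024)   /* 4MB, matching the GPFS block size */

struct block_tag {
    uint64_t file_id;     /* which file this block belongs to            */
    uint64_t offset;      /* byte offset of the payload within that file */
    uint32_t length;      /* number of valid payload bytes               */
};

struct data_block {
    struct block_tag tag;
    uint8_t payload[BLOCK_PAYLOAD];
};

/* Client-side disk-to-disk handler: because each block carries its own
 * bookkeeping, blocks can arrive out of order and asynchronously and
 * still land at the right place in the output file. */
static ssize_t write_block(int out_fd, const struct data_block *b)
{
    return pwrite(out_fd, b->payload, b->tag.length, (off_t)b->tag.offset);
}

int main(void)
{
    static struct data_block b;           /* static: too large for the stack */
    b.tag.file_id = 7;
    b.tag.offset  = 0;
    b.tag.length  = BLOCK_PAYLOAD;
    memset(b.payload, 0xAB, BLOCK_PAYLOAD);

    int fd = open("/tmp/cdm_demo.out", O_CREAT | O_WRONLY, 0600);
    write_block(fd, &b);
    close(fd);
    return 0;
}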
mitted over the network (network I/O). Each step requires CPU In our test case, we transfer data files from the NERSC GPFS
and memory resources. A common approach to increase over- filesystem into the memory of ALCF and OLCF nodes. The Cli-
all throughput is to use parallel streams, so that multiple threads mate Data Mover server initiates multiple front-end and back-end
(and CPU cores) work simultaneously to overcome the latency cost threads. The front-end component reads data, attaches a file name
generated by disk and memory copy operation in the end system. and index information, and releases blocks to be sent to the client.
Another approach is to use concurrent transfers, where multiple The client at the remote site receives data blocks and makes them
transfer tasks cooperate together to generate high throughput data ready to be processed by the corresponding front-end threads. For a
in order to fill the network pipe [24, 4]. In standard file transfer disk-to-disk transfer, the client’s front-end can simply call file write
mechanisms, we need more parallelism to overcome the cost of operations. The virtual object also acts as a large cache. For disk
bookkeeping and control messages. An important drawback in us- to memory, the front-end keeps the data blocks and releases them
ing application level tuning (parallel streams and concurrent trans- once they are processed by the application. The main advantage
fers) is that they cause extra load on the system and resources are with this approach is that we are not limited by the characteristics
not used efficiently. Moreover, the use of many TCP streams may of the file sizes in the dataset. Another advantage over FTP-based
oversubscribe the network and cause performance degradations. tools is that we can dynamically increase/descrease the parallelism
In order to be able to optimally tune the data movement through level both in the network communication and I/O read/write oper-
the system, we decoupled network and disk I/O operations. Trans- ations, without closing and reopening the data channel connection
mitting data over the network is logically separated from the read- (as is done in regular FTP variants).
ing/writing of data blocks. Hence, we are able to have different
parallelism levels in each layer. Our data streaming utility, the Cli- 5.4 Test Results
mate Data Mover, uses a simple network library consisting of two Figure 10 represents the overall system details for the SC11
layers: a front-end and a back-end. Each layer works indepen- demo. We used 10 host pairs, each connected to the network with a
dently so that we can measure performance and tune each layer 10 Gbps link. We used TCP connection between host pairs; default
separately. Those layers are tied to each other with a block-based settings have been used so that TCP window size is not set specif-
virtual object, implemented as a set of shared memory blocks. In ically. We have tested the network performance between NERSC
the server, the front-end is responsible for the preparation of data, and ANL/ORNL with various parameters, such as, total size of vir-
and the back-end is responsible for the sending of data over the tual object, thread count for reading from GPFS filesystem, and
network. On the client side, the back-end components receive data multiple TCP streams to increase the utilization of the available
blocks and feed the virtual object, so the corresponding front-end bandwidth. According to our test results, we have manually de-
can get and process data blocks. termined best set of parameters for the setup. Specific host tuning
The front-end component requests a contiguous set of memory issues about IRQ binding and interrupt coalescing described in the
blocks from the virtual object. Once they are filled with data, those following section have not been applied, and are open to future ex-
blocks are released, so that the back-end components can retrieve plorations.
and transmit the blocks over the network. Data blocks in the vir- Our experiment moved data stored at NERSC to application
tual object include content information, i.e., file id, offset and size. nodes at both ANL and ORNL. We staged the Coupled Model
Figure 11: SC11 Climate100 demonstration results, showing the data transfer throughput

Intercomparison Project (CMIP) data set from the Intergovernmental Panel on Climate Change (IPCC) from the GPFS filesystem at NERSC. A total of 35TB was transferred in about 30 minutes. The default filesystem block size was set to 4MB, so we also used 4MB blocks in the Climate Data Mover for better read performance. Each block's data section was aligned according to the system pagesize. The total size of the virtual object was 1GB at both the client and the server applications. The servers at NERSC used eight front-end threads on each host for reading data files in parallel. The clients used four front-end threads for processing received data blocks. In the demo, four parallel TCP streams (four back-end threads) were used for each host-to-host connection. We observed 83 Gbps total throughput using both NERSC to ANL and NERSC to ORNL, as shown in Figure 11.

Figure 11: SC11 Climate100 demonstration results, showing the data transfer throughput

5.5 Performance of Climate Data Mover
The Climate Data Mover is able to handle both small and large files with high performance. Although a detailed analysis is beyond the scope of this paper, we present a brief comparison with GridFTP, evaluating the performance of the Climate Data Mover in transferring small files. Two hosts, one at NERSC and one at ANL, were used from the ANI 100Gbps Testbed. Each host is connected with four 10Gbps NICs, so the total available throughput between the two hosts is 40Gbps. We did not have access to a high performance file system; therefore, we simulated the effect of file sizes by creating a memory file system (tmpfs with a size of 20G). We created files with various sizes (i.e., 10M, 100M, 1G) and transferred those files continuously while measuring the performance. In both the Climate Data Mover and GridFTP experiments, the TCP buffer size was set to 50MB in order to get the best throughput. The pipelining feature was enabled in GridFTP. A long file list (repeating file names that are in the memory filesystem) is given as input. Figure 12 shows performance results with 10MB files. We initiated four server applications at the ANL node (each running on a separate NIC), and four client applications at the NERSC node. In the GridFTP tests, we tried both 16 and 32 concurrent streams (-cc option). The Climate Data Mover was able to achieve 37Gbps of throughput, while GridFTP was not able to achieve more than 33Gbps.

Figure 12: GridFTP vs. Climate Data Mover (CDM)

6. HOST TUNING ISSUES
Optimal utilization of the network bandwidth on modern Linux hosts requires a fair amount of tuning. There are several studies on network performance optimization in 10Gbps networks [21, 23]. However, only a few recent studies have tested high-speed data transfers in a 100Gbps environment. One of these is the Indiana University team's testing of the Lustre filesystem over the 100Gbps network at SC11 [13]. Other recent studies include presentations by Rao [19] and by the team at NASA Goddard [11]. They all found that a great deal of tuning was required. In this section we specifically discuss modifications that helped us increase the total NIC throughput. These additional host tuning tests were done after the SC11 conference on ESnet's ANI 100Gbps Testbed, shown in Figure 13.

Figure 13: ANI 100Gbps Testbed configuration used for host tuning experiments: testbed hosts at NERSC and ANL (4x10GE NICs each), together with the ANI Middleware Testbed, connect through 100G routers over the ANI 100G network (RTT = 47ms).

We conducted this series of experiments on three hosts at NERSC connected with a 100Gbps link to three hosts at ANL. After adjusting the tuning knobs, we were able to effectively fill the
100Gbps link with only 10 TCP sockets, one per 10Gbps NIC. We used four 10GE NICs in two of the hosts, and two (out of 4) 10GE NICs in the third host, as shown in Figure 13. The two applications described in this paper used some, but not all, of these techniques, as they both used more than 10 hosts on each end instead of just 3, thereby requiring fewer tuning optimizations. Using all of the tuning optimizations described below, we were able to achieve a total of 60 Gbps throughput (30 Gbps in each direction) on a single host with four 10GE NICs; 60 Gbps is the limit of the PCI bus. We were also able to achieve 94 Gbps TCP and 97 Gbps UDP throughput in one direction using just 10 sockets. UDP is CPU bound on the send host, indicating that just under 10Gbps per flow is the UDP limit of today's CPU cores regardless of NIC speed.

Figure 14: Host Tuning Results. Throughput without and with tuning, in Gbps: interrupt coalescing (TCP) 24 vs. 36.8; interrupt coalescing (UDP) 21.1 vs. 38.8; IRQ binding (TCP) 30.6 vs. 36.8; IRQ binding (UDP) 27.9 vs. 38.5.

6.1 TCP and UDP Tuning
We observed a latency of 23 ms on the path from NERSC to ANL each way, therefore the TCP window size needed to be increased to hold the entire bandwidth delay product of 64 MB, as described on fasterdata (http://fasterdata.es.net/host-tuning/background/). For UDP, the optimal packet size to use is 8972 bytes (path MTU minus the IP and UDP header). For UDP, it was also important to increase SO_SNDBUF and SO_RCVBUF to 4MB using the setsockopt system call in order to saturate a 10GE NIC. Since these tests were conducted on a dedicated network with no competing traffic or congestion, there was no reason to use multiple TCP flows per NIC; a single flow was able to fill the 10G pipe.
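A minimal sketch of this socket-level tuning follows. The 64 MB target corresponds to the bandwidth-delay product of one 10Gbps flow over the ~46 ms round trip (10^10 b/s x 0.046 s / 8 ≈ 58 MB, rounded up), and the 4MB UDP buffers match the values quoted above; everything else (interface selection, error handling) is omitted or assumed.

/* Sketch: socket buffer tuning for one 10GE flow, as described above.
 * Values follow the text (64MB TCP window target, 4MB UDP buffers);
 * whether the kernel grants them also depends on system-wide limits. */
#include <stdio.h>
#include <sys/socket.h>
#include <netinet/in.h>

int main(void)
{
    /* Bandwidth-delay product for one 10Gbps flow with a ~46ms RTT:
     * 10e9 bits/s * 0.046 s / 8 ~= 58 MB, rounded up to 64 MB. */
    int tcp_buf = 64 * 1024 * 1024;
    int udp_buf = 4 * 1024 * 1024;

    int tcp_fd = socket(AF_INET, SOCK_STREAM, 0);
    setsockopt(tcp_fd, SOL_SOCKET, SO_SNDBUF, &tcp_buf, sizeof(tcp_buf));
    setsockopt(tcp_fd, SOL_SOCKET, SO_RCVBUF, &tcp_buf, sizeof(tcp_buf));

    int udp_fd = socket(AF_INET, SOCK_DGRAM, 0);
    setsockopt(udp_fd, SOL_SOCKET, SO_SNDBUF, &udp_buf, sizeof(udp_buf));
    setsockopt(udp_fd, SOL_SOCKET, SO_RCVBUF, &udp_buf, sizeof(udp_buf));

    int got = 0; socklen_t len = sizeof(got);
    getsockopt(tcp_fd, SOL_SOCKET, SO_RCVBUF, &got, &len);
    printf("TCP receive buffer granted: %d bytes\n", got);  /* kernel may double or clamp this */
    return 0;
}

On Linux the values actually granted are capped by the net.core.rmem_max/wmem_max sysctls (and TCP autotuning is governed by net.ipv4.tcp_rmem/tcp_wmem), so those limits typically need to be raised as well.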
6.2 IRQ and Thread Binding
With modern multi-socket multi-core architectures, there is a large performance penalty if the NIC interrupts are being handled on one processor while the user read process is on a different processor, as data will need to be copied between processors. We solved this problem using the following approach. First, we disabled irqbalance. irqbalance distributes interrupts across CPUs to optimize for L2 cache hits. While this optimization is desirable in general, it does not maximize performance in this test configuration. Therefore we explicitly mapped the interrupt for each NIC to a separate processor core, as described on fasterdata (http://fasterdata.es.net/host-tuning/interrupt-binding/), and used the sched_setaffinity system call to bind the network send or receive process to that same core. For a single flow, irqbalance does slightly better than the static binding, but for four streams, irqbalance performs much worse than our static binding. This is because as the overall CPU load goes up, the IRQ distribution policy becomes less likely to provide the right binding to the ethernet interrupts, leading to the host dropping packets and TCP backing off. Figure 14 shows the results for TCP and UDP performance of 4 streams using irqbalance compared to manual IRQ binding. There is a 20 percent improvement for TCP, and 38 percent for UDP.
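The sketch below illustrates this kind of static binding: writing the NIC's interrupt affinity mask and pinning the receive process to the same core. The IRQ number, core index, and privileges are placeholders and assumptions; on a real host the IRQ numbers come from /proc/interrupts.

/* Sketch: pin a NIC's IRQ and the receiving process to the same core.
 * IRQ number and core are placeholders; look up the real IRQ for the
 * interface in /proc/interrupts. Requires root for /proc/irq writes. */
#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>

static int bind_irq_to_core(int irq, int core)
{
    char path[64];
    snprintf(path, sizeof(path), "/proc/irq/%d/smp_affinity", irq);
    FILE *f = fopen(path, "w");
    if (!f)
        return -1;
    fprintf(f, "%x\n", 1u << core);   /* CPU bitmask: core 2 -> 0x4 */
    fclose(f);
    return 0;
}

int main(void)
{
    int irq  = 77;   /* placeholder: an rx-queue IRQ of the 10GE NIC     */
    int core = 2;    /* a core on the NUMA node closest to that NIC      */

    bind_irq_to_core(irq, core);

    /* Pin this (receive) process to the same core as the NIC interrupts. */
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(core, &set);
    if (sched_setaffinity(0, sizeof(set), &set) != 0)
        perror("sched_setaffinity");

    /* ... run the receive loop here ... */
    return 0;
}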
6.3 NIC Tuning
We used two well-known techniques to reduce the number of interrupts and thereby improve performance: jumbo frames (9000 byte MTUs) and interrupt coalescing. We also increased txqueuelen to 10000. Figure 14 shows that enabling interrupt coalescing and setting it to 100 milliseconds provides a 93 percent performance gain for UDP and 53 percent for TCP.
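These settings are normally applied with ifconfig/ethtool; the sketch below shows roughly equivalent ioctl calls for the MTU and the receive-coalescing timer. The interface name is a placeholder, and the coalescing value is an interpretation of the setting reported above (the ethtool field is expressed in microseconds), so treat the exact number as an assumption.

/* Sketch: set a 9000-byte MTU and rx interrupt coalescing on a NIC via
 * ioctl, roughly equivalent to `ifconfig <if> mtu 9000` and `ethtool -C`.
 * The interface name is a placeholder; run as root. */
#include <linux/ethtool.h>
#include <linux/sockios.h>
#include <net/if.h>
#include <stdio.h>
#include <string.h>
#include <sys/ioctl.h>
#include <sys/socket.h>

int main(void)
{
    const char *ifname = "eth2";          /* placeholder 10GE interface */
    int fd = socket(AF_INET, SOCK_DGRAM, 0);

    /* Jumbo frames: 9000-byte MTU. */
    struct ifreq ifr;
    memset(&ifr, 0, sizeof(ifr));
    strncpy(ifr.ifr_name, ifname, IFNAMSIZ - 1);
    ifr.ifr_mtu = 9000;
    if (ioctl(fd, SIOCSIFMTU, &ifr) != 0)
        perror("SIOCSIFMTU");

    /* Interrupt coalescing: the text reports a 100 ms setting;
     * rx_coalesce_usecs is expressed in microseconds. */
    struct ethtool_coalesce coal;
    memset(&coal, 0, sizeof(coal));
    coal.cmd = ETHTOOL_GCOALESCE;         /* read current settings first */
    ifr.ifr_data = (char *)&coal;
    if (ioctl(fd, SIOCETHTOOL, &ifr) != 0)
        perror("ETHTOOL_GCOALESCE");

    coal.cmd = ETHTOOL_SCOALESCE;
    coal.rx_coalesce_usecs = 100 * 1000;
    if (ioctl(fd, SIOCETHTOOL, &ifr) != 0)
        perror("ETHTOOL_SCOALESCE");
    return 0;
}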
6.4 BIOS Tuning
It is important to verify that a number of BIOS settings are properly set to obtain maximum performance. We modified the following settings in the course of our experiments:

• hyper-threading: We disabled hyper-threading. While it simulates more cores than are physically present, this can reduce performance under variable load conditions.
• memory speed: The default BIOS setting did not set the memory bus speed to the maximum.
• cpuspeed: The default BIOS setting did not set the CPU speed to the maximum.
• energy saving: We disabled this to ensure the CPU was always running at the maximum speed.

6.5 Open Issues
This paper has shown that one needs to perform a lot of hand-tuning in order to saturate a 100Gbps network. We need to optimize CPU utilization in order to achieve higher networking performance. The Linux operating system provides a service, irqbalance, but as this paper has shown, its generic algorithm fails under certain workloads. Statically binding IRQs and send/receive threads to a dedicated core is a simple solution that works well in a well-defined, predictable environment, but quickly becomes difficult to manage in a production context where many different applications may have to be deployed.
PCI Gen3 motherboards are just becoming available, allowing up to 8Gbps per lane. These motherboards will allow a single NIC to theoretically go up to 64Gbps. In this environment more parallel flows will be needed, as current CPU speed limits UDP flows to around 10Gbps, and TCP flows to around 18Gbps. This problem will be further exacerbated when PCIe Gen4, slated for 2015, further increases PCI bandwidth.
Last but not least, the various experiments reported in this paper were run on a closed network with no competing traffic. More thorough testing will be needed to identify and address issues in a production network.

7. CONCLUSION
100Gbps networks have arrived, and with careful application design and host tuning, a relatively small number of hosts can fill a 100Gbps pipe. Many of the host tuning techniques from the 1Gbps to 10Gbps transition still apply. These include TCP/UDP buffer tuning, using jumbo frames, and using interrupt coalescing. With the current generation of multi-core systems, IRQ binding is now also essential for maximizing host performance.
While application of these tuning techniques will likely improve the overall throughput of the system, it is important to follow an experimental methodology in order to systematically increase performance. In some cases a particular tuning strategy may not achieve the expected results. We recommend starting with the simplest core I/O operation possible, and then adding layers of complexity on top of that. For example, one can first tune the system for a single network flow using a simple memory-to-memory transfer tool such as Iperf (http://iperf.sourceforge.net/) or nuttcp (http://www.nuttcp.net). Next, optimize multiple concurrent streams, trying to model the application behavior as closely as possible. Once this is done, the final step is to tune the application itself. The main goal of this methodology is to verify performance for each component in the critical path. Applications may need to be redesigned to get the best out of high-bandwidth networks. In this paper, we demonstrated two such applications that take advantage of this network, and described a number of application design issues and host tuning strategies necessary for enabling those applications to scale to 100Gbps.

Acknowledgments
We would like to thank Peter Nugent and Zarija Lukic for providing us with the cosmology datasets used in the visualization demo. Our thanks go out to Patrick Dorn, Evangelos Chaniotakis, John Christman, Chin Guok, Chris Tracy and Lauren Rotman for assistance with 100Gbps installation and testing at SC11. Jason Lee, Shane Canon, Tina Declerck and Cary Whitney provided technical support with NERSC hardware. Ed Holohan, Adam Scovel, and Linda Winkler provided support at ALCF. Jason Hill, Doug Fuller, and Susan Hicks provided support at OLCF. Hank Childs, Mark Howison and Aaron Thomas assisted with troubleshooting visualization software and hardware.
This work was supported by the Director, Office of Science, Office of Basic Energy Sciences, of the U.S. Department of Energy under Contract No. DE-AC02-05CH11231. This research used resources of the ESnet Advanced Network Initiative (ANI) Testbed, which is supported by the Office of Science of the U.S. Department of Energy under the contract above, funded through The American Recovery and Reinvestment Act of 2009.

References
[1] W. Allcock, J. Bresnahan, R. Kettimuthu, M. Link, C. Dumitrescu, I. Raicu, and I. Foster. The Globus striped GridFTP framework and server. In Proceedings of the 2005 ACM/IEEE Conference on Supercomputing, SC '05, pages 54–, Washington, DC, USA, 2005. IEEE Computer Society.
[2] M. Balman and S. Byna. Open problems in network-aware data management in exa-scale computing and terabit networking era. In Proceedings of the First International Workshop on Network-Aware Data Management, NDM '11, pages 73–78, 2011.
[3] M. Balman and T. Kosar. Data scheduling for large scale distributed applications. In Proceedings of the 5th ICEIS Doctoral Consortium, in conjunction with the International Conference on Enterprise Information Systems (ICEIS'07), 2007.
[4] M. Balman and T. Kosar. Dynamic adaptation of parallelism level in data transfer scheduling. Complex, Intelligent and Software Intensive Systems, International Conference, 0:872–877, 2009.
[5] BES Science Network Requirements, Report of the Basic Energy Sciences Network Requirements Workshop. Basic Energy Sciences Program Office, DOE Office of Science and the Energy Sciences Network, 2007.
[6] E. W. Bethel. Visapult – A Prototype Remote and Distributed Application and Framework. In Proceedings of Siggraph 2000 – Applications and Sketches. ACM/Siggraph, July 2000.
[7] J. Bresnahan, M. Link, R. Kettimuthu, D. Fraser, and I. Foster. GridFTP pipelining. In Proceedings of the 2007 TeraGrid Conference, June 2007.
[8] D. N. Williams et al. Data Management and Analysis for the Earth System Grid. Journal of Physics: Conference Series, SciDAC 08 conference proceedings, volume 125, 012072, 2008.
[9] D. N. Williams et al. Earth System Grid Federation: Infrastructure to Support Climate Science Analysis as an International Collaboration. Data Intensive Science Series: Chapman & Hall/CRC Computational Science, ISBN 9781439881392, 2012.
[10] S. Doraimani and A. Iamnitchi. File grouping for scientific data management: lessons from experimenting with real traces. In Proceedings of the 17th International Symposium on High Performance Distributed Computing, HPDC '08, pages 153–164, 2008.
[11] P. Gary, B. Find, and P. Lang. Introduction to GSFC High End Computing: 20, 40 and 100 Gbps Network Testbeds. http://science.gsfc.nasa.gov/606.1/docs/HECN_10G_Testbeds_082210.pdf, 2010.
[12] A. Hanushevsky, A. Trunov, and L. Cottrell. Peer-to-peer computing for secure high performance data copying. In Proceedings of Computing in High Energy and Nuclear Physics, September 2001.
[13] IU showcases innovative approach to networking at SC11 SCinet Research Sandbox. http://ovpitnews.iu.edu/news/page/normal/20445.html, 2011.
[14] R. Kettimuthu, S. Link, J. Bresnahan, M. Link, and I. Foster. Globus XIO pipe open driver: enabling GridFTP to leverage standard Unix tools. In Proceedings of the 2011 TeraGrid Conference: Extreme Digital Discovery, TG '11, pages 20:1–20:7. ACM, 2011.
[15] R. Kettimuthu, R. Schuler, D. Keator, M. Feller, D. Wei, M. Link, J. Bresnahan, L. Liming, J. Ames, A. Chervenak, I. Foster, and C. Kesselman. A Data Management Framework for Distributed Biomedical Research Environments. In e-Science Workshops, 2010 Sixth IEEE International Conference on, 2011.
[16] Magellan Report on Cloud Computing for Science. http://science.energy.gov/~/media/ascr/pdf/program-documents/docs/Magellan_Final_Report.pdf, 2011.
[17] Z. Maxa, B. Ahmed, D. Kcira, I. Legrand, A. Mughal, M. Thomas, and R. Voicu. Powering physics data transfers with FDT. Journal of Physics: Conference Series, 331(5):052014, 2011.
[18] J. Postel and J. Reynolds. File Transfer Protocol. RFC 959 (Standard), Oct. 1985. Updated by RFCs 2228, 2640, 2773, 3659, 5797.
[19] N. Rao and S. Poole. DOE UltraScience Net: High-Performance Experimental Network Research Testbed. http://computing.ornl.gov/SC11/documents/Rao_UltraSciNet_SC11.pdf, 2011.
[20] A. Sim, M. Balman, D. Williams, A. Shoshani, and V. Natarajan. Adaptive transfer adjustment in efficient bulk data transfer management for climate dataset. In Parallel and Distributed Computing and Systems, 2010.
[21] T. Yoshino et al. Performance Optimization of TCP/IP over 10 Gigabit Ethernet by Precise Instrumentation. In Proceedings of the ACM/IEEE Conference on Supercomputing, 2008.
[22] D. Thain and C. Moretti. Efficient access to many small files in a filesystem for grid computing. In Proceedings of the 8th IEEE/ACM International Conference on Grid Computing, GRID '07, pages 243–250, Washington, DC, USA, 2007. IEEE Computer Society.
[23] W. Wu, P. Demar, and M. Crawford. Sorting reordered packets with interrupt coalescing. Computer Networks, 53:2646–2662, October 2009.
[24] E. Yildirim, M. Balman, and T. Kosar. Dynamically tuning level of parallelism in wide area data transfers. In Proceedings of the 2008 International Workshop on Data-Aware Distributed Computing, DADC '08, pages 39–48. ACM, 2008.
