We report our experiences porting Spark to large production HPC systems. While Spark performance in a data center installation (with local disks) is dominated by the network, our results show that file system metadata access latency can dominate in an HPC installation using Lustre: it makes single-node performance up to 4× slower than on a typical workstation. We evaluate a combination of software techniques and hardware configurations designed to address this problem. For example, on the software side we develop a file pooling layer that improves per-node performance by up to 2.8×. On the hardware side we evaluate a system with a large NVRAM buffer between the compute nodes and the backend Lustre file system: this improves scaling at the expense of per-node performance. Overall, our results indicate that scalability is currently limited to O(10^2) cores in an HPC installation with Lustre and default Spark. After careful configuration combined with our file pooling layer, we can scale up to O(10^4) cores. As our analysis indicates, much higher scalability should be feasible in the near future.
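To make the file pooling idea concrete, the sketch below is illustrative only: the names FilePool, acquire, readAt, and closeAll are our own and do not correspond to Spark's API or to the layer described in the paper. It shows one way a pooling layer can cache open file handles so that repeated accesses to the same file skip the open()/close() calls that each incur a metadata round trip on Lustre.

```scala
import java.nio.ByteBuffer
import java.nio.channels.FileChannel
import java.nio.file.{Paths, StandardOpenOption}
import scala.collection.mutable

// Hypothetical file pooling layer: caches one open channel per path so
// subsequent reads reuse the handle instead of reopening the file.
object FilePool {
  private val pool = mutable.Map.empty[String, FileChannel]

  // Return an open channel for `path`, opening it only on first use; later
  // accesses avoid the open()/close() metadata operations on Lustre.
  def acquire(path: String): FileChannel = synchronized {
    pool.getOrElseUpdate(path,
      FileChannel.open(Paths.get(path), StandardOpenOption.READ))
  }

  // Positional read: does not move the channel's shared position, so many
  // tasks can safely reuse the same pooled handle concurrently.
  def readAt(path: String, offset: Long, length: Int): Array[Byte] = {
    val buf = ByteBuffer.allocate(length)
    val n = acquire(path).read(buf, offset)
    java.util.Arrays.copyOf(buf.array(), math.max(n, 0))
  }

  // Close and evict every pooled handle (e.g. at executor shutdown).
  def closeAll(): Unit = synchronized {
    pool.values.foreach(_.close())
    pool.clear()
  }
}
```

Using positional reads keeps the pooled handle free of per-caller state, which is what allows a single open descriptor to be shared across many tasks rather than reopened for each access.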