*Result*: Julia versus C++ Kokkos for performance portable Cartesian CFD solvers on heterogeneous architectures.
*Further Information*
*Aiming at high-performance hydrocode simulations on heterogeneous architectures, we detail a performance portable implementation of a second-order accurate 2-D Cartesian explicit CFD solver using Julia's Just-in-Time (JIT) compilation. In this work, a custom abstraction layer targets two Julia packages: Polyester.jl for efficient shared-memory multithreading on CPUs and KernelAbstractions.jl for the appropriate backends on GPUs. Using the very same optimizations and data structures as those used with Julia, comparisons to static C++ Kokkos compilation are then provided, including speedups and energy consumption on high-end CPUs and GPUs available mid-2022. Using a single 64-core CPU with a few million cells to benefit from cache effects in multithread mode, the Julia code (≈0.5 × 10^9 cell-cycles/s) is superior to its C++ Kokkos counterpart, with the very same lower limit (≈0.16 × 10^9 cell-cycles/s) for higher numbers of cells. Using one GPU, the C++ Kokkos implementation is slightly superior, the Julia implementation tending to the same upper limit (≈1.5 × 10^9 cell-cycles/s) when the GPU memory (40 GiB) is entirely used. With a small number of floating-point operations per cell and time step, Cartesian solvers are singular in the CFD landscape, being essentially memory-bandwidth bound on both CPUs and GPUs. In this context, at the compute node level, the compute capability of the CPU(s) should not be underestimated, with (much) more memory available per cell for multi-physics variables and, year over year, improved memory bandwidths, larger caches and higher floating-point capabilities. Indeed, for high performance computing (HPC) simulations involving many MPI processes, communications between compute nodes become significant and best efforts are required to overlap communications with computations.
The performance portable Julia implementation of the CFD solver presented here combines domain decomposition and directional splitting using a static scheduling approach. Benefits from asynchronous communications appear with 16 GPUs on 4 nodes. At best, on this small configuration and with the GPUs' memory fully used, the GPU mode of the performance portable Julia code delivers a factor of 14× in performance and a factor of 8× in device energy efficiency compared to the CPU mode. Such work, among others, confirms the potential of the Julia programming language and its emerging HPC software stack, offering (i) the power of a scripting language, (ii) the performance of a compiled language, and perhaps even more importantly (iii) access to a compilation toolchain with new opportunities for developers to tackle heterogeneous computing architectures. [ABSTRACT FROM AUTHOR]
Copyright of International Journal of High Performance Computing Applications is the property of Sage Publications Inc.*
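
The throughput figures quoted in the abstract can be compared with a little arithmetic. The sketch below is illustrative only and is not the authors' code: the function names, the 1000 × 1000 grid and the 1000 cycles are hypothetical, and the power ratio assumes that the 14× performance and 8× energy-efficiency factors refer to the same run.

```python
# Throughput figures quoted in the abstract, in cell-cycles per second.
CPU_IN_CACHE = 0.5e9      # Julia, one 64-core CPU, a few million cells
CPU_LARGE_GRID = 0.16e9   # lower limit reached at higher numbers of cells
GPU_FULL_MEMORY = 1.5e9   # upper limit on one GPU with its 40 GiB in use


def cell_cycles_per_second(n_cells, n_cycles, wall_seconds):
    """Throughput metric: cell updates per second of wall-clock time."""
    return n_cells * n_cycles / wall_seconds


def seconds_for_run(n_cells, n_cycles, throughput):
    """Wall-clock time implied by a given throughput."""
    return n_cells * n_cycles / throughput


# Ratios implied by the quoted figures.
cache_gain = CPU_IN_CACHE / CPU_LARGE_GRID      # in-cache vs. large grid on CPU
gpu_vs_cpu = GPU_FULL_MEMORY / CPU_LARGE_GRID   # one GPU vs. one CPU, large grids

# Multi-node figures: 14x performance and 8x device energy efficiency.
# Energy efficiency is work per joule, so power ratio = perf ratio / efficiency ratio.
# Assumption: both factors were measured on the same configuration.
power_ratio = 14 / 8

# Hypothetical run: 1000 x 1000 cells advanced for 1000 cycles.
hypothetical_seconds = seconds_for_run(1000 * 1000, 1000, CPU_IN_CACHE)

if __name__ == "__main__":
    print(f"cache gain on CPU:       {cache_gain:.2f}x")
    print(f"GPU vs CPU (large grid): {gpu_vs_cpu:.2f}x")
    print(f"GPU/CPU power ratio:     {power_ratio:.2f}x")
    print(f"hypothetical run time:   {hypothetical_seconds:.1f} s")
```

Under these assumptions, the quoted factors would imply that the GPU mode draws about 1.75 times the device power of the CPU mode while finishing 14 times sooner, which is where the 8× energy-efficiency gain comes from.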