Analyzing billion-objects catalog interactively: Apache Spark for physicists

9 Jul 2018  ·  S. Plaszczynski, J. Peloton, C. Arnault, J. E. Campagne ·

Apache Spark is a big-data framework for processing large distributed datasets. Although widely used in industry, it remains little known in the scientific community, or is often restricted to software engineers. The goal of this paper is to introduce the framework to newcomers and to show that the technology is mature enough to be used, without excessive programming skills, by physicists such as astronomers or cosmologists to perform analyses over large datasets, such as those originating from future galactic surveys. To demonstrate this, we start from a realistic simulation corresponding to 10 years of LSST data taking (6 billion galaxies). We then design, optimize and benchmark a set of Spark Python algorithms that perform standard operations such as adding photometric redshift errors, measuring the selection function or computing power spectra over tomographic bins. Most of the commands execute on the full 110 GB dataset within tens of seconds and can therefore be run interactively in order to design full-scale cosmological analyses.


Categories

Instrumentation and Methods for Astrophysics
Cosmology and Nongalactic Astrophysics