Dataset of 50,000 top quark-antiquark (ttbar) events produced in proton-proton collisions at 14 TeV, overlaid with minimum-bias events corresponding to an average pileup of 200. The dataset consists of detector hits as the input, generator particles as the ground truth, and reconstructed particles from DELPHES for additional validation. The DELPHES model corresponds to a CMS-like detector with a multi-layered charged-particle tracker and electromagnetic and hadron calorimeters. Pythia8 and Delphes3 were used for the simulation.
4 PAPERS • NO BENCHMARKS YET
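For orientation, here is a minimal loading sketch for the ttbar dataset above. The file name and the keys "X", "ygen", and "ycand" are assumptions about the layout, not the documented schema:

```python
# Hypothetical layout: a pickled list of events, each a dict with detector-hit
# features ("X"), generator particles ("ygen"), and DELPHES-reconstructed
# particles ("ycand"). Check the dataset documentation for the real schema.
import pickle

with open("tev14_pythia8_ttbar_0.pkl", "rb") as f:  # hypothetical file name
    events = pickle.load(f)

for event in events[:5]:
    hits = event["X"]      # input: per-hit feature matrix
    gen = event["ygen"]    # ground truth: generator-level particles
    reco = event["ycand"]  # validation: DELPHES reconstructed particles
    print(hits.shape, gen.shape, reco.shape)
```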
Dataset of high-pT jets from simulations of LHC proton-proton collisions.
CMD is a publicly available collection of hundreds of thousands of 2D maps and 3D grids containing different properties of the gas, dark matter, and stars from more than 2,000 different universes. The data has been generated from thousands of state-of-the-art (magneto-)hydrodynamic and gravity-only N-body simulations from the CAMELS project.
3 PAPERS • NO BENCHMARKS YET
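As a quick illustration of working with CMD, the sketch below loads and displays one 2D map, assuming the maps for a given field are shipped as a single numpy array; the file name mimics CMD's naming convention but should be treated as a placeholder:

```python
# Minimal sketch for inspecting one CMD 2D-map file, assuming maps come as a
# numpy array of shape (n_maps, height, width); file name is hypothetical.
import numpy as np
import matplotlib.pyplot as plt

maps = np.load("Maps_Mgas_IllustrisTNG_LH_z=0.00.npy")  # placeholder file
print(maps.shape)

# Gas-density-like fields span many orders of magnitude, so plot in log10.
plt.imshow(np.log10(maps[0]), cmap="viridis")
plt.colorbar(label="log10 field value")
plt.show()
```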
JetClass is a new large-scale dataset to facilitate deep learning research in particle physics. It consists of 100M particle jets for training, 5M for validation and 20M for testing. The dataset contains 10 classes of jets, simulated with MadGraph + Pythia + Delphes. A detailed description of the JetClass dataset is presented in the paper Particle Transformer for Jet Tagging. An interface to use the dataset is provided here.
2 PAPERS • 1 BENCHMARK
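Since JetClass files are produced with a standard simulation chain, they can be read with generic HEP tooling. Below is a hedged sketch using uproot; the tree name "tree" and the branch names are assumptions based on typical layouts, so consult the official interface for the actual ones:

```python
# Read one JetClass ROOT file; tree/branch names here are assumed, not
# taken from the official documentation.
import uproot

tree = uproot.open("JetClass_example.root")["tree"]  # hypothetical file
arrays = tree.arrays(
    ["part_px", "part_py", "part_pz", "part_energy", "label_QCD"],
    library="ak",
)

print(len(arrays))           # number of jets in the file
print(arrays["part_px"][0])  # variable-length particle list of jet 0
```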
Multirotor gym environment for learning control policies for various unmanned aerial vehicles.
2 PAPERS • NO BENCHMARKS YET
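A control policy would interact with such an environment through the standard Gym loop. The sketch below uses the classic Gym API with a hypothetical environment id, as the actual registered id depends on the package:

```python
# Standard Gym interaction loop; "Multirotor-v0" is a placeholder id and the
# 4-tuple step API assumes the classic (pre-Gymnasium) Gym interface.
import gym

env = gym.make("Multirotor-v0")
obs = env.reset()
for _ in range(1000):
    action = env.action_space.sample()  # random policy as a stand-in
    obs, reward, done, info = env.step(action)
    if done:
        obs = env.reset()
env.close()
```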
Contains data of parametric PDEs
RL Unplugged is a suite of benchmarks for offline reinforcement learning, designed with ease of use in mind: the datasets are exposed through a unified API, which makes it easy for a practitioner to work with all data in the suite once a general pipeline has been established. This dataset accompanies the paper RL Unplugged: Benchmarks for Offline Reinforcement Learning.
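The "one pipeline for all datasets" idea looks roughly like the following tf.data sketch: once a parser for the record format is written, the same pipeline serves every dataset in the suite. The feature spec and shard name below are simplified assumptions, not the actual RL Unplugged schema:

```python
# Generic offline-RL data pipeline over TFRecord shards; the feature spec is
# illustrative only and does not match the real RL Unplugged record format.
import tensorflow as tf

feature_spec = {
    "observation": tf.io.FixedLenFeature([84 * 84], tf.float32),
    "action": tf.io.FixedLenFeature([], tf.int64),
    "reward": tf.io.FixedLenFeature([], tf.float32),
}

def parse(record):
    return tf.io.parse_single_example(record, feature_spec)

dataset = (
    tf.data.TFRecordDataset(["shard-00000-of-00100"])  # hypothetical shard
    .map(parse)
    .batch(256)
    .prefetch(tf.data.AUTOTUNE)
)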
The dataset consists of many runs of the same quantum circuit on different IBM quantum machines. We used 9 different machines and, for each of them, ran 2,000 executions of the circuit. The circuit has 9 different measurement steps along it. To obtain the 9 outcome distributions, for each execution, parts of the circuit are appended 9 times (in the same call to the IBM API, and thus in the shortest possible time), measuring a new step each time. The calls to the IBM API followed two different strategies. One was adopted to maximize the number of calls to the interface: the code was parallelized with as many simultaneous runs as possible, even running 8,000 shots per run and then splitting the memory into 8 blocks of 1,000 shots to obtain the probabilities. The other strategy was slower, with no parallelization and a minimum waiting time between subsequent executions; it was adopted to obtain executions more uniformly distributed in time.
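The measurement protocol can be sketched in Qiskit as follows: for each of the 9 steps, the circuit truncated after that step is measured, and all 9 variants are submitted in a single job. The per-step gates below are placeholders, not the actual circuit used for the dataset:

```python
# Build 9 circuit variants, each measuring one more step, to be submitted in
# a single backend call. The rx gates stand in for the real circuit's steps.
from qiskit import QuantumCircuit, transpile

def step(k):
    qc = QuantumCircuit(1, name=f"step{k}")
    qc.rx(0.1 * (k + 1), 0)  # placeholder gate for step k
    return qc

circuits = []
for k in range(9):
    qc = QuantumCircuit(1, 1)
    for j in range(k + 1):  # circuit truncated after step k
        qc.compose(step(j), inplace=True)
    qc.measure(0, 0)
    circuits.append(qc)

# With an IBM backend in hand, all variants would go out in one call, e.g.:
# job = backend.run(transpile(circuits, backend), shots=2000)
```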
Numerical simulations of Earth's weather and climate require substantial amounts of computation. This has led to a growing interest in replacing subroutines that explicitly compute physical processes with approximate machine learning (ML) methods that are fast at inference time. Within weather and climate models, atmospheric radiative transfer (RT) calculations are especially expensive. This has made them a popular target for neural network-based emulators. However, prior work is hard to compare due to the lack of a comprehensive dataset and standardized best practices for ML benchmarking. To fill this gap, we build a large dataset, ClimART, with more than 10 million samples from present, pre-industrial, and future climate conditions, based on the Canadian Earth System Model. ClimART poses several methodological challenges for the ML community, such as multiple out-of-distribution test sets, underlying domain physics, and a trade-off between accuracy and inference speed.
1 PAPER • NO BENCHMARKS YET
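The emulation task posed by ClimART (map a vertical column of atmospheric variables to radiative outputs) can be illustrated with a minimal PyTorch model. The input and output sizes below are placeholders, not the real ClimART dimensions:

```python
# Toy radiative-transfer emulator: atmospheric column in, fluxes/heating
# rates out. Sizes are illustrative, not taken from the ClimART paper.
import torch
import torch.nn as nn

N_IN, N_OUT = 512, 100  # hypothetical column-feature and target sizes

emulator = nn.Sequential(
    nn.Linear(N_IN, 256), nn.ReLU(),
    nn.Linear(256, 256), nn.ReLU(),
    nn.Linear(256, N_OUT),
)

x = torch.randn(32, N_IN)  # a batch of atmospheric columns
pred = emulator(x)         # predicted radiative-transfer outputs
print(pred.shape)          # torch.Size([32, 100])
```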
This dataset is the outcome of a data challenge conducted as part of the Dark Machines Initiative and the Les Houches 2019 workshop on Physics at TeV colliders. The challenge aims at detecting signals of new physics at the LHC using unsupervised machine learning algorithms.
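The unsupervised setting of the challenge amounts to training an anomaly detector on (assumed) Standard Model background events and ranking events by outlier score. A generic scikit-learn stand-in, with a purely illustrative event representation:

```python
# Generic anomaly-detection baseline; the 4-feature events are synthetic
# stand-ins, not the challenge's actual event format.
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)
background = rng.normal(size=(10000, 4))  # stand-in for SM event features

detector = IsolationForest(random_state=0).fit(background)
scores = -detector.score_samples(background)  # higher = more anomalous
print(scores[:5])
```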
Neural network model files and MadGraph event generator outputs used as inputs to the results presented in the paper "Learning to discover: expressive Gaussian mixture models for multi-dimensional simulation and parameter inference in the physical sciences", arXiv:2108.11481; 2022 Mach. Learn.: Sci. Technol. 3 015021. Code and model files can be found at: https://github.com/darrendavidprice/science-discovery/tree/master/expressive_gaussian_mixture_models
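As context for the entry above, a Gaussian mixture used as a density estimator over simulated events looks roughly like the following generic scikit-learn sketch; this is a stand-in, not the authors' expressive-GMM implementation:

```python
# Fit a Gaussian mixture to simulated events and evaluate per-event
# log-likelihood; data and hyperparameters are illustrative only.
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
samples = rng.normal(size=(5000, 2))  # stand-in for simulated events

gmm = GaussianMixture(n_components=8, covariance_type="full").fit(samples)
log_density = gmm.score_samples(samples)  # per-event log-likelihood
print(log_density[:5])
```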
JetNet is a particle cloud dataset, containing gluon, top quark, and light quark jets saved in .csv format.
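Since the jets are distributed as .csv, they can be inspected directly with pandas. The file name below is a hypothetical placeholder; check the dataset's documentation for the actual files and per-particle columns:

```python
# Load one JetNet .csv file; the file name is an assumption.
import pandas as pd

df = pd.read_csv("g_jets.csv")  # hypothetical file for the gluon jets
print(df.columns.tolist())      # per-particle feature columns
print(df.head())
```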
SuperCaustics is a simulation tool made in Unreal Engine for generating massive computer vision datasets that include transparent objects.
1 PAPER • 1 BENCHMARK
pd4ml is a collection of datasets from fundamental physics research -- including particle physics, astroparticle physics, and hadron and nuclear physics -- for supervised machine learning studies. These datasets, containing hadronic top quarks, cosmic-ray-induced air showers, phase transitions in hadronic matter, and generator-level histories, are made public to simplify future work on cross-disciplinary machine learning and transfer learning in fundamental physics.
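For a supervised study on one of these datasets, the starting point is a feature/label pair. The sketch below assumes a .npz export with "features" and "labels" arrays; the package ships its own loaders, so treat this purely as an illustration:

```python
# Generic loading sketch; file and array names are assumptions, not the
# pd4ml package's actual API.
import numpy as np

data = np.load("top_tagging.npz")        # hypothetical file name
X, y = data["features"], data["labels"]  # hypothetical array names
print(X.shape, y.shape)
```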
SynD is a synthetic energy dataset with a focus on residential buildings. This dataset is the result of a custom simulation process that relies on power traces of household appliances. The output of the simulations is the power consumption of 21 household appliances as well as the household-wide consumption (i.e. mains). Therefore, SynD can be used for Non-Intrusive Load Monitoring (NILM), also referred to as energy disaggregation.
0 PAPERS • NO BENCHMARKS YET
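The NILM setup on SynD-like data pairs the mains signal with one appliance channel to build (input, target) windows for a disaggregation model. File and column names in the sketch below are assumptions:

```python
# Build sliding windows of mains power (input) and one appliance's power
# (target) for disaggregation; the CSV schema here is hypothetical.
import pandas as pd

mains = pd.read_csv("mains.csv")["power"]    # hypothetical schema
fridge = pd.read_csv("fridge.csv")["power"]

WINDOW = 599  # a common sliding-window length in NILM work
inputs = [mains.iloc[i:i + WINDOW].to_numpy()
          for i in range(0, len(mains) - WINDOW, WINDOW)]
targets = [fridge.iloc[i:i + WINDOW].to_numpy()
           for i in range(0, len(fridge) - WINDOW, WINDOW)]
print(len(inputs), inputs[0].shape)
```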