Click to add a brief description of the dataset (Markdown and LaTeX enabled).
3 PAPERS • NO BENCHMARKS YET
The availability of well-curated datasets has driven the success of Machine Learning (ML) models. Despite greater access to earth observation data in agriculture, there is a scarcity of curated and labelled datasets, which limits the potential of its use in training ML models for remote sensing (RS) in agriculture. To this end, we introduce a first-of-its-kind dataset called SICKLE, which constitutes a time-series of multi-resolution imagery from 3 distinct satellites: Landsat-8, Sentinel-1 and Sentinel-2. Our dataset constitutes multi-spectral, thermal and microwave sensors during January 2018 - March 2021 period. We construct each temporal sequence by considering the cropping practices followed by farmers primarily engaged in paddy cultivation in the Cauvery Delta region of Tamil Nadu, India; and annotate the corresponding imagery with key cropping parameters at multiple resolutions (i.e. 3m, 10m and 30m). Our dataset comprises 2, 370 season-wise samples from 388 unique plots, having
1 PAPER • 5 BENCHMARKS
This package provides utilities for generation, filtering, solving, visualizing, and processing of mazes for training ML systems. Primarily built for the maze-transformer interpretability project. You can find our paper on it here: http://arxiv.org/abs/2309.10498
2 PAPERS • NO BENCHMARKS YET
Overview: This dataset encompasses a compilation of 6,700 executed scoops (excavations), mapped across a vast spectrum of materials, terrain topography, and compositions.
1 PAPER • NO BENCHMARKS YET
The RoseBlooming dataset is a stage-specific flower dataset for detection. The dataset, consisting of overhead images, contains two rose cultivars and was filmed over a period of months. The dataset has 519 images, and most of the images contain several bounding boxes. Therefore, this dataset contains over 7,000 bounding boxes. The developmental stages of flowering branches were visually classified and annotated into two stages: rose_small, and rose_large. For the rose variation, the dataset includes 2 rose cultivars (‘Samourai 08’ and ‘Blossom Pink’ roses). The dataset contains images under various weather conditions.
Sonicverse is a multisensory simulation platform with integrated audio-visual simulation for training household agents that can both see and hear. Sonicverse models realistic continuous audio rendering in 3D environments in real-time. Together with a new audio-Visual VR interface that allows humans to interact with agents with audio, Sonicverse enables a series of embodied AI tasks that need audio-visual perception.
A dataset composed of 308 medium sized fires from the years 2018-2021, complete with both time series airborne based inference and ground operational estimation of fire extent, and operational mitigation data such as control line construction.
RoboPianist is a benchmarking suite for high-dimensional control, targeted at testing high spatial and temporal precision, coordination, and planning, all with an underactuated system frequently making-and-breaking contacts. The proposed challenge is mastering the piano through bi-manual dexterity, using a pair of simulated anthropomorphic robot hands. The initial version covers a broad set of 150 variable-difficulty songs.
FluidLab is a simulation environment with a diverse set of manipulation tasks involving complex fluid dynamics. These tasks address interactions between solid and fluid as well as among multiple fluids.
A multivariate spatio-temporal benchmark dataset for meteorological forecasting based on real-time observation data from ground weather stations.
8 PAPERS • 16 BENCHMARKS
ManiSkill2 is the next generation of the SAPIEN ManiSkill benchmark, to address critical pain points often encountered by researchers when using benchmarks for generalizable manipulation skills. It includes 20 manipulation task families with 2000+ object models and 4M+ demonstration frames, which cover stationary/mobile-base, single/dual-arm, and rigid/soft-body manipulation tasks with 2D/3D input data simulated by fully dynamic engines.
23 PAPERS • NO BENCHMARKS YET
PushWorld is an environment with simplistic physics that requires manipulation planning with both movable obstacles and tools. It contains more than 200 PushWorld puzzles in PDDL and in an OpenAI Gym environment.
The National Health and Nutrition Examination Survey (NHANES) provides data on the health and environmental exposure of the non-institutionalized US population. Such data have considerable potential to understand how the environment and behaviors impact human health. These data are also currently leveraged to answer public health questions such as prevalence of disease. However, these data need to first be processed before new insights can be derived through large-scale analyses. NHANES data are stored across hundreds of files with multiple inconsistencies. Correcting such inconsistencies takes systematic cross examination and considerable efforts but is required for accurately and reproducibly characterizing the associations between the exposome and diseases (e.g., cancer mortality outcomes). Thus, we developed a set of curated and unified datasets and accompanied code by merging 614 separate files and harmonizing unrestricted data across NHANES III (1988-1994) and Continuous (1999-20
SMACv2 (StarCraft Multi-Agent Challenge v2) is a new version of the benchmark where scenarios are procedurally generated and require agents to generalise to previously unseen settings (from the same distribution) during evaluation.
14 PAPERS • NO BENCHMARKS YET
MO-Gymnasium is an open source Python library for developing and comparing multi-objective reinforcement learning algorithms by providing a standard API to communicate between learning algorithms and environments, as well as a standard set of environments compliant with that API. Essentially, the environments follow the standard Gymnasium API, but return vectorized rewards as numpy arrays.
6 PAPERS • NO BENCHMARKS YET
lilGym is a benchmark for language-conditioned reinforcement learning in visual environment based on 2,661 highly-compositional human-written natural language statements grounded in an interactive visual environment. Each statement is paired with multiple start states and reward functions to form thousands of distinct Markov Decision Processes of varying difficulty.
Memory Maze is a 3D domain of randomized mazes designed for evaluating the long-term memory abilities of RL agents. Memory Maze isolates long-term memory from confounding challenges, such as exploration, and requires remembering several pieces of information: the positions of objects, the wall layout, and keeping track of agent’s own position.
Situated Dialogue Navigation (SDN) is a navigation benchmark of 183 trials with a total of 8415 utterances, around 18.7 hours of control streams, and 2.9 hours of trimmed audio. SDN is developed to evaluate the agent's ability to predict dialogue moves from humans as well as generate its own dialogue moves and physical navigation actions.
This data set contains over 600GB of multimodal data from a Mars analog mission, including accurate 6DoF outdoor ground truth, indoor-outdoor transitions with continuous cross-domain ground truth, and indoor data with Optitrack measurements as ground truth. With 26 flights and a combined distance of 2.5km, this data set provides you with various distinct challenges for testing and proofing your algorithms. The UAV carries 18 sensors, including a high-resolution navigation camera and a stereo camera with an overlapping field of view, two RTK GNSS sensors with centimeter accuracy, as well as three IMUs, placed at strategic locations: Hardware dampened at the center, off-center with a lever arm, and a 1kHz IMU rigidly attached to the UAV (in case you want to work with unfiltered data). The sensors are fully pre-calibrated, and the data set is ready to use. However, if you want to use your own calibration algorithms, then the raw calibration data is also ready for download. The cross-domai
The MUAD dataset (Multiple Uncertainties for Autonomous Driving), consisting of 10,413 realistic synthetic images with diverse adverse weather conditions (night, fog, rain, snow), out-of-distribution objects, and annotations for semantic segmentation, depth estimation, object, and instance detection. Predictive uncertainty estimation is essential for the safe deployment of Deep Neural Networks in real-world autonomous systems and MUAD allows to a better assess the impact of different sources of uncertainty on model performance.
POPGym is designed to benchmark memory in deep reinforcement learning. It contains a set of environments and a collection of memory model baselines. The environments are all Partially Observable Markov Decision Process (POMDP) environments following the Openai Gym interface. Our environments follow a few basic tenets:
2 PAPERS • 1 BENCHMARK
This dataset contains recordings of 32 sound producing insect species with a total 335 files and a length of 57 minutes. The dataset was compiled for training neural networks to automatically identify insect species while comparing adaptive, waveform-based frontends to conventional mel-spectrogram frontends for audio feature extraction. This work will be submitted for publication in the future and this dataset can be used to replicate the results, as well as other uses. The scripts for audio processing and the machine learning implementations will be published on Github.
The StarCraft Multi-Agent Challenges+ requires agents to learn completion of multi-stage tasks and usage of environmental factors without precise reward functions. The previous challenges (SMAC) recognized as a standard benchmark of Multi-Agent Reinforcement Learning are mainly concerned with ensuring that all agents cooperatively eliminate approaching adversaries only through fine manipulation with obvious reward functions. This challenge, on the other hand, is interested in the exploration capability of MARL algorithms to efficiently learn implicit multi-stage tasks and environmental factors as well as micro-control. This study covers both offensive and defensive scenarios. In the offensive scenarios, agents must learn to first find opponents and then eliminate them. The defensive scenarios require agents to use topographic features. For example, agents need to position themselves behind protective structures to make it harder for enemies to attack.
11 PAPERS • 13 BENCHMARKS
A Simulated Benchmark for multi-modal SLAM Systems Evaluation in Large-scale Dynamic Environments.
Multi-agent pursuit in matrix world (pursuitMW) is a partially observable Markov game (POMG) between a swarm of pursuers and a swarm of evaders. Algorithms can be developed for the pursuers, evaders, or both of them.
The dataset contains constructed multi-modal features (visual and textual), pseudo-labels (on heritage values and attributes), and graph structures (with temporal, social, and spatial links) constructed using User-Generated Content data collected from Flickr social media platform in three global cities containing UNESCO World Heritage property (Amsterdam, Suzhou, Venice). The motivation of data collection in this project is to provide datasets that could be both directly applicable for ML communities as test-bed, and theoretically informative for heritage and urban scholars to draw conclusions on for planning decision-making.
RTMV is a large-scale synthetic dataset for novel view synthesis consisting of ∼300k images rendered from nearly 2000 complex scenes using high-quality ray tracing at high resolution (1600 × 1600 pixels). The dataset is orders of magnitude larger than existing synthetic datasets for novel view synthesis, thus providing a large unified benchmark for both training and evaluation. Using 4 distinct sources of high-quality 3D meshes, the scenes of our dataset exhibit challenging variations in camera views, lighting, shape, materials, and textures.
12 PAPERS • 1 BENCHMARK
Archive of Global Tropical Cyclone Tracks Tracks from 1980 to May 2019.
Kubric is a data generation pipeline for creating semi-realistic synthetic multi-object videos with rich annotations such as instance segmentation masks, depth maps, and optical flow.
43 PAPERS • 1 BENCHMARK
Wastewater catchment area data are essential for wastewater treatment capacity planning and have recently become critical for operationalising wastewater-based epidemiology (WBE) for COVID-19. Owing to the privatised nature of the water industry in the United Kingdom, the required catchment area datasets are not readily available to researchers. Here, we present a consolidated dataset of 7,537 catchment areas from ten sewerage service providers in the Great Britain, covering more than 96% of the population of England and Wales.
EvoGym is a large-scale benchmark for co-optimizing the design and control of soft robots.
7 PAPERS • NO BENCHMARKS YET
Unsustainable fishing practices worldwide pose a major threat to marine resources and ecosystems. Identifying vessels that do not show up in conventional monitoring systems---known as ``dark vessels''---is key to managing and securing the health of marine environments. With the rise of satellite-based synthetic aperture radar (SAR) imaging and modern machine learning (ML), it is now possible to automate detection of dark vessels day or night, under all-weather conditions. SAR images, however, require a domain-specific treatment and are not widely accessible to the ML community. Maritime objects (vessels and offshore infrastructure) are relatively small and sparse, challenging traditional computer vision approaches. We present the largest labeled dataset for training ML models to detect and characterize vessels and ocean structures in SAR imagery. xView3-SAR consists of nearly 1,000 analysis-ready SAR images from the Sentinel-1 mission that are, on average, 29,400-by-24,400 pixels each.
Numerical simulations of Earth's weather and climate require substantial amounts of computation. This has led to a growing interest in replacing subroutines that explicitly compute physical processes with approximate machine learning (ML) methods that are fast at inference time. Within weather and climate models, atmospheric radiative transfer (RT) calculations are especially expensive. This has made them a popular target for neural network-based emulators. However, prior work is hard to compare due to the lack of a comprehensive dataset and standardized best practices for ML benchmarking. To fill this gap, we build a large dataset, ClimART, with more than \emph{10 million samples from present, pre-industrial, and future climate conditions}, based on the Canadian Earth System Model. ClimART poses several methodological challenges for the ML community, such as multiple out-of-distribution test sets, underlying domain physics, and a trade-off between accuracy and inference speed.
4 PAPERS • NO BENCHMARKS YET
The bipedal skills benchmark is a suite of reinforcement learning environments implemented for the MuJoCo physics simulator. It aims to provide a set of tasks that demand a variety of motor skills beyond locomotion, and is intended for evaluating skill discovery and hierarchical learning methods. The majority of tasks exhibit a sparse reward structure.
Symbolic Interactive Language Grounding (SILG) is a multi-environment benchmark which unifies a collection of diverse grounded language learning environments under a common interface. SILG consists of grid-world environments that require generalization to new dynamics, entities, and partially observed worlds (RTFM, Messenger, NetHack), as well as symbolic counterparts of visual worlds that require interpreting rich natural language with respect to complex scenes (ALFWorld, Touchdown). Together, these environments provide diverse grounding challenges in richness of observation space, action space, language specification, and plan complexity.
RGB-Stacking is a benchmark for vision-based robotic manipulation. The robot is trained to learn how to grasp objects and balance them on top of one another.
11 PAPERS • 3 BENCHMARKS
CARL (context adaptive RL) provides highly configurable contextual extensions to several well-known RL environments. It's designed to test your agent's generalization capabilities in all scenarios where intra-task generalization is important.
Robots operating in human spaces must be able to engage in natural language interaction with people, both understanding and executing instructions, and using conversation to resolve ambiguity and recover from mistakes. To study this, we introduce TEACh, a dataset of over 3,000 human--human, interactive dialogues to complete household tasks in simulation. A Commander with access to oracle information about a task communicates in natural language with a Follower. The Follower navigates through and interacts with the environment to complete tasks varying in complexity from "Make Coffee" to "Prepare Breakfast", asking questions and getting additional information from the Commander. We propose three benchmarks using TEACh to study embodied intelligence challenges, and we evaluate initial models' abilities in dialogue understanding, language grounding, and task execution.
25 PAPERS • NO BENCHMARKS YET
Bridge Data is a large multi-domain and multi-task dataset, with 7,200 demonstrations constituting 71 tasks across 10 environments. The dataset is collected using a low-cost yet versatile 6-DoF WidowX250 robot arm and contains 7,200 demonstrations of a robot performing 71 kitchen tasks across 10 environments with varying lighting, robot positions, and backgrounds. It can be used to boosting generalization of robotic skills and empirically study how it can improve the learning of new tasks in new environments.
8 PAPERS • NO BENCHMARKS YET
Collected in the snow belt region of Michigan's Upper Peninsula, WADS is the first multi-modal dataset featuring dense point-wise labeled sequential LiDAR scans collected in severe winter weather.
safe-control-gym is an open-source benchmark suite that extends OpenAI's Gym API with (i) the ability to specify (and query) symbolic models and constraints and (ii) introduce simulated disturbances in the control inputs, measurements, and inertial properties. We provide implementations for three dynamic systems -- the cart-pole, 1D, and 2D quadrotor -- and two control tasks -- stabilization and trajectory tracking.
Main Dataset city_pollution_data.csv
Phy-Q is a benchmark that requires an agent to reason about physical scenarios and take an action accordingly. Inspired by the physical knowledge acquired in infancy and the capabilities required for robots to operate in real-world environments, the authors identify 15 essential physical scenarios. For each scenario, a wide variety of distinct task templates are created, and all the task templates within the same scenario can be solved by using one specific physical rule.
SPACE is a simulator for physical Interactions and causal learning in 3D environments. The SPACE simulator is used to generate the SPACE dataset, a synthetic video dataset in a 3D environment, to systematically evaluate physics-based models on a range of physical causal reasoning tasks. Inspired by daily object interactions, the SPACE dataset comprises videos depicting three types of physical events: containment, stability and contact.
5 PAPERS • 1 BENCHMARK
iGibson 2.0 is an open-source simulation environment that supports the simulation of a more diverse set of household tasks through three key innovations. First, iGibson 2.0 supports object states, including temperature, wetness level, cleanliness level, and toggled and sliced states, necessary to cover a wider range of tasks. Second, iGibson 2.0 implements a set of predicate logic functions that map the simulator states to logic states like Cooked or Soaked. Additionally, given a logic state, iGibson 2.0 can sample valid physical states that satisfy it. This functionality can generate potentially infinite instances of tasks with minimal effort from the users. The sampling mechanism allows our scenes to be more densely populated with small objects in semantically meaningful locations. Third, iGibson 2.0 includes a virtual reality (VR) interface to immerse humans in its scenes to collect demonstrations.
10 PAPERS • NO BENCHMARKS YET
ManiSkill is a large-scale learning-from-demonstrations benchmark for articulated object manipulation with visual input (point cloud and image). ManiSkill supports object-level variations by utilizing a rich and diverse set of articulated objects, and each task is carefully designed for learning manipulations on a single category of objects. ManiSkill is equipped with high-quality demonstrations to facilitate learning-from-demonstrations approaches and perform evaluations on common baseline algorithms. ManiSkill can encourage the robot learning community to explore more on learning generalizable object manipulation skills.
SuperCaustics is a simulation tool made in Unreal Engine for generating massive computer vision datasets that include transparent objects.