Urban is one of the most widely used hyperspectral datasets in hyperspectral unmixing research. The image contains 307x307 pixels, each corresponding to a 2x2 m2 area, and 210 wavelengths ranging from 400 nm to 2500 nm, giving a spectral resolution of 10 nm. After channels 1-4, 76, 87, 101-111, 136-153 and 198-210 are removed (due to dense water vapor and atmospheric effects), 162 channels remain; this is a common preprocessing step in hyperspectral unmixing analyses. Three versions of ground truth are provided, containing 4, 5 and 6 endmembers respectively.
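The band-removal step described above can be sketched as a boolean mask over the 210 original channels; the removed 1-based indices are taken from the description, and the cube shape below is a stand-in for the real data.

```python
import numpy as np

# 1-based channel indices removed due to dense water vapor and
# atmospheric effects, as listed in the dataset description.
removed = (list(range(1, 5)) + [76, 87] + list(range(101, 112))
           + list(range(136, 154)) + list(range(198, 211)))

# Boolean mask over the 210 original bands (0-based internally).
keep = np.ones(210, dtype=bool)
keep[np.array(removed) - 1] = False
print(int(keep.sum()))  # 162 channels remain

# Applied to a stand-in cube of the stated shape (307, 307, 210):
cube = np.zeros((307, 307, 210), dtype=np.float32)
cube_clean = cube[:, :, keep]
print(cube_clean.shape)  # (307, 307, 162)
```

The removed indices sum to 48 bands (4 + 1 + 1 + 11 + 18 + 13), which is consistent with the stated 162 remaining channels.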
2 PAPERS • NO BENCHMARKS YET
As part of our policy to openly share all data from this project, we have included a downloadable package comprising all acoustic data collected over the course of this work. This includes acoustic recordings from 20 different species of mosquitoes, using a variety of mobile phones for each. This data can be downloaded from the online repository on dryad.org. The supplementary audio files are not included in this package, and may be downloaded separately.
1 PAPER • NO BENCHMARKS YET
ACL-Fig is a large-scale automatically annotated corpus consisting of 112,052 scientific figures extracted from 56K research papers in the ACL Anthology. The ACL-Fig-pilot dataset contains 1,671 manually labeled scientific figures belonging to 19 categories.
The dataset contains three subsets:
This is a database of backdoored neural networks intended for face recognition. The networks use the FaceNet architecture and are trained on Casia-WebFace, with and without additional samples (which are the source of the backdoor). More information about the backdoors and the project they belong to can be found in the public release of the source code: https://gitlab.idiap.ch/bob/bob.paper.backdoored_facenets.biosig2022.
The dataset contains a total of 253,070 records with 18 features. The features are categorized into four types: Metadata, Primary Data, Engagement Stats, and Label. The Metadata category contains basic information about the channel and video, such as their unique identifiers, date and time of publication, and thumbnail URLs. The Primary Data category contains the title and description of the video; the "Processed" columns refer to data cleaned through denoising, deduplication, and debiasing for further analysis. The Engagement Stats category contains user engagement metrics for each video. The Label category contains predefined auto labels, human-annotated labels, and AI-generated pseudo labels. Auto labels are derived automatically from a review of titles, descriptions, and thumbnails over time; channels with consistently misleading, exaggerated, or sensationalized content were labeled as clickbait. Those focusing on
The dataset contains 36,000 Bangla text samples based on Ekman's six basic emotions. The data was first introduced in the paper "Alternative non-BERT model choices for the textual classification in low-resource languages and environments". The whole dataset is balanced and evenly distributed among the six classes.
1 PAPER • 1 BENCHMARK
Several datasets are fostering innovation in higher-level functions for everyone, everywhere. By providing this repository, we hope to encourage the research community to focus on hard problems. In this repository, we present the real results of the severity (BIRADS) and pathology (post-report) classifications provided by the Radiologist Director of the Radiology Department of Hospital Fernando Fonseca while diagnosing several patients (see dataset-uta4-dicom) from our User Tests and Analysis 4 (UTA4) study. Here, we provide a dataset of the measurements of both severity (BIRADS) and pathology classifications concerning the patient diagnostic. Work and results are published at AVI 2020, a top Human-Computer Interaction (HCI) conference (page). Results were analyzed and interpreted from our statistical analysis charts. The user tests were conducted in clinical institutions, where clinicians diagnosed several patients for a Single-Modality vs. Multi-Modality comparison. For example, in these t
Original images and images with RUSTICO filters applied
The dataset includes measurements of static tension under a 2 kg load at different points of the CB, as well as measurements under dynamic conditions. The dynamic conditions covered linear belt speeds between nu_1 = 0.5 m/s and nu_max = 1.7 m/s. A unified sampling frequency of 400 Hz was used for the experiments, corresponding to 140 samples.
This is a set of 100,000 non-overlapping image patches from hematoxylin & eosin (H&E) stained histological images of human colorectal cancer (CRC) and normal tissue. All images are 224x224 pixels (px) at 0.5 microns per pixel (MPP). For tissue classification, the classes are: Adipose (ADI), background (BACK), debris (DEB), lymphocytes (LYM), mucus (MUC), smooth muscle (MUS), normal colon mucosa (NORM), cancer-associated stroma (STR), colorectal adenocarcinoma epithelium (TUM). The images were manually extracted from N=86 H&E stained human cancer tissue slides from formalin-fixed paraffin-embedded (FFPE) samples from the NCT Biobank (National Center for Tumor Diseases, Heidelberg, Germany) and the UMM pathology archive (University Medical Center Mannheim, Mannheim, Germany). Tissue samples contained CRC primary tumor slides and tumor tissue from CRC liver metastases; normal tissue classes were augmented with non-tumorous regions from gastrectomy specimens to increase variability.
Histological images of colorectal cancer, derived from the TCGA database
CVE stands for Common Vulnerabilities and Exposures. CVE is a glossary that classifies vulnerabilities. The glossary analyzes vulnerabilities and then uses the Common Vulnerability Scoring System (CVSS) to evaluate the threat level of a vulnerability. A CVE score is often used for prioritizing the security of vulnerabilities.
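Since the entry mentions that CVSS scores are used to prioritize vulnerabilities, the standard CVSS v3.x qualitative severity bands can be sketched as a small helper (the function name is our own, but the score ranges follow the CVSS v3.1 specification):

```python
def cvss_severity(score: float) -> str:
    """Map a CVSS v3.x base score to its qualitative severity rating."""
    if not 0.0 <= score <= 10.0:
        raise ValueError("CVSS base scores range from 0.0 to 10.0")
    if score == 0.0:
        return "None"
    if score <= 3.9:
        return "Low"
    if score <= 6.9:
        return "Medium"
    if score <= 8.9:
        return "High"
    return "Critical"

print(cvss_severity(9.8))  # Critical
print(cvss_severity(5.0))  # Medium
```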
A dataset of games played in the card game "Cards Against Humanity" (CAH) by human players, derived from the online CAH labs. Each round includes the cards presented to users: a "black" prompt with a blank or question and 10 "white" punchlines as possible responses, along with which punchline was picked by a player each round, plus text and metadata.
A large dataset of color names and their respective RGB values, stored in CSV format.
The Digitally Generated Numerals (DIGITal) dataset consists of 100,000 image pairs representing digits from 0 to 9. Each pair includes a low-quality and a high-quality version, at a resolution of 128x128 pixels.
The dataset comprises motion sensor data of 19 daily and sports activities each performed by 8 subjects in their own style for 5 minutes. Five Xsens MTx units are used on the torso, arms, and legs.
DeepGraviLens is a data set of simulated gravitational lenses consisting of images associated with brightness variation time series. In this dataset, both non-transient and transient phenomena (supernovae explosions) are simulated.
DeepParliament is a legal-domain benchmark dataset that gathers bill documents and metadata and supports various bill status classification tasks. The text covers a broad range of bills from 1986 to the present and contains rich information on parliamentary bill content. There are a total of 5,329 documents, of which 4,223 are in the train set and 1,106 in the test set. Each bill document contains many sentences, and document length varies greatly.
The Dissonance Twitter Dataset is a collection of tweets annotated for dissonance.
FinBench is a benchmark for evaluating the performance of machine learning models with both tabular data inputs and profile text inputs.
The Food Recall Incidents dataset consists of 7,546 short texts (from 5 to 360 characters each), which are the titles of food recall announcements (therefore referred to as title), crawled from 24 public food safety authority websites by Agroknow. The texts are written in 6 languages, with English (6,644) and German (888) being the most common, followed by French (8), Greek (4), Italian (1) and Danish (1). Most of the texts have been authored after 2010 and they describe recalls of specific food products due to specific hazards. Experts manually classified each text to four groups of classes describing hazards and products on two levels of granularity:
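The per-language counts quoted above can be checked against the stated total with a quick sanity check (variable names are our own):

```python
# Per-language counts as stated in the dataset description.
counts = {"English": 6644, "German": 888, "French": 8,
          "Greek": 4, "Italian": 1, "Danish": 1}

total = sum(counts.values())
print(total)  # 7546, matching the stated number of texts
```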
FractureAtlas is a musculoskeletal bone fracture dataset with annotations for deep learning tasks like classification, localization, and segmentation. The dataset contains a total of 4,083 X-Ray images with annotation in COCO, VGG, YOLO, and Pascal VOC format. This dataset is made freely available for any purpose. The data provided within this work are free to copy, share or redistribute in any medium or format. The data might be adapted, remixed, transformed, and built upon. The dataset is licensed under a CC-BY 4.0 license. It should be noted that to use the dataset correctly, one needs to have knowledge of medical and radiology fields to understand the results and make conclusions based on the dataset. It's also important to consider the possibility of labeling errors.
We introduce GLAMI-1M: the largest multilingual image-text classification dataset and benchmark. The dataset contains images of fashion products with item descriptions, each in 1 of 13 languages. Categorization into 191 classes has high-quality annotations: all 100k images in the test set and 75% of the 1M training set were human-labeled. The paper presents baselines for image-text classification showing that the dataset presents a challenging fine-grained classification problem: The best scoring EmbraceNet model using both visual and textual features achieves 69.7% accuracy. Experiments with a modified Imagen model show the dataset is also suitable for image generation conditioned on text.
Gambling Address Dataset is a collection of 10,423 gambling addresses that have transactions with gambling contracts. Moreover, 51,004 non-gambling addresses are also selected (such as exchanges, wallet addresses, etc.), making the gambling address dataset more complete. In the dataset, accounts are used to refer to addresses (e.g. 0xd1ce...edec95), where 1, 0, and -1 represent the gamble, non-gamble, and other types, respectively.
Gambling Contract Dataset is a collection of 260 gambling smart contracts from decentralized gambling websites, such as Dicether and Degens. To construct the negative samples required for training, 1,040 smart contracts that are not involved in gambling (e.g., erc20, erc721, mixer, etc.) are also selected. In the dataset, accounts are used to refer to contracts (e.g. 0x3fe2b...f8a33f), where 1, 0, and -1 represent the gamble, non-gamble, and other types, respectively.
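The 1/0/-1 label scheme shared by both gambling datasets can be sketched as a tiny encoding helper (the mapping follows the descriptions above; the function and key names are our own):

```python
# 1 = gamble, 0 = non-gamble, -1 = other, per the dataset descriptions.
LABELS = {"gamble": 1, "non-gamble": 0, "other": -1}

def encode(label: str) -> int:
    """Encode a category name into the dataset's integer label."""
    return LABELS[label]

print(encode("gamble"))      # 1
print(encode("non-gamble"))  # 0
print(encode("other"))       # -1
```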
Dataset introduced by Xifeng Yan et al.
HOWS-CL-25 (Household Objects Within Simulation dataset for Continual Learning) is a synthetic dataset especially designed for object classification on mobile robots operating in a changing environment (like a household), where it is important to learn new, never-seen objects on the fly. The dataset can also be used for other learning use cases, such as instance segmentation or depth estimation, or wherever household objects or continual learning are of interest.
1 PAPER • 2 BENCHMARKS
The HRPlanesv2 dataset contains 2,120 VHR Google Earth images. To further improve experiment results, images of airports from many different regions with various uses (civil/military/joint) were selected and labeled. A total of 14,335 aircraft have been labeled. Each image is stored as a 4800 x 2703 pixel ".jpg" file, and each label is stored in YOLO ".txt" format. The dataset has been split into three parts: 70% train, 20% validation, and 10% test. The aircraft in the train and validation images are at least 80% visible. Link: https://github.com/dilsadunsal/HRPlanesv2-Data-Set
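Given the 70/20/10 split over 2,120 images, the per-split counts work out as follows (a quick sketch; the repository's actual split files are authoritative):

```python
total = 2120  # VHR Google Earth images

train = int(total * 0.70)   # 1484
val = int(total * 0.20)     # 424
test = total - train - val  # remaining 10%: 212

print(train, val, test)  # 1484 424 212
```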
The IRFL dataset consists of idioms, similes, and metaphors with matching figurative and literal images, as well as two novel tasks of multimodal figurative understanding and preference.
This dataset was presented as part of the ICLR 2023 paper "A framework for benchmarking Class-out-of-distribution detection and its application to ImageNet".
This is a real-world industrial benchmark dataset from a major medical device manufacturer for the prediction of customer escalations. The dataset contains features derived from IoT (machine log) and enterprise data, including labels for escalation, from a fleet of thousands of customers of high-end medical devices.
Context: The Kepler Space Observatory is a NASA-built satellite that was launched in 2009. The telescope is dedicated to searching for exoplanets in star systems beyond our own, with the ultimate goal of possibly finding other habitable planets. The original mission ended in 2013 due to mechanical failures, but the telescope has remained functional since 2014 on a "K2" extended mission.
LEPISZCZE is an open-source comprehensive benchmark for Polish NLP and a continuous-submission leaderboard, concentrating public Polish datasets (existing and new) in specific tasks.
This dataset comprises 22 fundus images with their corresponding manual annotations of the blood vessels, separated into arteries and veins. It also includes glaucomatous/healthy labels, differentiating between normal tension glaucoma (NAG) and primary open angle glaucoma (POAG).
LLeQA is a French native dataset for studying information retrieval and long-form question answering in the legal domain. It consists of a knowledge corpus of 27,941 statutory articles collected from the Belgian legislation, and 1,868 legal questions posed by Belgian citizens and labeled by experienced jurists with a comprehensive answer rooted in relevant articles from the corpus.
This data set contains 775 video sequences, captured in the wildlife park Lindenthal (Cologne, Germany) as part of the AMMOD project, using an Intel RealSense D435 stereo camera. In addition to color and infrared images, the D435 is able to infer the distance (or "depth") to objects in the scene using stereo vision. Observed animals include various birds (at daytime) and mammals such as deer, goats, sheep, donkeys, and foxes (primarily at nighttime). A subset of 412 images is annotated with a total of 1038 individual animal annotations, including instance masks, bounding boxes, class labels, and corresponding track IDs to identify the same individual over the entire video.
The Mpox Close Skin Images dataset (MCSI) is a collection of skin images obtained from diverse public sources, which we carefully pre-processed (i.e., cropped and zoomed) to focus on the skin lesion (if present) and to evaluate Machine Learning models aimed at detecting different pathologies from skin lesion pictures taken with smartphone cameras. It includes a total of 400 pictures evenly divided into 4 classes: mpox, containing samples of mpox (formerly monkeypox) skin lesions; chickenpox, with samples of chickenpox cases; acne, containing samples of acne at different severity levels; and healthy, containing samples of skin without any evident symptoms. This repository is part of the supplementary material accompanying the paper "A Transfer Learning and Explainable Solution to Detect mpox from Smartphones images".
MapReader in GeoHumanities workshop (SIGSPATIAL 2022): Gold standards and outputs
MiST (Modals In Scientific Text) is a dataset containing 3737 modal instances in five scientific domains annotated for their semantic, pragmatic, or rhetorical function.
The MixedWM38 dataset (WaferMap) contains more than 38,000 wafer maps, covering 1 normal pattern, 8 single-defect patterns, and 29 mixed-defect patterns, for a total of 38 patterns.
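The pattern count quoted above is simple arithmetic over the three groups (variable names are our own):

```python
# Pattern counts as stated in the dataset description.
normal = 1    # defect-free pattern
single = 8    # single-defect patterns
mixed = 29    # mixed-defect patterns

total = normal + single + mixed
print(total)  # 38 patterns in total
```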
This dataset can be used by anyone interested in the morphological classification of galaxies. It was originally provided by Kaggle user Jay Lin (https://www.kaggle.com/jay1985) and was used in the conference paper "Morphological Classification of Galaxies Using SpinalNet".
Early detection of retinal diseases is one of the most important means of preventing partial or permanent blindness in patients. One of the major stumbling blocks for manual retinal examination is the lack of a sufficient number of qualified medical personnel per capita to diagnose diseases. Computer-aided diagnosis (CAD) systems have proven to be very effective in helping physicians reduce the time taken to make a diagnosis and minimize variability in image interpretation. Still, they are not flexible enough to accommodate the simultaneous presence of multiple retinal diseases, which is a common situation in real-world applications. In past years, a few datasets focusing on the classification of numerous retinal pathologies present at the same time (i.e., multi-label classification) have been proposed, but they share some problems, such as a narrow range of pathologies to classify, a high level of class imbalance, and a low amount of samples for the underrepresented
Neural fields (NeFs) have recently emerged as a versatile method for modeling signals of various modalities, including images, shapes, and scenes. Subsequently, many works have explored the use of NeFs as representations for downstream tasks, e.g. classifying an image based on the parameters of a NeF that has been fit to it. However, the impact of the NeF hyperparameters on their quality as downstream representation is scarcely understood and remains largely unexplored. This is partly caused by the large amount of time required to fit datasets of neural fields.
Hand-labelled dataset of crop and non-crop labels distributed throughout Nigeria, with corresponding HDF5 data arrays.
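Reading one such labelled HDF5 array with h5py might look like the sketch below; the key names (`bands`, `is_crop`), shapes, and file layout are assumptions for illustration, and the real dataset's structure may differ, so a small stand-in file is written first to make the read pattern runnable.

```python
import os
import tempfile

import h5py
import numpy as np

# Hypothetical path and keys -- the actual layout of the
# dataset's HDF5 files may differ.
path = os.path.join(tempfile.mkdtemp(), "patch.h5")

# Write a small stand-in file so the read pattern below is runnable.
with h5py.File(path, "w") as f:
    f.create_dataset("bands", data=np.zeros((12, 12, 4), dtype=np.float32))
    f.attrs["is_crop"] = 1

# Typical read pattern for one labelled patch.
with h5py.File(path, "r") as f:
    bands = f["bands"][...]       # load the full array into memory
    label = int(f.attrs["is_crop"])

print(bands.shape, label)  # (12, 12, 4) 1
```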
Onchocerciasis is causing blindness in over half a million people in the world today. Drug development for the disease is crippled because there is no way of measuring a drug's effectiveness without an invasive procedure. Measuring drug efficacy through assessment of the viability of Onchocerca worms requires patients to undergo nodulectomy, an invasive, expensive, time-consuming, skill- and infrastructure-dependent, and lengthy process.
Monitoring and evaluating driving behavior is the main goal of this paper, which encouraged us to develop a new system based on the Inertial Measurement Unit (IMU) sensors of smartphones. In this system, a hybrid of Discrete Wavelet Transformation (DWT) and an Adaptive Neuro-Fuzzy Inference System (ANFIS) is used to recognize overall driving behaviors. The behaviors are classified into safe, semi-aggressive, and aggressive classes, aligned with Driver Anger Scale (DAS) self-reported questionnaire results. The proposed system extracts four features from IMU sensors as time series. These are decomposed by DWT into two levels, and their energies are sent to six ANFISs. Each ANFIS models a different perception of driving behavior under uncertain knowledge and returns the similarity or dissimilarity between driving behaviors. The results of the six ANFISs are combined by three different decision fusion approaches. Results show that Coiflet-2 is the most suitable
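The energy-feature step described above (two DWT levels, then band energies) can be sketched with a dependency-free Haar DWT; the paper itself reports the Coiflet-2 wavelet as best, so this is an illustration of the pipeline's shape under a simpler wavelet, not the exact transform. Because the Haar transform here is orthonormal, the band energies sum to the signal's total energy.

```python
import numpy as np

def haar_dwt(x):
    """One level of the orthonormal Haar DWT: (approximation, detail)."""
    x = np.asarray(x, dtype=float)
    if len(x) % 2:                      # pad to even length
        x = np.append(x, x[-1])
    a = (x[0::2] + x[1::2]) / np.sqrt(2)
    d = (x[0::2] - x[1::2]) / np.sqrt(2)
    return a, d

def dwt_energies(x, levels=2):
    """Energies of each detail band plus the final approximation band."""
    energies = []
    a = np.asarray(x, dtype=float)
    for _ in range(levels):
        a, d = haar_dwt(a)
        energies.append(float(np.sum(d ** 2)))
    energies.append(float(np.sum(a ** 2)))
    return energies

# Stand-in for one IMU feature time series.
signal = np.sin(np.linspace(0, 8 * np.pi, 256))
print(dwt_energies(signal, levels=2))
```

In the paper's pipeline, energies like these (from a two-level Coiflet-2 decomposition of each of the four IMU features) would be the inputs fed to the six ANFIS models.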