10 dataset results for Images AND Spanish

MSRC-12 (MSRC-12 Kinect Gesture Dataset)

The Microsoft Research Cambridge-12 Kinect gesture data set consists of sequences of human movements, represented as body-part locations, and the associated gesture to be recognized by the system. The data set includes 594 sequences and 719,359 frames—approximately six hours and 40 minutes—collected from 30 people performing 12 gestures. In total, there are 6,244 gesture instances. The motion files contain tracks of 20 joints estimated using the Kinect Pose Estimation pipeline. The body poses are captured at a sample rate of 30Hz with an accuracy of about two centimeters in joint positions.

31 PAPERS • 2 BENCHMARKS

IGLUE (Image-Grounded Language Understanding Evaluation)

The Image-Grounded Language Understanding Evaluation (IGLUE) benchmark brings together—by both aggregating pre-existing datasets and creating new ones—visual question answering, cross-modal retrieval, grounded reasoning, and grounded entailment tasks across 20 diverse languages. The benchmark enables the evaluation of multilingual multimodal models for transfer learning, not only in a zero-shot setting, but also in newly defined few-shot learning setups.

21 PAPERS • 13 BENCHMARKS

Synbols

Synbols is a dataset generator designed for probing the behavior of learning algorithms. By defining the distribution over latent factors one can craft a dataset specifically tailored to answer specific questions about a given algorithm.

11 PAPERS • NO BENCHMARKS YET

MuMiN

MuMiN is a misinformation graph dataset containing rich social media data (tweets, replies, users, images, articles, hashtags), spanning 21 million tweets belonging to 26 thousand Twitter threads, each of which have been semantically linked to 13 thousand fact-checked claims across dozens of topics, events and domains, in 41 different languages, spanning more than a decade.

4 PAPERS • 3 BENCHMARKS

MultiSubs

MultiSubs (MultiSubs: A Large-scale Multimodal and Multilingual Dataset)

MultiSubs is a dataset of multilingual subtitles gathered from the OPUS OpenSubtitles dataset, which in turn was sourced from opensubtitles.org. We have supplemented some text fragments (visually salient nouns in this release) within the subtitles with web images, where the word sense of the fragment has been disambiguated using a cross-lingual approach. We have introduced a fill-in-the-blank task and a lexical translation task to demonstrate the utility of the dataset. Please refer to our paper for a more detailed description of the dataset and tasks. Multisubs will benefit research on visual grounding of words especially in the context of free-form sentence.

4 PAPERS • 5 BENCHMARKS

LSA16 (Lengua de Señas Argentina - 16 Handshapes classes)

This database contains images of 16 handshapes of the Argentinian Sign Language (LSA), each performed 5 times by 10 different subjects, for a total of 800 images. The subjects wore color hand gloves and dark clothes.

3 PAPERS • 1 BENCHMARK

MultiSense

MultiSense is a dataset of 9,504 images annotated with an English verb and its translation in Spanish and German.

3 PAPERS • NO BENCHMARKS YET

Tripadvisor Restaurant Reviews

Dataset of restaurant reviews from TripAdvisor that includes images and texts uploaded in reviews by users. Reviews in six different cities are included: Gijón (Spain), Barcelona (Spain), Madrid (Spain), New York City (USA), Paris (France) and London (United Kingdom). In the original publication, the following task is proposed: Can we explain, using the existing image or text from a different user, why a given restaurant was recommended to a certain user?

3 PAPERS • 6 BENCHMARKS

MuCo-VQA

MuCo-VQA consist of large-scale (3.7M) multilingual and code-mixed VQA datasets in multiple languages: Hindi (hi), Bengali (bn), Spanish (es), German (de), French (fr) and code-mixed language pairs: en-hi, en-bn, en-fr, en-de and en-es.

2 PAPERS • NO BENCHMARKS YET

GLAMI-1M (A Multilingual Image-Text Fashion Dataset)

We introduce GLAMI-1M: the largest multilingual image-text classification dataset and benchmark. The dataset contains images of fashion products with item descriptions, each in 1 of 13 languages. Categorization into 191 classes has high-quality annotations: all 100k images in the test set and 75% of the 1M training set were human-labeled. The paper presents baselines for image-text classification showing that the dataset presents a challenging fine-grained classification problem: The best scoring EmbraceNet model using both visual and textual features achieves 69.7% accuracy. Experiments with a modified Imagen model show the dataset is also suitable for image generation conditioned on text.

1 PAPER • 1 BENCHMARK

Datasets

10 dataset results for Images AND Spanish