Contains hundreds of frontal view X-rays and is the largest public resource for COVID-19 image and prognostic data, making it a necessary resource to develop and evaluate tools to aid in the treatment of COVID-19.
30 PAPERS • NO BENCHMARKS YET
Yeast dataset consists of a protein-protein interaction network. Interaction detection methods have led to the discovery of thousands of interactions between proteins, and discerning relevance within large-scale data sets is important to present-day biology.
15 PAPERS • NO BENCHMARKS YET
The minimalist histopathology image analysis dataset (MHIST) is a binary classification dataset of 3,152 fixed-size images of colorectal polyps, each with a gold-standard label determined by the majority vote of seven board-certified gastrointestinal pathologists. MHIST also includes each image’s annotator agreement level. As a minimalist dataset, MHIST occupies less than 400 MB of disk space, and a ResNet-18 baseline can be trained to convergence on MHIST in just 6 minutes using approximately 3.5 GB of memory on a NVIDIA RTX 3090. As example use cases, the authors use MHIST to study natural questions that arise in histopathology image classification such as how dataset size, network depth, transfer learning, and high-disagreement examples affect model performance.
9 PAPERS • NO BENCHMARKS YET
2D HeLa is a dataset of fluorescence microscopy images of HeLa cells stained with various organelle-specific fluorescent dyes. The images include 10 organelles, which are DNA (Nuclei), ER (Endoplasmic reticulum), Giantin, (cis/medial Golgi), GPP130 (cis Golgi), Lamp2 (Lysosomes), Mitochondria, Nucleolin (Nucleoli), Actin, TfR (Endosomes), Tubulin. The purpose of the dataset is to train a computer program to automatically identify sub-cellular organelles.
5 PAPERS • NO BENCHMARKS YET
In the BB-norm modality of this task, participant systems had to normalize textual entity mentions according to the OntoBiotope ontology for habitats. See BB-dataset for more information.
5 PAPERS • 1 BENCHMARK
In the BB-norm modality of this task, participant systems had to normalize textual entity mentions according to the OntoBiotope ontology for phenotypes. See BB-dataset for more information.
The LIVECell (Label-free In Vitro image Examples of Cells) dataset is a large-scale microscopic image dataset for instance-segmentation of individual cells in 2D cell cultures.
4 PAPERS • 1 BENCHMARK
This work was undertaken by members of the Lincoln Centre for Autonomous Systems, University of Lincoln, UK. The four data collection sessions were conducted at three different sites in Lincolnshire, UK and one in Murcia, Spain (see Fig. 1). The sessions were conducted at the beginning and towards the end of harvesting season in UK and at the end of the harvest in Spain. The variety of broccoli plants grown in UK is called Iron Man whilst the variety grown in Spain is called Titanium.The weather during UK data capture included a mixture of different conditions including sunny, overcast and raining with broccoli varying in maturity levels from small to larger to already harvested, while the conditions for data capture in Spain included strong sunlight and mature plants at the very end of the harvesting season. The tractor was driven through the broccoli field at a slow walking speed with two rows of broccoli plants being imaged by the RGB-D sensor.
2 PAPERS • NO BENCHMARKS YET
The platelet-em dataset contains two 3D scanning electron microscope (EM) images of human platelets, as well as instance and semantic segmentations of those two image volumes. This data has been reviewed by NIBIB, contains no PII or PHI, and is cleared for public release. All files use a multipage uint16 TIF format. A 3D image with size [Z, X, Y] is saved as Z pages of size [X, Y]. Image voxels are approximately 40x10x10 nm
2 PAPERS • 2 BENCHMARKS
The complete blood count (CBC) dataset contains 360 blood smear images along with their annotation files splitting into Training, Testing, and Validation sets. The training folder contains 300 images with annotations. The testing and validation folder both contain 60 images with annotations. We have done some modifications over the original dataset to prepare this CBC dataset where some of the image annotation files contain very low red blood cells (RBCs) than actual and one annotation file does not include any RBC at all although the cell smear image contains RBCs. So, we clear up all the fallacious files and split the dataset into three parts. Among the 360 smear images, 300 blood cell images with annotations are used as the training set first, and then the rest of the 60 images with annotations are used as the testing set. Due to the shortage of data, a subset of the training set is used to prepare the validation set which contains 60 images with annotations.
The Focused Open Biology Information Extraction (FOBIE) dataset aims to support IE from Computer-Aided Biomimetics. The dataset contains ~1,500 sentences from scientific biological texts. These sentences are annotated with TRADE-OFFS and syntactically similar relations between unbounded arguments, as well as argument-modifiers.
NucMM is a dataset for segmenting 3D cell nuclei from microscopy image volumes that pushes the task forward to the sub-cubic millimeter scale. It consists of two fully annotated volumes: one electron microscopy (EM) volume containing nearly the entire zebrafish brain with around 170,000 nuclei; and one micro-CT (uCT) volume containing part of a mouse visual cortex with about 7,000 nuclei.
Overview This database of simulated arterial pulse waves is designed to be representative of a sample of pulse waves measured from healthy adults. It contains pulse waves for 4,374 virtual subjects, aged from 25-75 years old (in 10 year increments). The database contains a baseline set of pulse waves for each of the six age groups, created using cardiovascular properties (such as heart rate and arterial stiffness) which are representative of healthy subjects at each age group. It also contains 728 further virtual subjects at each age group, in which each of the cardiovascular properties are varied within normal ranges. This allows for extensive in silico analyses of haemodynamics and the performance of pulse wave analysis algorithms.
The dataset represents data generated from a commonly used model in population genetics. It comprises a matrix of 1,000,000 rows and 9 columns, representing parameters and summaries generated by an infinite-sites coalescent model for genetic variation. The first two columns encode the scaled mutation rate (theta) and scaled recombination rate (rho). The subsequent seven columns are data summaries: number of segregating sites (C1), standard uniform random noise acting as a distractor (C2), pairwise mean number of nucleotidic differences (C3), mean $R^2$ across pairs separated by <10% of the simulated genomic regions (C4), number of distinct haplotypes (C5), frequency of the most common haplotype (C6), number of singleton haplotypes (C7).
By releasing this dataset, we aim at providing a new testbed for computer vision techniques using Deep Learning. The main peculiarity is the shift from the domain of "natural images" proper of common benchmark dataset to biological imaging. We anticipate that the advantages of doing so could be two-fold: i) fostering research in biomedical-related fields - for which popular pre-trained models perform typically poorly - and ii) promoting methodological research in deep learning by addressing peculiar requirements of these images. Possible applications include but are not limited to semantic segmentation, object detection and object counting. The data consist of 283 high-resolution pictures (1600x1200 pixels) of mice brain slices acquired through a fluorescence microscope. The final goal is to individuate and count neurons highlighted in the pictures by means of a marker, so to assess the result of a biological experiment. The corresponding ground-truth labels were generated through a hy
CausalBench is a comprehensive benchmark suite for evaluating network inference methods on large-scale perturbational single-cell gene expression data. CausalBench introduces several biologically meaningful performance metrics and operates on two large, curated and openly available benchmark data sets for evaluating methods on the inference of gene regulatory networks from single-cell data generated under perturbations. The datasets consists of over 200000 training samples under interventions.
1 PAPER • NO BENCHMARKS YET
This dataset contains recordings of 32 sound producing insect species with a total 335 files and a length of 57 minutes. The dataset was compiled for training neural networks to automatically identify insect species while comparing adaptive, waveform-based frontends to conventional mel-spectrogram frontends for audio feature extraction. This work will be submitted for publication in the future and this dataset can be used to replicate the results, as well as other uses. The scripts for audio processing and the machine learning implementations will be published on Github.
The data used for all results in this paper can be found here. This directory contains:
The dataset X of this work is an extension of the heartSeg dataset. Each sample x ∈ X is an RGB image capturing the heart region of Medaka (Oryzias latipes) hatchlings from a constant ventral view. Since the body of Medaka is see-through, noninvasive studies regarding the internal organs and the whole circulatory system are practicable. A Medaka’s heart contains three parts: the atrium, the ventricle, and the bulbus. The atrium receives deoxygenated blood from the circulatory system and delivers it to the ventricle, which forwards it into the bulbus. The bulbus is the heart’s exit chamber and provides the gill arches with a constant blood flow. The blood flow through these three chambers was captured in 63 short recordings (around 11 seconds with 24 frames per second each) in total, from which the single image samples x ∈ X are extracted. The dataset is split into training and test data following the heartSeg dataset with ntrain = 565 samples in the training set Xtrain and ntest = 165
1 PAPER • 1 BENCHMARK
GO21 is a biomedical knowledge graph that models genes, proteins, drugs, and the hierarchy of the biological processes they participate in. It consists of 806,136 triples with 21 relations and 89127 entities. GO21 can be used for knowledge graph completion tasks (link prediction) as well as hierarchical reasoning tasks, such as ancestor-descendant prediction task proposed in the paper.
The H01 dataset is a 1.4 petabyte rendering of a small sample of human brain tissue, released by a collaboration between the Lichtman Laboratory at Harvard University and Google. The H01 sample was imaged at 4nm-resolution by serial section electron microscopy, reconstructed and annotated by automated computational techniques, and analyzed for preliminary insights into the structure of the human cortex.
This publicly available dataset contains 1613 RGB-D images of field-grown broccoli plants. The dataset also includes the polygon and circle annotations of the broccoli heads.
This data set contains weekly scans of cauliflower and broccoli covering a ten week growth cycle from transplant to harvest. The data set includes ground-truth, physical characteristics of the crop; environmental data collected by a weather station and a soil-senor network; and scans of the crop performed by an autonomous agricultural robot, which include stereo colour, thermal and hyperspectral imagery. The crop were planted at Lansdowne Farm, a University of Sydney agricultural research and teaching facility. Lansdowne Farm is located in Cobbitty, a suburb 70km south-west of Sydney in New South Wales (NSW), Australia. Four 80 metre raised crop beds were prepared with a North-South orientation. Approximately 144 Brassica were planted in each bed. Cauliflower were planted in the first and third bed (from west to east). Broccoli were planted in the second and fourth beds.
Marine Microalgae Detection in Microscopy Images dataset contains a total number of images in the dataset is 937 and all the objects in these images were annotated. The total number of annotated objects is 4201. The training set contains 537 images and the testing set contains 430 images.
TPM values together with cell type annotations that were obtained from Alex Pollen on 15/10/15
RxRx1 is a biological dataset designed specifically for the systematic study of batch effect correction methods. The dataset consists of 125,510 high-resolution fluorescence microscopy images of human cells under 1,138 genetic perturbations in 51 experimental batches across 4 cell types.
The ARPA-E funded TERRA-REF project is generating open-access reference datasets for the study of plant sensing, genomics, and phenomics. Sensor data were generated by a field scanner sensing platform that captures color, thermal, hyperspectral, and active flourescence imagery as well as three dimensional structure and associated environmental measurements. This dataset is provided alongside data collected using traditional field methods in order to support calibration and validation of algorithms used to extract plot level phenotypes from these datasets.
VISEM-Tracking is a dataset consisting of 20 video recordings of 30s of spermatozoa with manually annotated bounding-box coordinates and a set of sperm characteristics analyzed by experts in the domain. It is an extension of the previously published VISEM dataset. In addition to the annotated data, unlabeled video clips are provided for easy-to-use access and analysis of the data.
VesselGraph is a dataset of whole-brain vessel graphs based on specific imaging protocols. Specifically, vascular graphs are extracted using a refined graph extraction scheme leveraging the volume rendering engine Voreen and provided in an accessible and adaptable form through the OGB and PyTorch Geometric dataloaders.
We've made available several genome-wide datasets, which can be used for training microRNA (miRNA) classifiers. The hairpin sequences available are from the genomes of: Homo sapiens, Arabidopsis thaliana, Anopheles gambiae, Caenorhabditis elegans and Drosophila melanogaster. Hairpin.s are small RNA sequences that naturaly folds into a hairpin-structure. However, not all hairpins have clear function (they are not miRNAs).
0 PAPER • NO BENCHMARKS YET
The medaka (Oryzias latipes) and the zebrafish (Danio rerio) are used as a model organism for a variety of subjects in biomedical research. The presented work aims to study the potential of automated ventricular dimension estimation through heart segmentation in medaka. For more on this, it's time for a closer look on our paper and the supplementary materials.
Plankton was sampled with various nets, from bottom or 500m depth to the surface, in many oceans of the world. Samples were imaged with a ZooScan. The full images were processed with ZooProcess which generated regions of interest (ROIs) around each individual object and a set of associated features measured on the object (see Gorsky et al 2010 for more information). The same objects were re-processed to compute features with the scikit-image toolbox (http://scikit-image.org). The 1,433,278 resulting objects were sorted by a limited number of operators, following a common taxonomic guide, into 93 taxa, using the web application EcoTaxa (http://ecotaxa.obs-vlfr.fr).