51 dataset results for Biology

Contains hundreds of frontal view X-rays and is the largest public resource for COVID-19 image and prognostic data, making it a necessary resource to develop and evaluate tools to aid in the treatment of COVID-19.

31 PAPERS • NO BENCHMARKS YET

Yeast

The Yeast dataset consists of a protein-protein interaction network. Interaction detection methods have led to the discovery of thousands of interactions between proteins, and discerning relevance within large-scale data sets is important to present-day biology.

17 PAPERS • NO BENCHMARKS YET

MHIST (Minimalist Histopathology image analysis dataset)

The minimalist histopathology image analysis dataset (MHIST) is a binary classification dataset of 3,152 fixed-size images of colorectal polyps, each with a gold-standard label determined by the majority vote of seven board-certified gastrointestinal pathologists. MHIST also includes each image’s annotator agreement level. As a minimalist dataset, MHIST occupies less than 400 MB of disk space, and a ResNet-18 baseline can be trained to convergence on MHIST in just 6 minutes using approximately 3.5 GB of memory on an NVIDIA RTX 3090. As example use cases, the authors use MHIST to study natural questions that arise in histopathology image classification, such as how dataset size, network depth, transfer learning, and high-disagreement examples affect model performance.

16 PAPERS • NO BENCHMARKS YET

LIVECell (Label-free In Vitro image Examples of Cells)

The LIVECell (Label-free In Vitro image Examples of Cells) dataset is a large-scale microscopic image dataset for instance-segmentation of individual cells in 2D cell cultures.

14 PAPERS • 1 BENCHMARK

FLIP (Fitness Landscape Inference for Proteins)

FLIP includes several benchmark datasets that contain a variety of protein sequences, each with a real-valued label indicating its "fitness" (how well the protein performs some particular function). The goal is to predict the fitness of a given protein sequence using the sequence. Different representations of protein sequences (e.g. learned embeddings from large language models) may prove helpful here.

9 PAPERS • NO BENCHMARKS YET

2D Hela

2D HeLa is a dataset of fluorescence microscopy images of HeLa cells stained with various organelle-specific fluorescent dyes. The images include 10 organelles, which are DNA (Nuclei), ER (Endoplasmic reticulum), Giantin (cis/medial Golgi), GPP130 (cis Golgi), Lamp2 (Lysosomes), Mitochondria, Nucleolin (Nucleoli), Actin, TfR (Endosomes), and Tubulin. The purpose of the dataset is to train a computer program to automatically identify sub-cellular organelles.

5 PAPERS • NO BENCHMARKS YET

BB-norm-habitat (Bacteria Biotope - entity normalization - bacterial habitat)

In the BB-norm modality of this task, participant systems had to normalize textual entity mentions according to the OntoBiotope ontology for habitats. See BB-dataset for more information.

5 PAPERS • 1 BENCHMARK

BB-norm-phenotype (Bacteria Biotope - entity normalization - phenotype)

In the BB-norm modality of this task, participant systems had to normalize textual entity mentions according to the OntoBiotope ontology for phenotypes. See BB-dataset for more information.

5 PAPERS • 1 BENCHMARK

CBC (Complete Blood Count)

The complete blood count (CBC) dataset contains 360 blood smear images with annotation files, split into training, testing, and validation sets. The training folder contains 300 annotated images; the testing and validation folders each contain 60 annotated images. We modified the original dataset to prepare this CBC dataset: some of the original annotation files listed far fewer red blood cells (RBCs) than were actually present, and one annotation file included no RBCs at all although the smear image contains RBCs. We therefore cleaned up all the erroneous files and split the dataset into three parts. Of the 360 smear images, 300 annotated images are used as the training set and the remaining 60 annotated images as the testing set. Because data are scarce, the validation set is a subset of the training set containing 60 annotated images.

5 PAPERS • NO BENCHMARKS YET

NucMM

NucMM is a dataset for segmenting 3D cell nuclei from microscopy image volumes that pushes the task forward to the sub-cubic millimeter scale. It consists of two fully annotated volumes: one electron microscopy (EM) volume containing nearly the entire zebrafish brain with around 170,000 nuclei; and one micro-CT (uCT) volume containing part of a mouse visual cortex with about 7,000 nuclei.

5 PAPERS • NO BENCHMARKS YET

RxRx1

RxRx1 is a biological dataset designed specifically for the systematic study of batch effect correction methods. The dataset consists of 125,510 high-resolution fluorescence microscopy images of human cells under 1,138 genetic perturbations in 51 experimental batches across 4 cell types.

5 PAPERS • NO BENCHMARKS YET

Multi-Label Classification Dataset Repository

For each dataset we provide a short description as well as some characterization metrics. These include the number of instances (m), number of attributes (d), number of labels (q), cardinality (Card), density (Dens), diversity (Div), average Imbalance Ratio per label (avgIR), ratio of unconditionally dependent label pairs by chi-square test (rDep) and complexity, defined as m × q × d as in [Read 2010]. Cardinality measures the average number of labels associated with each instance, and density is defined as cardinality divided by the number of labels. Diversity represents the percentage of labelsets present in the dataset divided by the number of possible labelsets. The avgIR measures the average degree of imbalance of all labels: the greater the avgIR, the greater the imbalance of the dataset. Finally, rDep measures the proportion of pairs of labels that are dependent at 99% confidence. A broader description of all the characterization metrics and the partition methods used are described in
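
As a rough illustration of the simpler metrics defined above, the following minimal sketch computes cardinality, density, diversity, and complexity from a binary label matrix. The function name `label_metrics`, the toy matrix `Y`, and the choice of 2^q as the "number of possible labelsets" are assumptions made for this example; the repository itself may normalize diversity differently.

```python
from typing import Dict, Sequence


def label_metrics(Y: Sequence[Sequence[int]], d: int) -> Dict[str, float]:
    """Compute simple multi-label characterization metrics.

    Y is an m x q binary indicator matrix (one row per instance,
    one column per label); d is the number of attributes.
    """
    m = len(Y)           # number of instances
    q = len(Y[0])        # number of labels
    # Cardinality: average number of labels per instance.
    card = sum(sum(row) for row in Y) / m
    # Density: cardinality divided by the number of labels.
    dens = card / q
    # Diversity: distinct labelsets observed / possible labelsets.
    # Assumption: "possible labelsets" is taken to be 2**q.
    distinct = len({tuple(row) for row in Y})
    div = distinct / (2 ** q)
    # Complexity as defined in the description: m x q x d.
    complexity = m * q * d
    return {"m": m, "q": q, "card": card, "dens": dens,
            "div": div, "complexity": complexity}


# Hypothetical toy data: 4 instances, 3 labels, 5 attributes.
Y = [
    [1, 0, 1],
    [0, 1, 0],
    [1, 0, 1],
    [1, 1, 0],
]
metrics = label_metrics(Y, d=5)
print(metrics)
```

avgIR and rDep are left out because they depend on details (the imbalance ratio definition and the chi-square test setup) that the description does not spell out.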

4 PAPERS • NO BENCHMARKS YET

CausalBench

CausalBench is a comprehensive benchmark suite for evaluating network inference methods on large-scale perturbational single-cell gene expression data. CausalBench introduces several biologically meaningful performance metrics and operates on two large, curated and openly available benchmark data sets for evaluating methods on the inference of gene regulatory networks from single-cell data generated under perturbations. The datasets consist of over 200,000 training samples under interventions.

3 PAPERS • NO BENCHMARKS YET

PWDB (Pulse Wave Database)

This database of simulated arterial pulse waves is designed to be representative of a sample of pulse waves measured from healthy adults. It contains pulse waves for 4,374 virtual subjects, aged 25 to 75 years (in 10-year increments). The database contains a baseline set of pulse waves for each of the six age groups, created using cardiovascular properties (such as heart rate and arterial stiffness) that are representative of healthy subjects in each age group. It also contains 728 further virtual subjects in each age group, in which each of the cardiovascular properties is varied within normal ranges. This allows for extensive in silico analyses of haemodynamics and the performance of pulse wave analysis algorithms.
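
The subject counts quoted above can be cross-checked with a short calculation. This sketch assumes one baseline subject per age group, which the description implies but does not state outright; the variable names are illustrative only.

```python
# Age groups: 25 to 75 years in 10-year increments.
ages = list(range(25, 76, 10))

# Assumption: each age group has exactly one baseline virtual subject.
BASELINE_PER_AGE = 1
# From the description: 728 further virtual subjects per age group.
VARIATIONS_PER_AGE = 728

total_subjects = len(ages) * (BASELINE_PER_AGE + VARIATIONS_PER_AGE)
print(ages, total_subjects)
```

Under that assumption the six age groups account for all 4,374 virtual subjects.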

3 PAPERS • NO BENCHMARKS YET

3D Datasets of Broccoli in the Field

This work was undertaken by members of the Lincoln Centre for Autonomous Systems, University of Lincoln, UK. The four data collection sessions were conducted at three different sites in Lincolnshire, UK and one in Murcia, Spain (see Fig. 1). The sessions were conducted at the beginning and towards the end of the harvesting season in the UK and at the end of the harvest in Spain. The variety of broccoli plants grown in the UK is called Iron Man, whilst the variety grown in Spain is called Titanium. The weather during the UK data capture included a mixture of conditions (sunny, overcast and raining), with broccoli varying in maturity from small to larger to already harvested, while the conditions for data capture in Spain included strong sunlight and mature plants at the very end of the harvesting season. The tractor was driven through the broccoli field at a slow walking speed with two rows of broccoli plants being imaged by the RGB-D sensor.

2 PAPERS • NO BENCHMARKS YET

3D Platelet EM (Platelet Electron Microscopy)

The platelet-em dataset contains two 3D scanning electron microscope (EM) images of human platelets, as well as instance and semantic segmentations of those two image volumes. This data has been reviewed by NIBIB, contains no PII or PHI, and is cleared for public release. All files use a multipage uint16 TIF format. A 3D image with size [Z, X, Y] is saved as Z pages of size [X, Y]. Image voxels are approximately 40x10x10 nm.

2 PAPERS • 2 BENCHMARKS

CREMP

CREMP is a resource generated for the rapid development and evaluation of machine learning models for macrocyclic peptides. CREMP contains 36,198 unique macrocyclic peptides and their high-quality structural ensembles generated using the Conformer-Rotamer Ensemble Sampling Tool (CREST).

2 PAPERS • NO BENCHMARKS YET

FOBIE (Focused Open Biological Information Extraction)

The Focused Open Biology Information Extraction (FOBIE) dataset aims to support IE from Computer-Aided Biomimetics. The dataset contains ~1,500 sentences from scientific biological texts. These sentences are annotated with TRADE-OFFS and syntactically similar relations between unbounded arguments, as well as argument-modifiers.

2 PAPERS • NO BENCHMARKS YET

H01

The H01 dataset is a 1.4 petabyte rendering of a small sample of human brain tissue, released by a collaboration between the Lichtman Laboratory at Harvard University and Google. The H01 sample was imaged at 4 nm resolution by serial section electron microscopy, reconstructed and annotated by automated computational techniques, and analyzed for preliminary insights into the structure of the human cortex.

2 PAPERS • NO BENCHMARKS YET

Summaries of genetic variation

The dataset represents data generated from a commonly used model in population genetics. It comprises a matrix of 1,000,000 rows and 9 columns, representing parameters and summaries generated by an infinite-sites coalescent model for genetic variation. The first two columns encode the scaled mutation rate (theta) and scaled recombination rate (rho). The subsequent seven columns are data summaries: number of segregating sites (C1), standard uniform random noise acting as a distractor (C2), pairwise mean number of nucleotide differences (C3), mean $R^2$ across pairs separated by <10% of the simulated genomic regions (C4), number of distinct haplotypes (C5), frequency of the most common haplotype (C6), and number of singleton haplotypes (C7).
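
To make the column layout concrete, here is a minimal sketch that splits one 9-value row into its two parameters and seven summaries. The key names, the `split_row` helper, and the example values are hypothetical; only the column order comes from the description above.

```python
from typing import Dict, Sequence, Tuple

# Column order as given in the description; the key names are illustrative.
PARAMETER_COLUMNS = ["theta", "rho"]
SUMMARY_COLUMNS = [
    "C1_segregating_sites",
    "C2_uniform_noise",          # distractor column
    "C3_mean_pairwise_differences",
    "C4_mean_r2",
    "C5_distinct_haplotypes",
    "C6_top_haplotype_frequency",
    "C7_singleton_haplotypes",
]


def split_row(row: Sequence[float]) -> Tuple[Dict[str, float], Dict[str, float]]:
    """Split a 9-value row into (parameters, summaries)."""
    expected = len(PARAMETER_COLUMNS) + len(SUMMARY_COLUMNS)
    if len(row) != expected:
        raise ValueError(f"expected {expected} values, got {len(row)}")
    n_params = len(PARAMETER_COLUMNS)
    params = dict(zip(PARAMETER_COLUMNS, row[:n_params]))
    summaries = dict(zip(SUMMARY_COLUMNS, row[n_params:]))
    return params, summaries


# Hypothetical row, not taken from the dataset.
example_row = [5.0, 2.0, 14.0, 0.37, 3.2, 0.08, 9.0, 0.25, 4.0]
params, summaries = split_row(example_row)
print(params)
print(summaries)
```

In a typical use, the summaries would be the inputs and the two parameters the targets of an inference method, with C2 serving as a check that the method ignores uninformative input.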

2 PAPERS • NO BENCHMARKS YET

fluocells (Fluorescent Neuronal Cells)

By releasing this dataset, we aim to provide a new testbed for computer vision techniques using Deep Learning. Its main peculiarity is the shift from the domain of "natural images", typical of common benchmark datasets, to biological imaging. We anticipate that the advantages of doing so could be two-fold: i) fostering research in biomedical-related fields, for which popular pre-trained models typically perform poorly, and ii) promoting methodological research in deep learning by addressing the peculiar requirements of these images. Possible applications include but are not limited to semantic segmentation, object detection and object counting. The data consist of 283 high-resolution pictures (1600x1200 pixels) of mouse brain slices acquired through a fluorescence microscope. The final goal is to identify and count the neurons highlighted in the pictures by means of a marker, so as to assess the result of a biological experiment. The corresponding ground-truth labels were generated through a hy

2 PAPERS • NO BENCHMARKS YET

neuronIO (Single cortical neuron (L5PC) input output simulation at 1ms temporal resolution)

Single cortical neurons as deep artificial neural networks. This dataset contains training and testing subsets of the input/output relationship of a single cortical layer 5 pyramidal cell (L5PC) neuron at 1ms single spike temporal resolution. The data is obtained via a simulation that contains all of the currently (2021) known and well modeled "messy biological details" that relate to the operation of single neurons in the brain.

2 PAPERS • 1 BENCHMARK

16s rDNA sequencing of feces from C9orf72 loss of function mice

In one round of sequencing, 5 fecal pellets from 2 pro-inflammatory environments (Harvard BRI/Johns Hopkins) and 2 pro-survival environments (Broad Institute/Jackson Labs) were sequenced at the 16S rDNA locus. In a second round of sequencing, 9 fecal pellets from Harvard BRI, 9 fecal pellets from Broad Institute, 6 fecal pellets from Harvard BRI mice transplanted with Harvard BRI feces, and 6 pellets from Harvard BRI mice transplanted with Broad feces were sequenced at the 16S rDNA locus.

1 PAPER • NO BENCHMARKS YET

3D-POP

The dataset is designed specifically to solve a range of computer vision problems (2D-3D tracking, posture) faced by biologists while designing behavior studies with animals.

1 PAPER • NO BENCHMARKS YET

ACCT Data Repository (ACCT is a fast and accessible automatic cell counting tool using machine learning for 2D image segmentation)

This dataset is a collection of fluorescent images from mice, assembled to test an automatic cell counting tool that we developed. It shows 62 images viewed from 2 or 3 different fields of view. In brief, the dataset was derived from brain sections of a model for HIV-induced brain injury (HIVgp120tg), which expresses soluble gp120 envelope protein in astrocytes under the control of a modified GFAP promoter. The mice were in a mixed C57BL/6.129/SJL genetic background, and two genotypes of 9-month-old male mice were selected: wild type controls (Resting, n = 3) and transgenic littermates (HIVgp120tg, Activated, n = 3). No randomization was performed. HIVgp120tg mice show, among other hallmarks of human HIV neuropathology, an increase in microglia numbers, which indicates activation of the cells compared to non-transgenic littermate controls.

1 PAPER • NO BENCHMARKS YET

AI-ready multiplex IHC-IF dataset (AI-ready restained and co-registered multiplex dataset for head-and-neck squamous cell carcinoma)

We introduce a new AI-ready computational pathology dataset containing restained and co-registered digitized images from eight head-and-neck squamous cell carcinoma patients. Specifically, the same tumor sections were stained with the expensive multiplex immunofluorescence (mIF) assay first and then restained with cheaper multiplex immunohistochemistry (mIHC). This is the first public dataset that demonstrates the equivalence of these two staining methods, which in turn allows several use cases; due to the equivalence, our cheaper mIHC staining protocol can offset the need for expensive mIF staining/scanning, which requires highly skilled lab technicians. As opposed to subjective and error-prone immune cell annotations from individual pathologists (disagreement > 50%) to drive SOTA deep learning approaches, this dataset provides objective immune and tumor cell annotations via mIF/mIHC restaining for more reproducible and accurate characterization of tumor immune microenvironment (e.g. for

1 PAPER • NO BENCHMARKS YET

ATUE

ATUE is an antibody study benchmark with four real-world supervised tasks covering therapeutic antibody engineering, B cell analysis, and antibody discovery.

1 PAPER • NO BENCHMARKS YET

Datasets for automatic acoustic identification of insects (Orthoptera and Cicadidae)

This dataset contains recordings of 32 sound-producing insect species, with a total of 335 files and a length of 57 minutes. The dataset was compiled for training neural networks to automatically identify insect species while comparing adaptive, waveform-based frontends to conventional mel-spectrogram frontends for audio feature extraction. This work will be submitted for publication in the future, and this dataset can be used to replicate the results as well as for other purposes. The scripts for audio processing and the machine learning implementations will be published on GitHub.

1 PAPER • NO BENCHMARKS YET

Drosophila Immunity Time-Course Data

The data used for all results in this paper can be found here. This directory contains:

1 PAPER • NO BENCHMARKS YET

Extended heartSeg

The dataset X of this work is an extension of the heartSeg dataset. Each sample x ∈ X is an RGB image capturing the heart region of Medaka (Oryzias latipes) hatchlings from a constant ventral view. Since the body of Medaka is see-through, noninvasive studies regarding the internal organs and the whole circulatory system are practicable. A Medaka’s heart contains three parts: the atrium, the ventricle, and the bulbus. The atrium receives deoxygenated blood from the circulatory system and delivers it to the ventricle, which forwards it into the bulbus. The bulbus is the heart’s exit chamber and provides the gill arches with a constant blood flow. The blood flow through these three chambers was captured in 63 short recordings (around 11 seconds with 24 frames per second each) in total, from which the single image samples x ∈ X are extracted. The dataset is split into training and test data following the heartSeg dataset with ntrain = 565 samples in the training set Xtrain and ntest = 165

1 PAPER • 1 BENCHMARK

FLIP -- AAV, Designed vs mutant (adeno-associated virus)

1 PAPER • NO BENCHMARKS YET

Facial Skeletal angles (Facial Skeletal Angles (Glabella and Maxilla Angle and Length and Width of Piriformis))

Facial Skeletal Angles (Glabella and Maxilla Angle and Length and Width of Piriformis)

1 PAPER • NO BENCHMARKS YET

GO21

GO21 is a biomedical knowledge graph that models genes, proteins, drugs, and the hierarchy of the biological processes they participate in. It consists of 806,136 triples with 21 relations and 89,127 entities. GO21 can be used for knowledge graph completion tasks (link prediction) as well as hierarchical reasoning tasks, such as the ancestor-descendant prediction task proposed in the paper.

1 PAPER • 1 BENCHMARK

Image-based size estimation of broccoli heads under varying degrees of occlusion

This publicly available dataset contains 1613 RGB-D images of field-grown broccoli plants. The dataset also includes the polygon and circle annotations of the broccoli heads.

1 PAPER • NO BENCHMARKS YET

Ladybird Cobbitty 2017 Brassica Dataset

This data set contains weekly scans of cauliflower and broccoli covering a ten-week growth cycle from transplant to harvest. The data set includes ground-truth physical characteristics of the crop; environmental data collected by a weather station and a soil-sensor network; and scans of the crop performed by an autonomous agricultural robot, which include stereo colour, thermal and hyperspectral imagery. The crops were planted at Lansdowne Farm, a University of Sydney agricultural research and teaching facility. Lansdowne Farm is located in Cobbitty, a suburb 70km south-west of Sydney in New South Wales (NSW), Australia. Four 80 metre raised crop beds were prepared with a North-South orientation. Approximately 144 Brassica were planted in each bed. Cauliflower were planted in the first and third beds (from west to east). Broccoli were planted in the second and fourth beds.

1 PAPER • NO BENCHMARKS YET

Marine Microalgae Detection in Microscopy Images

The Marine Microalgae Detection in Microscopy Images dataset contains a total of 937 images, and all the objects in these images were annotated. The total number of annotated objects is 4201. The training set contains 537 images and the testing set contains 430 images.

1 PAPER • NO BENCHMARKS YET

PS4

A dataset of 18,731 proteins with their PDB code, index of the first residue in their respective DSSP file, their residue sequence and 9-category secondary structure sequence (including polyproline helices).

1 PAPER • 1 BENCHMARK

Pollen et al

TPM values together with cell type annotations that were obtained from Alex Pollen on 15/10/15

1 PAPER • 1 BENCHMARK

PubChem18 (PubChem 2018)

We constructed a large public dataset extracted from PubChem (Kim et al., 2019; Preuer et al., 2018), an open chemistry database, and the largest collection of readily available chemical data. We take assays ranging from 2004 to 2018-05. It initially comprises 224,290,250 records of molecule-bioassay activity, corresponding to 2,120,854 unique molecules and 21,003 unique bioassays. We find that some molecule-bioassay pairs have multiple activity records, which may not all agree. We reduce every molecule-bioassay pair to exactly one activity measurement by applying majority voting. Molecule-bioassay pairs with ties are discarded. This step yields our final bioactivity dataset, which features 223,219,241 records of molecule-bioassay activity, corresponding to 2,120,811 unique molecules and 21,002 unique bioassays ranging from AID 1 to AID 1259411. Molecules range up to CID 132472079. The dataset has 3 di
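
The deduplication step described above (majority voting per molecule-bioassay pair, with tied pairs discarded) can be sketched as follows. This is an illustration under assumptions, not the authors' pipeline: the `majority_vote` helper, the `(molecule_id, assay_id, label)` record layout, and the `"active"`/`"inactive"` labels are hypothetical.

```python
from collections import Counter, defaultdict
from typing import Dict, Hashable, Iterable, Tuple

Record = Tuple[int, int, Hashable]  # (molecule_id, assay_id, activity label)


def majority_vote(records: Iterable[Record]) -> Dict[Tuple[int, int], Hashable]:
    """Reduce each (molecule, assay) pair to one label; drop tied pairs."""
    votes: Dict[Tuple[int, int], Counter] = defaultdict(Counter)
    for molecule_id, assay_id, label in records:
        votes[(molecule_id, assay_id)][label] += 1

    resolved = {}
    for pair, counts in votes.items():
        top_two = counts.most_common(2)
        # A tie exists when the two most frequent labels have equal counts.
        if len(top_two) == 2 and top_two[0][1] == top_two[1][1]:
            continue  # discard pairs without a clear majority
        resolved[pair] = top_two[0][0]
    return resolved


# Hypothetical records, not taken from PubChem.
records = [
    (1, 10, "active"), (1, 10, "active"), (1, 10, "inactive"),  # majority: active
    (2, 10, "active"), (2, 10, "inactive"),                     # tie: discarded
    (3, 11, "inactive"),                                        # single record
]
resolved = majority_vote(records)
print(resolved)
```

Discarding tied pairs removes whole molecule-bioassay pairs rather than single records, which matches the drop in record, molecule, and bioassay counts reported in the description.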

1 PAPER • NO BENCHMARKS YET

Stained mice brain blood vessels. Confocal-LFM

3D confocal stacks with corresponding 2D Light-field microscope images

1 PAPER • NO BENCHMARKS YET

TERRA-REF (TERRA-REF, An open reference data set from high resolution genomics, phenomics, and imaging sensors)

The ARPA-E funded TERRA-REF project is generating open-access reference datasets for the study of plant sensing, genomics, and phenomics. Sensor data were generated by a field scanner sensing platform that captures color, thermal, hyperspectral, and active fluorescence imagery as well as three-dimensional structure and associated environmental measurements. This dataset is provided alongside data collected using traditional field methods in order to support calibration and validation of algorithms used to extract plot level phenotypes from these datasets.

1 PAPER • NO BENCHMARKS YET

The EMBO SourceData-NLP dataset (The SourceData-NLP dataset: integrating curation into scientific publishing for training large language models)

We present the SourceData-NLP dataset produced through the routine curation of papers during the publication process. A unique feature of this dataset is its emphasis on the annotation of bioentities in figure legends. We annotate eight classes of biomedical entities (small molecules, gene products, subcellular components, cell lines, cell types, tissues, organisms, and diseases), their role in the experimental design, and the nature of the experimental method as an additional class. SourceData-NLP contains more than 620,000 annotated biomedical entities, curated from 18,689 figures in 3,223 papers in molecular and cell biology. We illustrate the dataset's usefulness by assessing BioLinkBERT and PubmedBERT, two transformer-based models, fine-tuned on the SourceData-NLP dataset for NER. We also introduce a novel context-dependent semantic task that infers whether an entity is the target of a controlled intervention or the object of measurement.

1 PAPER • 1 BENCHMARK

VISEM-Tracking

VISEM-Tracking is a dataset consisting of 20 video recordings of 30s of spermatozoa with manually annotated bounding-box coordinates and a set of sperm characteristics analyzed by experts in the domain. It is an extension of the previously published VISEM dataset. In addition to the annotated data, unlabeled video clips are provided for easy-to-use access and analysis of the data.

1 PAPER • NO BENCHMARKS YET

VesselGraph

VesselGraph is a dataset of whole-brain vessel graphs based on specific imaging protocols. Specifically, vascular graphs are extracted using a refined graph extraction scheme leveraging the volume rendering engine Voreen and provided in an accessible and adaptable form through the OGB and PyTorch Geometric dataloaders.

1 PAPER • NO BENCHMARKS YET

YIM Dataset (Yeast Cells in Microstructures Dataset)

An instance segmentation dataset of yeast cells in microstructures. The dataset includes 493 densely annotated microscopy images. For more information see the paper "An Instance Segmentation Dataset of Yeast Cells in Microstructures".

1 PAPER • NO BENCHMARKS YET

ALFI (Annotations for Label-Free Images)

ALFI (Annotations for Label-Free Images) is a dataset of images and annotations for label-free microscopy imaging. It consists of 29 time-lapse image sequences with various annotations (pixel-wise segmentation masks, object-wise bounding boxes, and tracking information), made publicly available to the scientific community through figshare.

0 PAPERS • NO BENCHMARKS YET

Genome-wide miRNA detection (Genome-wide hairpins datasets of animals and plants for novel miRNA prediction)

We've made available several genome-wide datasets, which can be used for training microRNA (miRNA) classifiers. The hairpin sequences available are from the genomes of: Homo sapiens, Arabidopsis thaliana, Anopheles gambiae, Caenorhabditis elegans and Drosophila melanogaster. Hairpins are small RNA sequences that naturally fold into a hairpin structure. However, not all hairpins have a clear function (that is, not all of them are miRNAs).

0 PAPERS • NO BENCHMARKS YET

HeartSeg

The medaka (Oryzias latipes) and the zebrafish (Danio rerio) are used as model organisms for a variety of subjects in biomedical research. The presented work aims to study the potential of automated ventricular dimension estimation through heart segmentation in medaka. For more details, see our paper and the supplementary materials.

0 PAPERS • NO BENCHMARKS YET