QM9 provides quantum chemical properties for a relevant, consistent, and comprehensive chemical space of small organic molecules. The database can be used to benchmark existing methods, to develop new ones (such as hybrid quantum mechanics/machine learning approaches), and to systematically identify structure-property relationships.
25 PAPERS • 3 BENCHMARKS
The PDBBind database provides a comprehensive collection of structures of protein-ligand complexes and their binding affinity data. Entries are selected from the original experimental data in the Protein Data Bank (PDB) according to certain quality requirements and are then curated for downstream applications.
17 PAPERS • 1 BENCHMARK
The Tox21 data set comprises 12,060 training samples and 647 test samples that represent chemical compounds. There are 801 "dense features" that represent chemical descriptors, such as molecular weight, solubility or surface area, and 272,776 "sparse features" that represent chemical substructures (ECFP10, DFS6, DFS8; stored in Matrix Market format). Machine learning methods can use the sparse features, the dense features, or a combination of both. For each sample there are 12 binary labels that represent the outcome (active/inactive) of 12 different toxicological experiments. Note that the label matrix contains many missing values (NAs). The original data source and Tox21 challenge site is https://tripod.nih.gov/tox21/challenge/.
13 PAPERS • 4 BENCHMARKS
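Because the Tox21 label matrix contains many missing values, per-task metrics are typically computed only over the entries that are actually labeled. The following is a minimal sketch of that idea, not the challenge's official evaluation code: the function name `masked_accuracy` and the toy data (3 tasks instead of 12) are assumptions made for illustration, and missing labels are assumed to be encoded as NaN.

```python
import math


def masked_accuracy(y_true, y_pred):
    """Per-task accuracy computed over labeled entries only.

    y_true: list of rows, one label per task (0, 1, or NaN when missing).
    y_pred: list of rows of predicted labels (0 or 1), same shape as y_true.
    Returns one accuracy per task, or None for a task with no labels at all.
    """
    n_tasks = len(y_true[0])
    results = []
    for t in range(n_tasks):
        correct = 0
        total = 0
        for row_true, row_pred in zip(y_true, y_pred):
            label = row_true[t]
            if math.isnan(label):
                continue  # skip missing labels instead of counting them as errors
            total += 1
            correct += int(label == row_pred[t])
        results.append(correct / total if total else None)
    return results


# Toy example (hypothetical data): 4 compounds, 3 tasks, NaN marks a missing label.
NA = math.nan
y_true = [
    [1, NA, 0],
    [0, 1, NA],
    [1, NA, NA],
    [NA, 0, NA],
]
y_pred = [
    [1, 0, 0],
    [1, 1, 1],
    [1, 1, 0],
    [0, 0, 1],
]
scores = masked_accuracy(y_true, y_pred)
print(scores)
```

The same masking applies to a training loss: entries with a missing label contribute nothing, so a compound tested in only a few of the 12 assays can still be used for those assays.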
QED is a linguistically principled framework for explanations in question answering. Given a question and a passage, QED represents an explanation of the answer as a combination of three discrete, human-interpretable steps: (1) sentence selection, the identification of a sentence implying an answer to the question; (2) referential equality, the identification of noun phrases in the question and the answer sentence that refer to the same thing; and (3) predicate entailment, the confirmation that the predicate in the sentence entails the predicate in the question once referential equalities are abstracted away. The QED dataset is an expert-annotated dataset of QED explanations built upon a subset of the Google Natural Questions dataset.
8 PAPERS • 1 BENCHMARK
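The three parts of a QED explanation map naturally onto a small record type. The sketch below is one possible in-memory representation and is not the dataset's actual file schema: every class, field, and helper name (`QEDExplanation`, `ReferentialEquality`, `span_of`, and so on) is hypothetical, as is the example question and passage.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

Span = Tuple[int, int]  # (start, end) character offsets, end exclusive


def span_of(text: str, phrase: str) -> Span:
    """Return the offsets of the first occurrence of phrase in text."""
    start = text.index(phrase)
    return (start, start + len(phrase))


@dataclass
class ReferentialEquality:
    """A noun phrase in the question paired with its co-referent in the sentence."""
    question_span: Span
    passage_span: Span


@dataclass
class QEDExplanation:
    """Hypothetical container for the three explanation steps plus the answer."""
    question: str
    passage: str
    selected_sentence: Span  # step 1: sentence selection
    answer_span: Span
    referential_equalities: List[ReferentialEquality] = field(default_factory=list)  # step 2
    predicate_entailed: bool = False  # step 3: annotator's entailment judgment

    def passage_text(self, span: Span) -> str:
        """Slice the passage by a span."""
        return self.passage[span[0]:span[1]]


# Made-up example for illustration only.
question = "who wrote the novel emma"
passage = "Emma is a novel by Jane Austen. It was first published in 1815."
sentence = "Emma is a novel by Jane Austen."

explanation = QEDExplanation(
    question=question,
    passage=passage,
    selected_sentence=span_of(passage, sentence),
    answer_span=span_of(passage, "Jane Austen"),
    referential_equalities=[
        ReferentialEquality(
            question_span=span_of(question, "the novel emma"),
            passage_span=span_of(passage, "Emma"),
        )
    ],
    predicate_entailed=True,
)
print(explanation.passage_text(explanation.answer_span))
```

Storing spans as offsets rather than copied strings keeps each step tied to its exact location in the question or passage, which matters when the same phrase occurs more than once.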
SIDER contains information on marketed medicines and their recorded adverse drug reactions. The information is extracted from public documents and package inserts. The available information includes side effect frequency, drug and side effect classifications, and links to further information, for example drug–target relations.
3 PAPERS • 2 BENCHMARKS