Evaluating Machine Accuracy on ImageNet

We perform an in-depth evaluation of human accuracy on the ImageNet dataset. First, three expert labelers re-annotated 30,000 images from the original ImageNet validation set and the ImageNetV2 replication experiment with multi-label annotations to enable a semantically coherent accuracy measurement. Then we evaluated five trained humans on both datasets. The median of the five labelers outperforms the best publicly released ImageNet model by 1.5% on the original validation set and by 6.2% on ImageNetV2. Moreover, the human labelers see a substantially smaller drop in accuracy between the two datasets compared to the best available model (less than 1% vs 5.4%). Our results put claims of superhuman performance on ImageNet in context and show that robustly classifying ImageNet at human-level performance is still an open problem.

No code implementations yet. Submit your code now



Results from the Paper

  Submit results from this paper to get state-of-the-art GitHub badges and help the community compare results to other papers.


No methods listed for this paper. Add relevant methods here