This page is archived. You can find the 2022 edition here: ML Reproducibility Challenge 2022

ML Reproducibility Challenge 2021

Welcome to the ML Reproducibility Challenge 2021 Fall Edition! This is the fifth edition of this event, succeeding the ML Reproducibility Challenge 2020 (and previous editions V1, V2, V3). This year we are excited to broaden our coverage to nine top venues of 2021: NeurIPS, ICML, ICLR, ACL-IJCNLP, EMNLP, CVPR, ICCV, AAAI and IJCAI.

The primary goal of this event is to encourage the publishing and sharing of scientific results that are reliable and reproducible. In support of this, the objective of this challenge is to investigate reproducibility of papers accepted for publication at top conferences by inviting members of the community at large to select a paper, and verify the empirical results and claims in the paper by reproducing the computational experiments, either via a new implementation or using code/data or other information provided by the authors.

All submitted reports will be peer reviewed via OpenReview and shown next to the original papers on Papers with Code. Every year, a small number of these reports, chosen for their clarity, thoroughness, correctness and insights, are selected for publication in a special edition of the journal ReScience (see J1, J2).

Decisions Announced for RC 2021
11th April, 2022

We are happy to announce the decisions for the ML Reproducibility Challenge 2021! We received 102 submissions, and this year marks yet another iteration of exceptionally high quality reproducibility reports. While we did have to desk reject several reports due to double-blind anonymity violations, formatting issues, and incorrect submissions, the quality of the submissions exceeded last year's standards by a proverbial mile. After an extensive peer review and meta review process, we are delighted to accept 43 reports to the program, all of which raise the bar in the standard and process of reproducibility effort in Machine Learning. You can view the list of accepted papers here in our OpenReview portal. All reviews and meta reviews will be made public, and authors of the accepted reports will be contacted separately on the next steps for the ReScience Journal editorial process. Congratulations to all!

Awards and Recognition

Starting this iteration, we have decided to award the best papers submitted to the Reproducibility Challenge to appreciate the excellent quality of all-round reproducibility effort. The selection criteria consisted of votes from the Area Chairs based on the reproducibility motivation, experimental depth, results beyond the original paper, ablation studies and discussion/recommendation for reproducibility. We believe the community will appreciate the strong reproducibility effort in these papers, which will improve the understanding of their original publications, and inspire authors to promote better reproducible science in their own work.

Best Paper Award

Reproducibility Study of “Counterfactual Generative Networks”, Piyush Bagad, Jesse Maas, Paul Hilders, Danilo de Goede, Forum, Original Paper (ICML 2021)
Scope of Reproducibility
In this work, we study the reproducibility of the paper Counterfactual Generative Networks (CGN) by Sauer and Geiger to verify their main claims, which state that (i) their proposed model can reliably generate high-quality counterfactual images by disentangling the shape, texture and background of the image into independent mechanisms, (ii) each independent mechanism has to be considered, and jointly optimizing all of them end-to-end is needed for high-quality images, and (iii) despite being synthetic, these counterfactual images can improve out-of-distribution performance of classifiers by making them invariant to spurious signals.
The authors of the paper provide the implementation of CGN training in PyTorch. However, they did not provide code for all experiments. Consequently, we re-implemented the code for most experiments, and ran each experiment on 1080 Ti GPUs. Our reproducibility study comes at a total computational cost of 112 GPU hours.
We find that the main claims of the paper of (i) generating high-quality counterfactuals, (ii) utilizing appropriate inductive biases, and (iii) using them to instil invariance in classifiers, do largely hold. However, we found certain experiments that were not directly reproducible due to either inconsistency between the paper and code, or incomplete specification of the necessary hyperparameters. Further, we were unable to reproduce a subset of experiments on a large-scale dataset due to resource constraints, for which we compensate by performing those on a smaller version of the same dataset with our results supporting the general performance trend.
What was easy
The original paper provides an extensive appendix with implementation details and hyperparameters. Beyond that, the original code implementation was publicly accessible and well structured. As such, getting started with the experiments proved to be quite straightforward. The implementation included configuration files, download scripts for the pretrained weights and datasets, and clear instructions on how to get started with the framework.
What was difficult
Some of the experiments required severe modifications to the provided code. Additionally, some details required for the implementation are not specified in the paper or inconsistent with the specifications in the code. Lastly, in evaluating out-of-distribution robustness, getting the baseline model to work and obtaining numbers similar to those reported in the respective papers was challenging, partly due to baseline model inconsistencies within the literature.
Communication with original authors
We have reached out to the original authors to get clarifications regarding the setup of some of the experiments, but unfortunately, we received a late response and only a subset of our questions was answered.

Outstanding Paper Awards

[Re] Learning to count everything, Matija Teršek, Domen Vreš, Maša Kljun, Forum, Original Paper (CVPR 2021)
Scope of Reproducibility
The core finding of the paper is a novel architecture, FamNet, for handling the few-shot counting task. We examine its implementation in the provided code on GitHub and compare it to the theory in the original paper. The authors also introduce a data set with 147 visual categories, FSC-147, which we analyze. We try to reproduce the authors’ results on it and on the CARPK data set. Additionally, we test FamNet on a category-specific data set, JHU-CROWD++. Furthermore, we try to reproduce the ground truth density maps, the code for which is not provided by the authors.
We use the combination of the authors’ and our own code, for parts where the code is not provided (e.g., generating ground truth density maps, CARPK data set preprocessing). We also modify some parts of the authors’ code so that we can evaluate the model on various data sets. For running the code we used the Quadro RTX 5000 GPU and had a total computation time of approximately 50 GPU hours.
We could not reproduce the density maps, but we produced similar density maps by modifying some of the parameters. We exactly reproduced the results on the paper’s data set. We did not get the same results on the CARPK data set and in experiments where implementation details were not provided. However, the differences are within standard error and our results support the claim that the model outperforms the baselines.
What was easy
Running the pretrained models and the demo app was quite easy, as the authors provided instructions. It was also easy to reproduce the results on a given data set with a pretrained model.
What was difficult
It was difficult to verify the ground truth density map generation as the code was not provided and the process was incorrectly described. Obtaining a performant GPU was also quite a challenge and it took quite a few emails to finally get one. This also meant that we were unable to reproduce the training of the model.
Communication with original authors
We contacted the authors three times through issues on GitHub. They were helpful and responsive, but we have not resolved all of the issues.

[RE] An Implementation of Fair Robust Learning, Ian Hardy, Forum, Original Paper (ICML 2021)
Scope of Reproducibility
This work attempts to reproduce the results of the 2021 ICML paper "To be Robust or to be Fair: Towards Fairness in Adversarial Training." I first reproduce classwise accuracy and robustness discrepancies resulting from adversarial training, and then implement the authors' proposed Fair Robust Learning (FRL) algorithms for correcting this bias.
In the spirit of education and public accessibility, this work attempts to replicate the results of the paper from first principles using Google Colab resources. To account for the limitations imposed by Colab, a much smaller model and dataset are used. All results can be replicated in approximately 10 GPU hours, within the usual timeout window of an active Colab session. Serialization is also built into the example notebooks in the case of crashes to prevent too much loss, and serialized models are also included in the repository to allow others to explore the results without having to run hours of code.
This work finds that (1) adversarial training does in fact lead to classwise performance discrepancies not only in standard error (accuracy) but also in attack robustness, (2) these discrepancies exacerbate existing biases in the model, (3) upweighting the standard and robust errors of poorly performing classes during training decreased this discrepancy for both the standard error and robustness and (4) increasing the attack margin for poorly performing classes during training also decreased these discrepancies, at the cost of some performance. (1), (2) and (3) match the conclusions of the original paper, while (4) deviated in that it was unsuccessful in helping increase the robustness of the most poorly performing classes. Because the model and datasets used were totally different from the original paper's, it is hard to quantify the exact similarity of our results. Conceptually however, I find very similar conclusions.
What was easy
It was easy to identify the unfairness resulting from existing adversarial training methods and implement the authors' FRL (reweight) and FRL (remargin) approaches for combating this bias. The algorithm and training approaches are well outlined in the original paper, and are relatively accessible even for those with little experience in adversarial training.
What was difficult
Because of the resource limitations imposed, I was unable to successfully implement the suggested training process using the authors' specific model and dataset. Also, even with a smaller model and dataset it was difficult to thoroughly tune the hyperparameters of the model and algorithm.
Communication with original authors
I did not have contact with the authors during the process of this reproduction. I reached out for feedback once I had a draft of the report, but did not hear back.

Strategic classification made practical: reproduction, Guilly Kolkman, Maks Kulicki, Jan Athmer, Alex Labro, Forum, Original Paper (ICML 2021)
Scope of Reproducibility
In this work, the paper Strategic Classification Made Practical is evaluated through a reproduction study. The results from the reproduction examine whether the claims made in the paper are valid. We identified two main claims made by the authors that we attempt to reproduce. Those are as follows: 1. "We propose a novel learning framework for strategic classification that is practical, effective, and flexible. This allows for differentiation through strategic user responses, which supports end-to-end training." 2. "We propose several forms of regularization that encourage learned models to promote favorable social outcomes." We interpret practical, effective and flexible to mean that the model should work better on a variety of real-life problems than its non-strategic counterpart.
In this paper, the same code, datasets and hyperparameters were used as the original paper to reproduce the results. To further validate the claims from the original paper, we extended the original implementation to include an experiment that tests performance on a dataset containing both strategic (also referred to as gaming) and non-strategic users.
The reproduction of the original paper as well as the extended implementation were successful. We were able to reproduce the original results and examine the performance of the proposed model in an environment where strategic and non-strategic users are both present. Linear models seem to struggle with different proportions of strategic users, while the non-linear model (RNN) achieves good performance regardless of the proportion of strategic users.
What was easy
The codebase for the paper was available on GitHub which meant that we didn’t have to start from scratch. They also provided us with the original data. The codebase also came with the original results from the authors which meant that comparing the results was easy.
What was difficult
Although the code was available, documentation of the code was quite sparse. Therefore, it was hard to figure out what each part of the code did and made it difficult to interpret what the results actually meant at certain stages.
Communication with original authors
The University of Amsterdam communicated before the course with the authors about the datasets. While working on the reproduction we sent one email about clarification of their method and to request a missing dataset.

On the reproducibility of "Exacerbating Algorithmic Bias through Fairness Attacks", Andrea Lombardo, Matteo Tafuro, Tin Hadži Veljković, Lasse Becker-Czarnetzki, Forum, Original Paper (AAAI 2021)
Scope of Reproducibility
The paper presents two novel kinds of adversarial attacks against fairness: the IAF attack and the anchoring attacks. Our goal is to reproduce the five main claims of the paper. The first claim states that using the novel IAF attack we can directly control the trade-off between the test error and fairness bias metrics when attacking. Claims two to five suggest a superior performance of the novel IAF and anchoring attacks over the two baseline models. We also extend the work of the authors by implementing a different stopping method, which changes the effectiveness of some attacks.
To reproduce the results, we use the open-source implementation provided by the authors as the main resource, although many modifications were necessary. Additionally, we implement the two baseline attacks which we compare to the novel proposed attacks. Since the assumed classifier model is a support vector machine, it is not computationally expensive to train. Therefore, we used a modern local machine and performed all of the attacks on the CPU.
Due to many missing implementation details, it is not possible to reproduce the original results using the paper alone. However, in a specific setting motivated by the authors’ code (more details in section 3), we managed to obtain results that support 3 out of 5 claims. Even though the IAF and anchoring attacks outperform the baselines in certain scenarios, our findings suggest that the superiority of the proposed attacks is not as strong as presented in the original paper.
What was easy
The novel attacks proposed in the paper are presented intuitively, so even with the lack of background in topics such as fairness, we managed to easily grasp the core ideas of the paper.
What was difficult
The reproduction of the results requires many more details than are presented in the paper. Thus, we were forced to make many educated guesses regarding classifier details, defense mechanisms, and many hyperparameters. The authors also provide an open-source implementation of the code, but the code uses outdated dependencies and has many implementation faults, which made it hard to use as given.
Communication with original authors
Contact was made with the authors on two occasions. First, we asked for some clarifications regarding the provided environment. They promptly replied with lengthy answers, which allowed us to correctly run their code. Then, we requested additional details concerning the pre-processing of the datasets. The authors pointed at some of their previous projects, where we could find further information on the processing pipeline.

See all papers

See OpenReview for all papers and reviews.

Outstanding Reviewer Awards

Our program would not have been possible without the hard work and support of our reviewers. Thus, we would also like to honor them for their timely, high quality reviews which enabled us to curate high quality reproducibility reports.

  • Olivier Delalleau
  • Prithvijit Chattopadhyay
  • Pascal Lamblin
  • Samuel Albanie
  • Frederik Paul Nolte
  • Karan Shah
  • Olga Isupova
  • Maxime Wabartha
  • Leo M Lahti
  • Sunnie S. Y. Kim
  • David Rau
  • Kanika Madan
  • Cagri Coltekin
  • Tobias Uelwer
  • Alex Gu
  • Varun Sundar
  • Maxwell D Collins
  • Divyat Mahajan

We would also like to thank all of our reviewers and emergency reviewers for their timely reviews! We hope you will continue to support us in our quest to improve reproducibility in Machine Learning.

Courses Participated in RC2021 Fall Edition


Key dates for Fall 2021 Challenge

  • Announcement of the challenge: August 31st, 2021
  • Challenge goes LIVE: August 31st, 2021
  • Submission deadline (to be considered for peer review): February 4th, 2022 (11:59PM AOE)
  • Author Notification deadline for journal special issue: April 11th, 2022 (previously listed as May 15th, 2022)

Invitation to participate

The challenge is a great event for community members to participate in shaping scientific practices and findings in our field. We particularly encourage participation from:

  • Course instructors of advanced ML, NLP, CV courses, who can use this challenge as a course assignment or project.
  • Organizers of hackathons.
  • Members of ML developer communities.
  • ML enthusiasts everywhere!

How to participate

Top participating universities in RC2020

Contact Information

Organizing committee


  • Melisa Bok, Celeste Martinez Gomez, Mohit Uniyal, Parag Pachpute, Andrew McCallum (OpenReview / University of Massachusetts Amherst)
  • Nicolas Rougier, Konrad Hinsen (ReScience)