Notebook Inaccessibility Dataset

This dataset artifact contains the intermediate datasets produced by the pipeline executions needed to reproduce the results of the paper.
We share this artifact to give other researchers a starting point for extending the analysis of notebooks, learning more about their accessibility, and proposing solutions that make data science more accessible. The scripts that generate and analyse these datasets are shared in the [GitHub repository](https://github.com/make4all/notebooka11y) for this work.

The dataset contains large files (approximately 60 GB when uncompressed), so please exercise caution when extracting the data from the compressed files.
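As a precaution before extracting, the uncompressed size of an archive can be compared with the free disk space. The following is a minimal sketch using only the Python standard library. The helper names and the 10% safety margin are illustrative assumptions and are not part of the dataset's own tooling.

```python
import shutil
import zipfile
from pathlib import Path


def uncompressed_size(archive_path):
    """Return the total uncompressed size, in bytes, of all members of a zip archive."""
    with zipfile.ZipFile(archive_path) as archive:
        return sum(info.file_size for info in archive.infolist())


def has_room_to_extract(archive_path, target_dir, safety_margin=1.1):
    """Return True if the filesystem holding target_dir has room for the archive contents.

    safety_margin is a hypothetical 10% cushion, not a value given by the dataset authors.
    """
    needed = uncompressed_size(archive_path) * safety_margin
    free = shutil.disk_usage(target_dir).free
    return free >= needed


if __name__ == "__main__":
    # Assumes the archive has been downloaded into the current directory.
    archive = Path("a11y-scan-dataset.zip")
    if archive.exists():
        print(f"Uncompressed size: {uncompressed_size(archive) / 1e9:.1f} GB")
        if has_room_to_extract(archive, "."):
            with zipfile.ZipFile(archive) as zf:
                zf.extractall(".")
        else:
            print("Not enough free disk space; extraction skipped.")
```

The size is read from the archive's central directory, so the check itself does not decompress any data.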

Some files in the dataset can take a significant amount of script run time to generate or reproduce.

### Dataset Contents

We briefly summarize the included files in our dataset. Please refer to the [documentation](https://github.com/make4all/notebooka11y/blob/main/pipeline/README.md) for specific information about the structure of the data in these files, the scripts to generate them, and runtimes for various parts of our data processing pipeline.

1. `epoch_9_loss_0.04706_testAcc_0.96867_X_resnext101_docSeg.pth`: We share this model file, originally provided by [Jobin et al.](https://github.com/jobinkv/DocFigure), to enable the classification of figures found in our dataset. Please place this into the `model/` [directory](https://github.com/make4all/notebooka11y/tree/main/model).
2. `model-results.csv`: This file contains results from the classification performed on the figures found in the notebooks in our dataset.
> Performing this classification may take up to a day.
3. `a11y-scan-dataset.zip`: This archive contains two files and expands to approximately 60 GB when extracted. Please ensure that you have sufficient disk space before uncompressing this zip archive. The archive contains:
    - `a11y/a11y-detailed-result.csv`: This dataset contains the accessibility scan results from the scans run on the 100k notebooks across themes.
        > The detailed result file is very large (> 60 GB) and can be time-consuming to construct.
    - `a11y/a11y-aggregate-scan.csv`: This file is an aggregate of the detailed result that contains the number of each type of error found in each notebook.
        > This file is also shared outside the compressed directory.
4. `errors-different-counts-a11y-analyze-errors-summary.csv`: This file contains the counts of errors that occur in notebooks across different themes.
5. `nb_processed_cell_html.csv`: This file contains metadata corresponding to each cell extracted from the HTML exports of our notebooks.
6. `nb_first_interactive_cell.csv`: This file contains the necessary metadata to compute the first interactive element, as defined in our paper, in each notebook.
7. `nb_processed.csv`: This file contains the data obtained by processing the notebooks and extracting the number of images, imports, languages, and cell-level information.
8. `processed_function_calls.csv`: This file contains information about the notebooks and the imports and function calls used within them.
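
Because some of the CSV files listed above are too large to load into memory at once, one option is to stream them row by row. The sketch below uses Python's standard `csv` module to count rows per value of a chosen column. The function name is illustrative, and the column name in the usage note is a hypothetical placeholder. The actual column names are described in the pipeline documentation linked above.

```python
import csv
from collections import Counter


def count_rows_by_column(csv_path, column):
    """Stream a CSV file row by row and count how often each value of `column` occurs."""
    counts = Counter()
    with open(csv_path, newline="", encoding="utf-8") as handle:
        reader = csv.DictReader(handle)
        # Fail early with a helpful message if the requested column is absent.
        if reader.fieldnames is None or column not in reader.fieldnames:
            raise KeyError(f"column {column!r} not found; available: {reader.fieldnames}")
        for row in reader:
            counts[row[column]] += 1
    return counts
```

For example, `count_rows_by_column("a11y/a11y-aggregate-scan.csv", "Theme")` would tally rows per theme, assuming the file has a column named `Theme` (an assumption made here for illustration only). Only one row is held in memory at a time, so the same approach applies to the detailed result file.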
