Various URL Datasets (https://github.com/ada-url/url-various-datasets)

Introduced by Nizipli et al. in Parsing Millions of URLs per Second

These are collections of URLs for benchmarking purposes.

  • files/node_files.txt: all source files from a given Node.js snapshot, expressed as URLs (43415 URLs).
  • files/linux_files.txt: all files from a Linux system, expressed as URLs (169312 URLs).
  • wikipedia/wikipedia_100k.txt: 100k URLs drawn from a snapshot of all Wikipedia articles (March 6th, 2023).
  • others/kasztp.txt: test URLs from https://github.com/kasztp/URL_Shortener (MIT License) (48009 URLs).
  • others/userbait.txt: test URLs from https://github.com/userbait/phishing_sites_detector (unknown copyright) (11430 URLs).
  • top100/top100.txt: unique URLs extracted from a crawl of the 100 most visited websites.
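
As a rough illustration of how such files might be used for benchmarking, the sketch below loads one of them and times a parser over it. It assumes each file holds one URL per line, which this page does not state explicitly. Python's standard `urllib.parse.urlsplit` stands in for whichever parser is under test, and the helper names `load_urls` and `benchmark` are hypothetical, not part of the repository.

```python
import time
from urllib.parse import urlsplit


def load_urls(path):
    """Read URLs from a text file, assuming one URL per line (an assumption)."""
    with open(path, encoding="utf-8", errors="replace") as handle:
        # Skip blank lines and strip trailing newlines or stray whitespace.
        return [line.strip() for line in handle if line.strip()]


def benchmark(urls, parse=urlsplit):
    """Parse every URL once and return (parsed, failed, elapsed_seconds)."""
    parsed = failed = 0
    start = time.perf_counter()
    for url in urls:
        try:
            parse(url)
            parsed += 1
        except ValueError:
            # urlsplit raises ValueError on some malformed inputs,
            # such as an unbalanced "[" in the host.
            failed += 1
    elapsed = time.perf_counter() - start
    return parsed, failed, elapsed
```

With a local checkout, calling `benchmark(load_urls("files/node_files.txt"))` would return the number of URLs that parsed, the number that were rejected, and the elapsed time. Because the files may contain errors and duplicates, rejected URLs are counted separately instead of being treated as fatal.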

Disclaimer: This repository is developed and released for research purposes only.

  • This project reshares some publicly available datasets. When in doubt, investigate the copyright of the files you want to use.
  • There may be errors and duplicates in these files.

License


  • Unknown
