Datasets
End of Term Datasets
The End of Term project is working with the Amazon Web Services (AWS) Open Data Sponsorship Program to host a copy of the 2004, 2008, 2012, 2016, 2020, and 2024 End of Term Datasets.
The work of inventorying, staging, and moving the data into AWS is ongoing, and more information will be provided here in the future.
The following datasets are currently partially available for use.
Dataset | WARC Files | Compressed WARC Size |
---|---|---|
EOT-2024 (upload in progress) | 812512 | 1492.8 TB |
EOT-2020 | 239811 | 266.04 TB |
EOT-2016 | 194683 | 139.3 TB |
EOT-2012 | 78509 | 41.42 TB |
EOT-2008 | 125704 | 15.32 TB |
EOT-2004 | 58977 | 6.42 TB |
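
For a rough sense of scale, the figures in the table can be totalled and turned into an average compressed size per WARC file. The following is a minimal sketch: the `DATASETS` dictionary and `summarize` function are illustrative names, not part of any End of Term tooling, and the conversion assumes decimal units (1 TB = 1000 GB), which the table does not specify. Because the EOT-2024 upload is still in progress, its row and the totals may change.

```python
# Figures copied from the table above: (WARC file count, compressed size in TB).
DATASETS = {
    "EOT-2024": (812512, 1492.8),
    "EOT-2020": (239811, 266.04),
    "EOT-2016": (194683, 139.3),
    "EOT-2012": (78509, 41.42),
    "EOT-2008": (125704, 15.32),
    "EOT-2004": (58977, 6.42),
}


def summarize(datasets):
    """Return the total WARC count, total size in TB, and mean GB per WARC by dataset."""
    total_warcs = sum(count for count, _ in datasets.values())
    total_tb = sum(size for _, size in datasets.values())
    # Assumption: 1 TB = 1000 GB (decimal units); the table does not say which
    # convention it uses.
    mean_gb = {
        name: size * 1000 / count for name, (count, size) in datasets.items()
    }
    return total_warcs, total_tb, mean_gb


total_warcs, total_tb, mean_gb = summarize(DATASETS)
print(f"Total WARC files: {total_warcs}")
print(f"Total compressed size: {total_tb:.2f} TB")
for name, gb in mean_gb.items():
    print(f"{name}: about {gb:.2f} GB per WARC on average")
```

The per-dataset averages are only a planning aid for estimating download or storage needs; individual WARC files will vary in size.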
End of Term Web Crawls Collection
Additionally, crawl data is available from the Internet Archive via the End of Term Web Crawls collection.