End of Term 2008 Dataset

The End of Term 2008 Dataset contains data collected by four institutions: the California Digital Library (CDL), the Internet Archive (IA), the Library of Congress (LOC), and the University of North Texas Libraries (UNT). The data is part of the End of Term Presidential Web Archive initiative.

Archive Location and Download

The 2008 End of Term archive is stored in the eotarchive S3 bucket under the EOT-2008 prefix.

To help you explore and use the dataset, we provide gzipped files that list all segments and all WARC, WAT, WET, META, CDX, and URL index files.

Prepend either s3://eotarchive/ or https://eotarchive.s3.amazonaws.com/ to each line of a list file to get the S3 or HTTP path, respectively.
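As a rough illustration of the prefixing step, the sketch below decompresses a list file and prepends a prefix to every line. The function names `expand_paths` and `fetch_listing` are hypothetical helpers introduced here, not part of any official tooling, and the sketch assumes each list file holds one relative path per line and can be fetched over HTTP at the key shown for it in the table.

```python
import gzip
import io
import urllib.request

S3_PREFIX = "s3://eotarchive/"
HTTP_PREFIX = "https://eotarchive.s3.amazonaws.com/"


def expand_paths(gz_bytes, prefix):
    """Decompress a gzipped list file and prepend `prefix` to each non-empty line.

    Assumption: the list file contains one relative path per line.
    """
    with gzip.open(io.BytesIO(gz_bytes), "rt", encoding="utf-8") as fh:
        return [prefix + line.strip() for line in fh if line.strip()]


def fetch_listing(key):
    """Download a list file such as 'EOT-2008/warc.paths.gz' over HTTP.

    Hypothetical helper: assumes the list file itself is served at
    HTTP_PREFIX + key. Returns the raw gzipped bytes.
    """
    with urllib.request.urlopen(HTTP_PREFIX + key) as resp:
        return resp.read()
```

For example, `expand_paths(fetch_listing("EOT-2008/segment.paths.gz"), S3_PREFIX)` should return the segment paths as S3 URIs, and passing `HTTP_PREFIX` instead should give the matching HTTP URLs. The larger list files have over a hundred thousand entries, so it may be worth saving them to disk rather than downloading them repeatedly.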

| File Type | File List | #Files | Total Size (Compressed) |
|---|---|---|---|
| Segments | EOT-2008/segment.paths.gz | 14 | |
| WARC files | EOT-2008/warc.paths.gz | 125704 | 15.32 TB |
| WAT files | EOT-2008/wat.paths.gz | 125704 | 447.08 GB |
| WET files | EOT-2008/wet.paths.gz | 125704 | 108.1 GB |
| META files | EOT-2008/meta.paths.gz | 125704 | 68.49 GB |
| CDX files | EOT-2008/cdx.paths.gz | 125704 | 9.41 GB |
| URL Index files | EOT-2008/eot-index.paths.gz | 49 | 7 GB |