Awesome List
An Awesome List for getting started with web archiving
GitHub stars and default-branch commits for iipc/awesome-web-archiving.
103 repos currently saved from this list.
Easy-to-use Web archiver
A dockerized, queued, high-fidelity web archiver based on Squidwarc
Java library for reading and writing WARC files with a typed API
:gear: A Rust library for reading and writing WARC files
Chrome extension that uses Memento to indicate that a page a user is viewing on the live web has an archived copy and to give the user access to the copy
Converts WARC files to static HTML
Convert HTTP Archive (HAR) -> Web Archive (WARC) format
NPM package and CLI tool for saving a web page as a single HTML file
A Rails engine supporting the discovery of web archives.
Tools for bulk indexing of WARC/ARC files on Hadoop, EMR or local file system.
A whirlwind tour of Common Crawl's data using Python
Web archiving using Google Chrome
Web archive index server based on RocksDB
Prototype Solr-powered web archive exploration UI.
Download and attach provenance to public datasets
Converts HTTrack crawls to WARC files
Command-line tool and Rust library for handling Web ARChive (WARC) files
No description.
Various examples of notebooks for working with web archives with the Archives Unleashed Toolkit, and derivatives generated by the Archives Unleashed Toolkit.
An easy-to-use and highly customizable crawler that enables you to create your own little Web archives (WARC/CDX)
Wget with Lua extension
Simple script to convert web resources to a single WARC file
Web Archiving Course
Create Robust Links from within Zotero
Web application for distributed compute analysis of Archive-It web archive collections.
DuckDB extension to fetch pages from Wayback Machine & Common Crawl
Go readers for the ARC and WARC web archive formats
A tool for detecting viruses and NSFW material in WARC files
No description.
A data retrieval & exploration protocol designed to investigate science and policy processes
Internet Archive's Sparkling Data Processing Library
A client for the Archive-It And Webrecorder WASAPI Data Transfer API
Object Resource Stream and CDXJ Drafts
A tool for exploring, analyzing, transforming, recombining, and extracting data from WARC (Web ARChive) files.
No description.
Playback webpages from Wayback Machine
Tika based link (URL) extractor for httpreserve
A Tool to Summarize Web Archive Holdings
The UKWA Heritrix3 custom modules and Docker builder.
Web archive deduplication tools
An open-source toolkit for analyzing line-oriented JSON Twitter archives with Apache Spark.
CLI implementation of httpreserve that can test links and retrieve internet archive replacements
No description.
A Splittable Hadoop InputFormat for Concatenated GZIP Files and *.(w)arc.gz
Java application to download WARCs from WASAPI
JWAT Tools
A Jupyter notebook illustrating the basics of Common Crawl's datasets.
DuckDB extension for parsing WARC files
Java Web Archive Toolkit
A whirlwind tour of Common Crawl's data using Java