Awesome List
An Awesome List for getting started with web archiving
GitHub stars and default-branch commits for iipc/awesome-web-archiving.
103 repos currently saved from this list.
Powerful yet simple to use screenshot software 🖥️ 📸
🗃 Open source self-hosted web archiving. Takes URLs/browser history/bookmarks/Pocket/Pinboard/etc., saves HTML, JS, PDFs, media, and more...
Web Extension for saving a faithful copy of a complete web page in a single HTML file
Fast key-value DB in Go.
⬛️ CLI tool and library for saving complete web pages as a single HTML file
Chrome Debugging Protocol interface for Node.js
💾 dn - offline full-text search and archiving for your Chromium-based browser.
fake keyboard/mouse input, window management, and more
Heritrix is the Internet Archive's open-source, extensible, web-scale, archival-quality web crawler project.
An archiving tool with an IM-style interface that prioritizes privacy and accessibility, integrated with various archival services including Internet Archive, archive.today, Ghostarchive, IPFS, Telegraph, and file systems.
A Python and Command-Line Interface to Archive.org
Core Python Web Archiving Toolkit for replay and recording of web archives
The archivist's web crawler: WARC output, dashboard for all crawls, dynamic ignore patterns
A command line tool (and Python library) for archiving Twitter JSON
Automatically archive links to videos, images, and social media content from Google Sheets (and more).
Run a high-fidelity browser-based web archiving crawler in a single Docker container
Tools for downloading and preserving wikis. We archive wikis, from Wikipedia to the tiniest wikis. As of 2026, WikiTeam has preserved more than 600,000 wikis.
brozzler - distributed browser-based web crawler
InterPlanetary Wayback: A distributed and persistent archive replay system using IPFS
Wget-compatible web downloader and crawler.
Wayback Machine API interface & a command-line tool
The OpenWayback Development
Streaming WARC/ARC library for fast web archive IO
WARC writing MITM HTTP/S proxy
A Tool To Push Web Resources Into Web Archives
Browsertrix is the hosted, high-fidelity, browser-based crawling service from Webrecorder designed to make web archiving easier and more accessible for all!
WarcDB: Web crawl data as SQLite databases.
🐋 Web Archiving Integration Layer: One-Click User Instigated Preservation
Website crawler with a built-in web interface for exploration and control
Zotero extension that combats link rot by archiving webpages and journal articles.
Go package and CLI tool for saving a web page as a single HTML file
Snapshots a web page to get it as a static, self-contained HTML document.
🍨 High-fidelity, browser-based, single-page web archiving library and CLI for witnessing the web.
Extract web archive data using Wayback Machine and Common Crawl
Command line tools and libraries for handling and manipulating WARC files (and HTTP contents)
Squidwarc is a high-fidelity, user-scriptable archival crawler that uses Chrome or Chromium with or without a head
Tool and library for handling Web ARChive (WARC) files.
An Apache Spark framework for easy data processing, extraction as well as derivation for web archives and archival collections, developed at Internet Archive.
The Archives Unleashed Toolkit is an open-source toolkit for analyzing web archives.
A search interface and wayback machine for the UKWA Solr-based warc-indexer framework.
A robust web archive analytics toolkit
Please note that the warc-indexer tool & code is now supported by NetArchiveSuite. The 'warc-indexer' directory and code that exists in this repo is now only for reference. For support and issues of 'warc-indexer', please communicate with NetArchiveSuite.
🐋 One-Click User Instigated Preservation
Parse and create Web ARChive (WARC) files with Node.js
📚 A compilation of research relevant to Data Together's efforts tackling the general problem of data resilience & interactivity
Offline-first web browser
A Memento Aggregator CLI and Server in Go
A command-line tool and Python library for archiving data from Facebook using the Graph API.
A collection of tools for archiving and analysing the internet.
Various Jupyter notebooks about Common Crawl data