Awesome Web Archiving

An Awesome List for getting started with web archiving

iipc/awesome-web-archiving #awesome#awesome-list#webarchiving

List stars: 2,562
README repos: 105
Indexed repos: 103
List commits: 162
Forks: 193
Open issues: 9

Tracked list growth

GitHub stars and default-branch commits for iipc/awesome-web-archiving.

Latest scan 2026-06-03 10:49

Indexed repositories

103 repos currently saved from this list.

Latest repo push 2026-05-30

turicas/crau

Easy-to-use Web archiver

Python pushed 2026-04-13 76 commits first commit 2019-10-26 1 list mention

★ 64

GitHub ↗

peterk/warcworker

A dockerized, queued high fidelity web archiver based on Squidwarc

Python #archiving#high-fidelity-preservation#preservation#webarchives pushed 2024-07-09 34 commits first commit 2018-07-21 1 list mention

★ 62

GitHub ↗

iipc/jwarc

Java library for reading and writing WARC files with a typed API

Java pushed 2026-04-27 517 commits first commit 2015-09-21 1 list mention

★ 59

GitHub ↗

jedireza/warc

A Rust library for reading and writing WARC files

Rust #rust#rust-library#warc pushed 2024-11-27 39 commits first commit 2016-03-22 1 list mention

★ 59

Website ↗ GitHub ↗

machawk1/Mink

Chrome extension that uses Memento to indicate that a page a user is viewing on the live web has an archived copy and to give the user access to the copy

JavaScript #chrome#extension#internet-archive#memento pushed 2025-08-27 597 commits first commit 2014-01-17 1 list mention

★ 58

GitHub ↗

iipc/warc2html

Converts WARC files to static HTML

Java pushed 2025-09-18 9 commits first commit 2021-11-08 1 list mention

★ 56

GitHub ↗

webrecorder/har2warc

Convert HTTP Archive (HAR) -> Web Archive (WARC) format

Python pushed 2018-10-21 18 commits first commit 2017-03-16 1 list mention

★ 55

Website ↗ GitHub ↗

wabarc/cairn

NPM package and CLI tool for saving a web page as a single HTML file

TypeScript #archive#base64#cli#html pushed 2026-05-30 101 commits first commit 2020-10-09 1 list mention

★ 52

GitHub ↗

archivesunleashed/warclight

A Rails engine supporting the discovery of web archives.

Ruby #blacklight#discovery#rails#rails-engine pushed 2023-06-13 301 commits first commit 2017-08-03 1 list mention archived

★ 50

Website ↗ GitHub ↗

ikreymer/webarchive-indexing

Tools for bulk indexing of WARC/ARC files on Hadoop, EMR or local file system.

Python pushed 2017-12-04 47 commits first commit 2015-02-26 1 list mention

★ 47

GitHub ↗

commoncrawl/whirlwind-python

A whirlwind tour of Common Crawl's data using Python

Python #archive#python#tutorial#warc pushed 2026-04-13 33 commits first commit 2024-06-23 1 list mention

★ 45

GitHub ↗

PromyLOPh/crocoite

Web archiving using Google Chrome

Python #archiving#chrome-browser#devtools#warc pushed 2019-12-30 282 commits first commit 2017-11-17 1 list mention archived

★ 45

Website ↗ GitHub ↗

nla/outbackcdx

Web archive index server based on RocksDB

Java #wayback#web-archiving pushed 2026-05-01 588 commits first commit 2015-01-15 1 list mention

★ 43

GitHub ↗

ukwa/shine

Prototype Solr-powered web archive exploration UI.

JavaScript pushed 2020-06-03 748 commits first commit 2013-07-03 1 list mention archived

★ 43

Website ↗ GitHub ↗

harvard-lil/bag-nabit

Download and attach provenance to public datasets

Python pushed 2025-03-31 39 commits first commit 2024-11-28 1 list mention

★ 38

GitHub ↗

nla/httrack2warc

Converts HTTrack crawls to WARC files

Java #web-archiving pushed 2024-08-06 117 commits first commit 2017-10-23 1 list mention

★ 34

GitHub ↗

chfoo/warcat-rs

Command-line tool and Rust library for handling Web ARChive (WARC) files

Rust pushed 2025-06-02 81 commits first commit 2024-09-15 1 list mention

★ 31

GitHub ↗

webis-de/wasp

No description.

Java pushed 2022-10-14 57 commits first commit 2018-03-25 1 list mention

★ 28

GitHub ↗

archivesunleashed/notebooks

Various examples of notebooks for working with web archives with the Archives Unleashed Toolkit, and derivatives generated by the Archives Unleashed Toolkit.

Jupyter Notebook #juypter-notebook#notebooks#pyspark-notebook#python3 pushed 2022-12-05 185 commits first commit 2019-11-06 1 list mention

★ 26

GitHub ↗

helgeho/Web2Warc

An easy-to-use and highly customizable crawler that enables you to create your own little Web archives (WARC/CDX)

Scala pushed 2017-10-09 11 commits first commit 2016-01-29 1 list mention

★ 26

GitHub ↗

alard/wget-lua

Wget with Lua extension

C pushed 2015-12-17 3,507 commits first commit 1999-12-02 1 list mention

★ 24

GitHub ↗

steffenfritz/html2warc

Simple script to convert web resources to a single WARC file

Python pushed 2023-05-11 16 commits first commit 2015-01-19 1 list mention

★ 24

GitHub ↗

vphill/web-archiving-course

Web Archiving Course

pushed 2024-03-04 103 commits first commit 2022-02-22 1 list mention

★ 23

GitHub ↗

lanl/Zotero-Robust-Links-Extension

Create Robust Links from within Zotero

JavaScript #link-rot#memento#reference-rot#references pushed 2022-05-10 94 commits first commit 2019-08-29 1 list mention

★ 22

GitHub ↗

internetarchive/arch

Web application for distributed compute analysis of Archive-It web archive collections.

Scala pushed 2026-03-24 569 commits first commit 2020-11-30 1 list mention

★ 20

GitHub ↗

midwork-finds-jobs/duckdb-web-archive

DuckDB extension to fetch pages from Wayback Machine & Common Crawl

C++ #archived#cdx-api#common-crawl#duckdb pushed 2026-02-03 68 commits first commit 2025-11-21 1 list mention AI dev signals

★ 20

GitHub ↗

richardlehane/webarchive

golang readers for ARC and WARC webarchive formats

Go pushed 2023-04-03 57 commits first commit 2015-09-21 1 list mention

★ 20

GitHub ↗

natliblux/warc-safe

A tool for detecting viruses and NSFW material in WARC files

Python #antivirus#nsfw-classifier#warc#warc-safe pushed 2026-05-20 30 commits first commit 2024-05-03 1 list mention

★ 18

GitHub ↗

nlnwa/gowarcserver

No description.

Go pushed 2025-03-31 372 commits first commit 2021-02-02 1 list mention

★ 17

GitHub ↗

Guillaume-Levrier/PANDORAE

A data retrieval & exploration protocol designed to investigate science and policy processes

JavaScript pushed 2026-05-28 1,514 commits first commit 2018-12-10 1 list mention

★ 16

Website ↗ GitHub ↗

internetarchive/Sparkling

Internet Archive's Sparkling Data Processing Library

Scala pushed 2026-05-04 47 commits first commit 2022-04-28 1 list mention

★ 16

GitHub ↗

unt-libraries/py-wasapi-client

A client for the Archive-It And Webrecorder WASAPI Data Transfer API

Python pushed 2019-10-18 118 commits first commit 2017-08-10 1 list mention

★ 16

GitHub ↗

oduwsdl/ORS

Object Resource Stream and CDXJ Drafts

pushed 2018-11-28 20 commits first commit 2015-10-06 1 list mention

★ 15

GitHub ↗

harvard-lil/warcbench

A tool for exploring, analyzing, transforming, recombining, and extracting data from WARC (Web ARChive) files.

Python pushed 2025-07-30 348 commits first commit 2025-02-24 1 list mention

★ 14

GitHub ↗

emmadickson/unwarcit

No description.

Python pushed 2022-01-07 8 commits first commit 2021-12-13 1 list mention

★ 13

GitHub ↗

wabarc/playback

Playback webpages from Wayback Machine

Go #go#playback#wayback-machine#webpage pushed 2026-04-25 21 commits first commit 2021-04-17 1 list mention

★ 13

GitHub ↗

httpreserve/tikalinkextract

Tika based link (URL) extractor for httpreserve

HTML #archives#code4lib#digitalpreservation#httpreserve pushed 2025-04-26 51 commits first commit 2017-04-03 1 list mention

★ 11

GitHub ↗

oduwsdl/MementoMap

A Tool to Summarize Web Archive Holdings

Python #memento#mementomap#profiling#python pushed 2021-06-15 63 commits first commit 2019-01-20 1 list mention

★ 11

GitHub ↗

ukwa/ukwa-heritrix

The UKWA Heritrix3 custom modules and Docker builder.

Java pushed 2024-12-02 575 commits first commit 2013-06-06 1 list mention archived

★ 11

GitHub ↗

arcalex/warcrefs

Web archive deduplication tools

Java pushed 2018-10-18 39 commits first commit 2014-04-22 1 list mention

★ 10

GitHub ↗

archivesunleashed/twut

An open-source toolkit for analyzing line-oriented JSON Twitter archives with Apache Spark.

Scala #apache-spark#spark#spark-packages#tweets pushed 2026-03-17 41 commits first commit 2019-11-29 1 list mention

★ 10

GitHub ↗

httpreserve/linkstat

CLI implementation of httpreserve that can test links and retrieve internet archive replacements

Go #archives#cli#code4lib#digipres pushed 2024-11-21 18 commits first commit 2019-03-19 1 list mention

★ 10

Website ↗ GitHub ↗

web-archive-group/heritrix-walkthrough

No description.

Shell pushed 2016-06-10 6 commits first commit 2016-06-01 1 list mention

★ 10

GitHub ↗

helgeho/HadoopConcatGz

A Splitable Hadoop InputFormat for Concatenated GZIP Files and *.(w)arc.gz

Java #hadoop#spark#warc#web-archiving pushed 2018-02-07 55 commits first commit 2016-08-08 1 list mention

★ 9

GitHub ↗

sul-dlss-deprecated/wasapi-downloader

Java application to download WARCs from WASAPI

Java #application#infrastructure#java pushed 2025-11-17 348 commits first commit 2017-04-28 1 list mention

★ 7

GitHub ↗

netarchivesuite/jwat-tools

JWAT Tools

Java pushed 2023-12-13 141 commits first commit 2012-02-02 1 list mention

★ 5

GitHub ↗

commoncrawl/whirlwind-python-notebook

A Jupyter notebook illustrating the basics of Common Crawl's datasets.

Jupyter Notebook #aws#open-datasets#s3#sagemaker pushed 2025-12-31 7 commits first commit 2025-10-28 1 list mention

★ 4

GitHub ↗

midwork-finds-jobs/duckdb_warc

DuckDB extension for parsing WARC files

Rust pushed 2026-02-05 139 commits first commit 2024-09-16 1 list mention

★ 4

GitHub ↗

netarchivesuite/jwat

Java Web Archive Toolkit

Java pushed 2023-12-13 331 commits first commit 2011-10-04 1 list mention

★ 4

GitHub ↗

commoncrawl/whirlwind-java

A whirlwind tour of Common Crawl's data using Java

Java pushed 2026-04-20 17 commits first commit 2025-10-21 1 list mention

★ 3

GitHub ↗

Activity

Default branch: main
Last pushed: 2026-04-27
GitHub updated: 2026-06-03
Created: 2017-06-16
First commit: -
Last scanned: 2026-06-03 10:49
Watchers: 92

Indexed repo mix

Repo stars: 147,552
Repo forks: 11,123
Active: 97
Archived: 6

Languages

Python (31) Java (17) JavaScript (16) Go (9) Scala (6) Rust (5) TypeScript (4) Jupyter Notebook (3) C (2) C++ (2) HTML (2) Roff (1)
