helgeho/ArchiveSpark

An Apache Spark framework for easy data processing, extraction, and derivation for web archives and archival collections, developed at the Internet Archive.

  • Stars: 161
  • Forks: 19
  • Commits: 154
  • Language: Scala
  • Awesome lists: 1

Similar repositories

archivesunleashed/aut

The Archives Unleashed Toolkit is an open-source toolkit for analyzing web archives.

157 stars
Scala 1 awesome list

helgeho/Web2Warc

An easy-to-use and highly customizable crawler that enables you to create your own little Web archives (WARC/CDX)

26 stars
Scala 1 awesome list

archivesunleashed/twut

An open-source toolkit for analyzing line-oriented JSON Twitter archives with Apache Spark.

10 stars
Scala 1 awesome list

internetarchive/arch

Web application for distributed compute analysis of Archive-It web archive collections.

20 stars
Scala 1 awesome list

Tracked growth

2 captures since 2026-05-23

Latest capture 2026-05-31 03:01

Stars history (chart: total stars)

Commits history (chart: default branch commits)

Metadata

  • Created: 2015-08-06
  • First commit: 2015-08-06
  • Last pushed: 2025-10-08
  • Archived: no
  • Stack detected: —
  • License: MIT

AI development signals

No AI development config files detected.