Awesome Bigdata

rescrv/HyperDex

HyperDex is a scalable, searchable key-value store

C++ pushed 2024-05-21 2,439 commits first commit 2011-04-13 1 list mention

★ 1,405

Website ↗ GitHub ↗

Codecademy/EventHub

An open source event analytics platform

Java pushed 2022-04-05 274 commits first commit 2014-01-05 1 list mention

★ 1,337

Website ↗ GitHub ↗

twitter/fatcache

Memcache on SSD

C pushed 2021-11-01 42 commits first commit 2013-02-11 1 list mention archived

★ 1,300

GitHub ↗

uber-archive/AthenaX

SQL-based streaming analytics platform at scale

Java Maven pip #analytics#calcite#data#flink pushed 2020-06-21 19 commits first commit 2017-10-09 1 list mention archived

★ 1,224

GitHub ↗

twitter/elephant-bird

Twitter's collection of LZO and Protocol Buffer-related Hadoop, Pig, Hive, and HBase code.

Java pushed 2023-04-10 1,374 commits first commit 2010-03-25 1 list mention

★ 1,134

GitHub ↗

krotik/eliasdb

EliasDB is a graph-based database.

Go #cluster#clustering#database#embedded pushed 2022-08-14 91 commits first commit 2016-08-13 2 list mentions

★ 1,034

GitHub ↗

TIBCOSoftware/snappydata

Project SnappyData - memory optimized analytics database, based on Apache Spark™ and Apache Geode™. Stream, Transact, Analyze, Predict in one cluster

Scala Gradle pip #analytics#memory-database#scale#snappydata pushed 2022-11-21 4,151 commits first commit 2015-05-13 1 list mention

★ 1,034

Website ↗ GitHub ↗

facebookarchive/bistro

Bistro is a flexible distributed scheduler, a high-performance framework supporting multiple paradigms while retaining ease of configuration, management, and monitoring.

C++ CMake pushed 2023-03-21 9,415 commits first commit 2015-07-06 1 list mention archived

★ 1,026

Website ↗ GitHub ↗

etsy/411

An Alert Management Web Application

PHP #non-sox pushed 2023-04-09 431 commits first commit 2016-08-12 1 list mention archived

★ 968

Website ↗ GitHub ↗

twitter/twemcache

Twemcache is the Twitter Memcached

C pushed 2021-11-01 43 commits first commit 2012-07-10 1 list mention archived

★ 934

Website ↗ GitHub ↗

BIDData/BIDMach

CPU and GPU-accelerated Machine Learning Library

Scala pushed 2022-10-04 3,024 commits first commit 2012-10-22 2 list mentions

★ 919

GitHub ↗

nikolaypavlov/MLPNeuralNet

Fast multilayer perceptron neural network library for iOS and Mac OS X

Objective-C pushed 2016-09-30 141 commits first commit 2013-09-25 2 list mentions

★ 900

GitHub ↗

probcomp/BayesDB

A Bayesian database table for querying the probable implications of data as easily as SQL databases query the data itself. New implementation in http://github.com/probcomp/bayeslite

pushed 2015-09-24 3,142 commits first commit 2012-11-01 1 list mention

★ 889

Website ↗ GitHub ↗

allegro/hermes

Fast and reliable message broker built on top of Kafka.

Java #hacktoberfest#hermes#kafka#messaging pushed 2026-05-29 2,304 commits first commit 2015-05-15 1 list mention

★ 860

Website ↗ GitHub ↗

akumuli/Akumuli

Time-series database

C++ #c-plus-plus#database#metrics#time-series pushed 2022-08-07 2,362 commits first commit 2013-05-08 1 list mention archived

★ 840

Website ↗ GitHub ↗

Netflix/suro

Netflix's distributed Data Pipeline

Java pushed 2023-04-10 554 commits first commit 2012-03-18 1 list mention archived

★ 797

GitHub ↗

rakam-io/rakam-api

📈 Collect customer event data from your apps. (Note that this project only includes the API collector, not the visualization platform)

Java Maven #analytics#analytics-platform#bi-server#big-data pushed 2021-11-13 1,762 commits first commit 2014-01-16 1 list mention

★ 794

Website ↗ GitHub ↗

gazette/core

Build platforms that flexibly mix SQL, batch, and stream processing paradigms

Go #brokers#event-sourcing#golang#stream-processing pushed 2026-05-05 1,048 commits first commit 2015-03-09 1 list mention AI dev signals

★ 790

Website ↗ GitHub ↗

skizzehq/skizze

A probabilistic data structure service and storage

Go pushed 2016-05-10 439 commits first commit 2015-07-31 1 list mention

★ 772

GitHub ↗

jakekgrog/GhostDB

GhostDB is a distributed, in-memory, general purpose key-value data store that delivers microsecond performance at any scale.

Go #cache#database#datastore#distributed-database pushed 2021-03-10 72 commits first commit 2020-05-19 1 list mention

★ 752

Website ↗ GitHub ↗

benedekrozemberczki/littleballoffur

Little Ball of Fur - A graph sampling extension library for NetworKit and NetworkX (CIKM 2020)

Python #community-structure#deep-learning#forest-fire#graph pushed 2025-12-20 1,441 commits first commit 2020-05-03 3 list mentions

★ 714

Website ↗ GitHub ↗

CSNW/d3.compose

Compose complex, data-driven visualizations from reusable charts and components with d3

JavaScript pushed 2022-12-10 674 commits first commit 2014-03-07 1 list mention

★ 695

Website ↗ GitHub ↗

dalmatinerdb/dalmatinerdb

See gitlab: https://gitlab.com/Project-FiFo/DalmatinerDB/dalmatinerdb

Erlang pushed 2019-02-11 1,007 commits first commit 2014-06-13 1 list mention

★ 692

Website ↗ GitHub ↗

aklivity/zilla

🦎 A multi-protocol edge & service proxy. Seamlessly interface web apps, IoT clients, & microservices to Apache Kafka® via declaratively defined, stateless APIs.

Java #api-gateway#asyncapi#event-driven-architecture#event-stream-proxy pushed 2026-05-31 3,156 commits first commit 2021-12-07 1 list mention AI dev signals

★ 690

Website ↗ GitHub ↗

lucidworks/banana

Banana for Solr - A Port of Kibana

JavaScript Maven npm pushed 2026-05-28 1,442 commits first commit 2013-01-26 1 list mention

★ 671

GitHub ↗

rax-maas/blueflood

A distributed system designed to ingest and process time series data

Java Maven npm pushed 2024-08-19 3,152 commits first commit 2013-08-19 1 list mention

★ 598

Website ↗ GitHub ↗

LinkedInAttic/cleo

A flexible, partial, out-of-order and real-time typeahead search library

Java Gradle Maven pushed 2013-11-13 57 commits first commit 2011-12-14 1 list mention

★ 568

Website ↗ GitHub ↗

Netflix/PigPen

Map-Reduce for Clojure

Clojure pushed 2023-04-10 716 commits first commit 2012-03-18 2 list mentions

★ 565

GitHub ↗

nathanmarz/elephantdb

Distributed database specialized in exporting key/value data from Hadoop

Java pushed 2014-06-27 704 commits first commit 2010-07-09 1 list mention

★ 559

GitHub ↗

cbd/edis

An Erlang implementation of Redis

Erlang pushed 2015-09-14 436 commits first commit 2010-12-24 1 list mention

★ 519

Website ↗ GitHub ↗

SiriDB/siridb-server

SiriDB is a highly scalable, robust, and very fast time series database. Built from the ground up, SiriDB uses a unique mechanism to operate without a global index and allows server resources to be added on the fly. SiriDB's query language includes dynamic grouping of time series for easy analysis over large numbers of time series.

C pip #database#siridb#siridb-server#ticker-data pushed 2026-04-13 1,416 commits first commit 2016-03-30 1 list mention

★ 512

Website ↗ GitHub ↗

spring-attic/spring-xd

Spring XD makes it easy to solve common big data problems such as data ingestion and export, real-time analytics, and batch workflow orchestration

Java Spring Boot Gradle pushed 2022-04-04 2,770 commits first commit 2013-04-10 1 list mention archived

★ 477

Website ↗ GitHub ↗

twitter/storehaus

Storehaus is a library that makes it easy to work with asynchronous key value stores

Scala pushed 2020-07-17 960 commits first commit 2013-01-22 1 list mention

★ 465

GitHub ↗

shunfei/indexr

An open-source columnar data format designed for fast, real-time analytics on big data.

Java #columnar-storage#datawarehouse#indexr#olap pushed 2022-11-16 130 commits first commit 2016-12-23 1 list mention

★ 450

GitHub ↗

addthis/hydra

No description.

Java pushed 2020-07-01 2,958 commits first commit 2013-12-27 1 list mention archived

★ 436

GitHub ↗

deroproject/graviton

Graviton Database: ZFS for key-value stores.

Go pushed 2022-01-30 9 commits first commit 2020-09-04 1 list mention

★ 424

GitHub ↗

smooks/smooks

An extensible Java framework for building event-driven applications that break up XML and non-XML data into chunks for data integration

Java #analytics#chunking#enterprise-integration#etl pushed 2025-11-24 854 commits first commit 2010-12-28 1 list mention

★ 416

Website ↗ GitHub ↗

ZEPL/zeppelin

DEPRECATED. Zeppelin has moved to Apache. Please make pull requests there.

pushed 2017-07-05 1,571 commits first commit 2013-06-19 1 list mention

★ 405

Website ↗ GitHub ↗

brexhq/substation

Substation is a toolkit for routing, normalizing, and enriching security event and audit logs.

Go #automation#aws#logging#monitoring pushed 2026-01-20 250 commits first commit 2022-04-19 1 list mention

★ 400

Website ↗ GitHub ↗

skale-me/skale

High performance distributed data processing engine

JavaScript npm Yarn #aws-s3#azure-storage#cluster#machine-learning pushed 2021-05-29 1,718 commits first commit 2014-12-04 1 list mention archived

★ 397

Website ↗ GitHub ↗

NationalSecurityAgency/timely

Accumulo backed time series database

Java #accumulo#hacktoberfest#series-database#time-series pushed 2026-05-13 489 commits first commit 2016-04-21 1 list mention

★ 390

Website ↗ GitHub ↗

rayokota/kareldb

A Relational Database Backed by Apache Kafka

Java pushed 2025-10-15 633 commits first commit 2019-09-06 1 list mention

★ 388

GitHub ↗

xslogic/phoebus

Phoebus is a distributed framework for large scale graph processing written in Erlang.

Erlang pushed 2012-01-15 57 commits first commit 2010-09-24 1 list mention

★ 384

GitHub ↗

danielsdeleo/Decider

Flexible and Extensible Machine Learning in Ruby

Ruby pushed 2017-04-06 197 commits first commit 2009-07-04 1 list mention

★ 383

GitHub ↗

senseidb/zoie

Real-time search/indexing system

Java pushed 2022-12-15 524 commits first commit 2010-02-25 1 list mention

★ 371

Website ↗ GitHub ↗

etsy/Conjecture

Scalable Machine Learning in Scalding

Java #non-sox pushed 2018-02-16 155 commits first commit 2014-06-17 2 list mentions archived

★ 360

GitHub ↗

adobe-research/spindle

Next-generation web analytics processing with Scala, Spark, and Parquet.

JavaScript pushed 2015-03-28 118 commits first commit 2014-08-12 1 list mention

★ 330

Website ↗ GitHub ↗

radlab/sparrow

Sparrow scheduling platform (U.C. Berkeley).

Python pushed 2020-07-25 372 commits first commit 2012-01-21 1 list mention archived

★ 328

GitHub ↗

Hydrospheredata/mist

Serverless proxy for Spark cluster

Scala #apache-spark#api#big-data#serverless pushed 2026-04-13 2,020 commits first commit 2016-02-01 2 list mentions

★ 325

Website ↗ GitHub ↗

krestenkrab/hanoidb

Erlang LSM BTree Storage

Erlang pushed 2016-08-07 377 commits first commit 2012-01-04 1 list mention

★ 311

GitHub ↗
